Simplex Frank-Wolfe: Linear Convergence and Its Numerical Efficiency for Convex Optimization over Polytopes

Haoning Wang Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China Houduo Qi Corresponding author: houduo.qi@polyu.edu.hk Department of Data Science and Artificial Intelligence, and Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong Liping Zhang Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
Abstract

We investigate variants of the Frank-Wolfe (FW) algorithm for smooth and strongly convex optimization over polyhedral sets, with the goal of designing algorithms that achieve linear convergence while keeping the per-iteration complexity as low as possible. Starting from the simple yet fundamental unit simplex, and based on geometrically intuitive motivations, we introduce a novel oracle called the Simplex Linear Minimization Oracle (SLMO), which can be implemented with the same complexity as the standard FW oracle. We then present two FW variants based on SLMO: the Simplex Frank-Wolfe (SFW) method and the refined Simplex Frank-Wolfe (rSFW) method. Both variants achieve a linear convergence rate for all three common step-size rules. Finally, we generalize the entire framework from the unit simplex to arbitrary polytopes. Furthermore, the refinement step in rSFW can accommodate existing FW strategies such as the well-known away-steps and pairwise steps, leading to outstanding numerical performance. We emphasize that the oracle used in our rSFW method requires only one more vector addition compared to the standard LMO, resulting in the lowest per-iteration computational overhead among all known Frank-Wolfe variants with linear convergence.

keywords: Frank-Wolfe algorithm, conditional gradient methods, linear convergence, convex optimization, first-order methods, linear programming

1 Introduction

Over the past decades, Frank-Wolfe (FW) algorithms [10] (a.k.a. conditional gradients [25]) have been extensively investigated due to their lower per-iteration complexity compared to projected or proximal gradient-based methods, in particular for large-scale machine learning applications and sparse optimization. This topic has been comprehensively covered in several recent publications including [3, 4, 27], [24, Chapter 7] and [2, Chapter 10], to name just a few. The key step in FW algorithms is the Linear Minimization Oracle (LMO). We refer to [23] for (worst-case) complexity analysis of general LMOs. One of the most often cited examples is the LMO over the unit simplex S_{n}:=\left\{{\boldsymbol{x}}\in\mathbb{R}^{n}\ |\ \sum x_{i}=1,\ {\boldsymbol{x}}\geq 0\right\}. Projection onto S_{n} is much more expensive than the LMO over S_{n}. Research efforts have focused on developing LMOs that lead to linear convergence while keeping the computation of each LMO call as low as possible. Therefore, the total computational complexity of a typical FW-type algorithm can be calculated as follows.

\mbox{Total Computation}=(\#\mbox{Iterations})\times\mbox{(Computation of LMOs per iteration)}.
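To make the cost gap concrete, here is a minimal numpy sketch (our own illustration, not part of the paper): the LMO over S_{n} only needs the index of the smallest component of the linear objective, whereas the standard Euclidean projection onto S_{n} requires a full sort of the input vector.

```python
import numpy as np

def lmo_simplex(c):
    """LMO over the unit simplex: argmin_{y in S_n} <c, y> is the vertex e_{i*}
    with i* = argmin_i c_i; a single O(n) scan."""
    y = np.zeros_like(c)
    y[np.argmin(c)] = 1.0
    return y

def project_simplex(x):
    """Euclidean projection onto S_n via the classical sort-based algorithm;
    the sort makes it O(n log n)."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - 1.0) / np.arange(1, len(x) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(x - theta, 0.0)

c = np.random.randn(1_000_000)
y = lmo_simplex(c)        # one pass over c
p = project_simplex(c)    # requires sorting c
```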

Note that some existing algorithms may require more than one LMO per iteration. The purpose of this paper is to propose a new LMO whose computational cost is arguably the cheapest among all existing algorithms. Furthermore, it guarantees a linear convergence rate comparable to the known ones for convex optimization over a polytope:

minf(𝒙)s.t.𝒙𝒫=Conv(𝒱),\min\ f({\boldsymbol{x}})\qquad\mbox{s.t.}\quad{\boldsymbol{x}}\in\mathcal{P}=\mbox{Conv}(\mathcal{V}), (1.1)

where 𝒱n\mathcal{V}\subseteq\mathbb{R}^{n} is a finite set of vectors that we call atoms. For the moment, we only assume f:𝒞f:\mathcal{C}\mapsto\mathbb{R} is convex and differentiable for the convenience of discussion below. Here, 𝒞\mathcal{C} is an open set containing 𝒫\mathcal{P}. Later, we will enforce strong convexity as well as other properties for our analysis.

1.1 Related Work

There are a large number of publications that directly or remotely motivated this work. We are only able to list a few of them below with some critical analysis. Given an LMO, the original FW algorithm [10] reads as follows:

\left\{\begin{array}[]{l}{\boldsymbol{y}}_{k}=\mbox{LMO}(\nabla f({\boldsymbol{x}}_{k-1}),\mathcal{P})\in\operatorname*{\arg\min}\left\{\langle\nabla f({\boldsymbol{x}}_{k-1}),{\boldsymbol{y}}\rangle\ |\ {\boldsymbol{y}}\in\mathcal{P}\right\},\\[0.86108pt] {\boldsymbol{x}}_{k}=(1-\delta_{k}){\boldsymbol{x}}_{k-1}+\delta_{k}{\boldsymbol{y}}_{k},\ \ \delta_{k}\in(0,1],\end{array}\right.

where δk\delta_{k} is a steplength often satisfying certain conditions [6, 18, 19]. One of the key advantages of the FW method over the well-known projected gradient method is its lower cost per iteration in many common scenarios, such as the simplex [6], flow polytope [7, 21], spectrahedron [18, 12], and nuclear norm ball [20]. This efficiency makes the FW method particularly advantageous for large-scale problems. Numerous studies [19, 23, 11] have demonstrated that the convergence rate of the FW method is 𝒪(1k)\mathcal{O}(\frac{1}{k}) and that this rate is generally not improvable, except for some special cases, e.g., when the optimal solution lies in the interior of the constraint set [16]. In fact, there exist examples for which the convergence rate of the FW method does not improve even when the objective function is strongly convex, see [19, 23].

Therefore, modifications to the original FW method must be made in order to achieve a linear convergence rate. Significant advances have been made along this line of research, and there now exist a large number of FW variants that enjoy a linear convergence rate, see [4, Chapter 3]. The well-known ones include the FW method with away-steps (AFW) and the pairwise FW (PFW) [22, 9, 13]. Most of those modified methods can be cast in the following framework:

\left\{\begin{array}[]{l}{\boldsymbol{y}}_{k}=\mbox{LMO}(\nabla f({\boldsymbol{x}}_{k-1}),\mathcal{P}_{k})\in\operatorname*{\arg\min}\left\{\langle\nabla f({\boldsymbol{x}}_{k-1}),{\boldsymbol{y}}\rangle\ |\ {\boldsymbol{y}}\in\mathcal{P}_{k}\right\},\\[0.86108pt] {\boldsymbol{g}}_{k}=\mbox{direction-correction satisfying certain conditions},\\[0.86108pt] {\boldsymbol{x}}_{k}=(1-\delta_{k}){\boldsymbol{x}}_{k-1}+\delta_{k}({\boldsymbol{y}}_{k}+{\boldsymbol{g}}_{k}),\ \ \delta_{k}\in(0,1],\end{array}\right. (1.2)

where \mathcal{P}_{k}\subseteq\mathcal{P} is a well-constructed convex subset of \mathcal{P} at the current iterate {\boldsymbol{x}}_{k-1}. This framework has the flexibility to accommodate further technical strategies. For example, one may mix {\boldsymbol{y}}_{k} and {\boldsymbol{g}}_{k} through certain combinations with some line-search strategies. Both AFW and PFW make use of such flexibility. One major concern is that the computation of the LMO over \mathcal{P}_{k} may be significantly higher than that over \mathcal{P}. This is the case when \mathcal{P} is a simplex-like polytope, including S_{n}.

In a significant development aiming to address this issue, Garber and Hazan [14] proposed the methodology of LLOO (Local Linear Optimization Oracle), where 𝒫k\mathcal{P}_{k} is the intersection of SnS_{n} and a 1\ell_{1}-ball:

𝒫k=SnB1(𝒙k1,dk),withB1(𝒙,d)={𝒚|𝒙𝒚1d}.\mathcal{P}_{k}=S_{n}\cap B_{1}({\boldsymbol{x}}_{k-1},d_{k}),\quad\mbox{with}\quad B_{1}({\boldsymbol{x}},d)=\left\{{\boldsymbol{y}}\ |\ \|{\boldsymbol{x}}-{\boldsymbol{y}}\|_{1}\leq d\right\}.

A key result is that the LMO over this \mathcal{P}_{k} is an LLOO, which is referred to as the \ell_{1}-LMO. Hence, linear convergence follows when the radius is exponentially reduced at each iteration under the strongly convex setting. We note that the framework of LLOO can be cast as a special case of the Shrinking Conditional Gradient Method (sCndG) of Lan [23, Eq. 3.34] and [24, Alg. 7.2], where an arbitrary norm is used. The LLOO framework does not require the step of {\boldsymbol{g}}_{k} in (1.2). To understand its actual performance, Fig. 2(a) in Sect. 5 illustrates its computational time in comparison to the LMO over the unit simplex as well as a projection algorithm.

It can be clearly observed that the time taken by the \ell_{1}-LMO is roughly the same as that of the projection method, but is significantly slower (e.g., 100\times slower when n gets large) than \mbox{LMO}({\boldsymbol{c}},S_{n}). There is a deeper reason behind this performance, and it is best appreciated geometrically by considering the case n=3. Fig. 1(a) illustrates the intersection of the \ell_{1}-ball with the unit simplex. Note that for any point {\boldsymbol{x}}\in S_{3}, the intersection of the \ell_{1}-ball with the hyperplane containing S_{3} forms a regular hexagon. As the center {\boldsymbol{x}} and radius d vary, the shape of the intersection of this regular hexagon with the unit simplex becomes more complex, as shown by the blue region in Fig. 1(a). This increased complexity of the constraint set makes solving the \ell_{1}-LMO more challenging. From a computational point of view, the \ell_{1}-LMO requires a sorting procedure [14] to handle this complexity and hence takes considerably more time.

We also observe that the LLOO/sCndG framework was largely omitted from the recent surveys [3, 4, 27], probably for the following two reasons. One is the concern about the per-iteration computational cost discussed above. The other is the lack of flexibility in incorporating existing acceleration strategies such as away-steps. In this paper, we propose a new framework for constructing the subset \mathcal{P}_{k} that is not based on any norm. In the meantime, the per-iteration computational cost is reduced to arguably a minimum, and there is flexibility to include various acceleration strategies. We explain our framework below.

Figure 1: Schematic diagram of the feasible sets for solving the \ell_{1}-LMO (panel (a)) and SLMO (panel (b)) when n=3.

1.2 Simplex LMO and Simplex FW: a New Proposal

Ideally, we would like our subset \mathcal{P}_{k} to be like the simplex S_{n} so that linear optimization over it can be executed fast. We here introduce the simplex ball S({\boldsymbol{x}},d) with centroid {\boldsymbol{x}} and radius d>0 (detailed definition later). It coincides with the atom norm of the unit simplex as introduced by [5]. Moreover, the unit simplex S_{n} is itself a simplex ball. A very useful property is that the intersection of two simplex balls is again a simplex ball:

S(𝒙1,d1)S(𝒙2,d2)=S(𝒙3,d3),S({\boldsymbol{x}}_{1},d_{1})\cap S({\boldsymbol{x}}_{2},d_{2})=S({\boldsymbol{x}}_{3},d_{3}),

where 𝒙3{\boldsymbol{x}}_{3} and d3d_{3} can be cheaply computed from (𝒙i,di)({\boldsymbol{x}}_{i},d_{i}), i=1,2i=1,2. This property is illustrated in Fig. 1(b). Consequently, given 𝒙Sn{\boldsymbol{x}}\in S_{n}, a radius d>0d>0 and 𝒄n{\boldsymbol{c}}\in\mathbb{R}^{n}, we define the Simplex Linear Minimization Oracle SLMO(𝒙,d,𝒄)\mbox{SLMO}({\boldsymbol{x}},d,{\boldsymbol{c}}) by

SLMO(𝒙,d,𝒄)=argmin𝒚{𝒄,𝒚|𝒚SnS(𝒙,d)}.\mbox{SLMO}({\boldsymbol{x}},d,{\boldsymbol{c}})=\operatorname*{\arg\min}_{\boldsymbol{y}}\Big\{\langle{\boldsymbol{c}},{\boldsymbol{y}}\rangle\ |\ {\boldsymbol{y}}\in S_{n}\cap S({\boldsymbol{x}},d)\Big\}.

Since the constraint set is again a simplex ball, SLMO has a closed-form formula (see Alg. 1) and its complexity is roughly the same as that of \mbox{LMO}({\boldsymbol{c}},S_{n}). Furthermore, we prove that SLMO is an LLOO in Lemma 3.2. The consequence is that a linearly convergent algorithm can be developed by following the template in [14]. This part forms the first contribution of the paper.

Casting SLMO as an instance of LLOO does not benefit too much in terms of computational efficiency because, as is common practice in FW methods, the direction-correction step in (1.2) is essential for improving numerical performance. To accommodate this need, we make two additions. The first one is a flexible rule for updating the radius of the simplex ball. Any lower bound on the optimal objective value is permitted, and the lower bound produced by SLMO is one choice. We opt for the use of the best lower bound available at the current iterate. This leads to the Simplex Frank-Wolfe (SFW) method in Alg. 2, which is proved to be linearly convergent in Thm. 3.4.

The second addition makes use of an important observation that SLMO can be split into two parts. The first part is the construction of the simplex ball and the second part is linear optimization over the simplex ball. Linear optimization is much cheaper than construction of the simplex ball. It is therefore more economical to perform linear optimization a few more times for each constructed simplex ball:

𝒑kargmin𝒚{f(𝒚)|𝒚SnS(𝒙k1,dk1)}.{\boldsymbol{p}}_{k}\approx\operatorname*{\arg\min}_{{\boldsymbol{y}}}\Big\{f({\boldsymbol{y}})\ |\ {\boldsymbol{y}}\in S_{n}\cap S({\boldsymbol{x}}_{k-1},d_{k-1})\Big\}.

This {\boldsymbol{p}}_{k} functions like the direction-correction used in the general framework (1.2). This allows us to take advantage of existing FW algorithms. For example, AFW and PFW can be used for this part. This leads to our refined SFW method (rSFW) (see Alg. 3 and Alg. 6). We emphasize that the oracle used in our rSFW method requires only one additional vector addition compared to the standard LMO, so its computational complexity is arguably the cheapest among all existing FW-type algorithms. Our numerical experiments show that rSFW with away-steps and pairwise steps improves performance significantly. The resulting algorithmic scheme is hence different from the LLOO scheme, and we provide a complete convergence analysis. This part may be treated as our second contribution.

Our third contribution is the extension from the simplex case to the general polytope case. We will make use of some fundamental connections between them established in [14]. Since the simplex ball is not defined from any norm, some parts of the extension are highly non-trivial. In particular, the iteration complexity of the extended SFW depends only on the problem dimension n instead of the number of extreme points N of \mathcal{P}, see Thm. 4.4. Computationally, this can all be achieved for three popular polytopes: the hypercube, the flow polytope and the \ell_{1}-ball.

Our final part addresses implementation issues, including adaptive backtracking techniques for choosing the problem parameters L and \mu, and incorporating away-step and pairwise-step FW strategies into the SFW methods. Numerical experiments demonstrate that our SFW methods are highly competitive.

1.3 Organization

In the preceding discussion, we focused only on the framework (1.2) that may lead to linearly convergent FW methods. We avoided specifying the actual conditions on f because various conditions can ensure such a linear rate. In Section 2, we describe such conditions as well as some background on polytopes. We will explain the key concept of LLOO proposed in [14]. Section 3 contains the detailed development of SLMO and the resulting Simplex FW methods (SFW and rSFW) for the case \mathcal{P}=S_{n}. The extension to the general polytope case is conducted in Section 4. Lengthy proofs are moved to the Supplement for the benefit of readability. Section 5 reports some illustrative examples to demonstrate the advantage of the SFW methods over some existing algorithms. We conclude the paper in Section 6.

2 Notation and Background

2.1 Notation

We employ lower-case letters, bold lower-case letters, and capital letters to denote scalars, vectors, and matrices, respectively (e.g., x,𝒙x,{\boldsymbol{x}} and XX). For two column vectors 𝒙,𝒚n{\boldsymbol{x}},{\boldsymbol{y}}\in\mathbb{R}^{n}, max{𝒙,𝒚}\max\{{\boldsymbol{x}},{\boldsymbol{y}}\} is a new column vector that takes the component-wise maximum of 𝒙{\boldsymbol{x}} and 𝒚{\boldsymbol{y}}. The vector min{𝒙,𝒚}\min\{{\boldsymbol{x}},{\boldsymbol{y}}\} is similarly defined. For vectors, we denote the standard Euclidean norm by \|\cdot\| and the standard inner-product by ,\langle\cdot,\cdot\rangle. For a vector 𝒙n{\boldsymbol{x}}\in\mathbb{R}^{n}, a subset CnC\subseteq\mathbb{R}^{n}, and τ>0\tau>0, we define

𝒙+C:={𝒙+𝒚|𝒚C}andτC:={τ𝒚|𝒚C},{\boldsymbol{x}}+C:=\left\{{\boldsymbol{x}}+{\boldsymbol{y}}\ |\ {\boldsymbol{y}}\in C\right\}\qquad\mbox{and}\qquad\tau C:=\left\{\tau{\boldsymbol{y}}\ |\ {\boldsymbol{y}}\in C\right\},

where “:=:=” means “define”.

We let \mathbb{B}({\boldsymbol{x}},r) denote the Euclidean ball of radius r centered at {\boldsymbol{x}}. For matrices, we let \|\cdot\| denote the spectral norm. For a vector {\boldsymbol{x}}\in\mathbb{R}^{n}, we use x_{i} or x(i) to denote the i-th component. For a matrix A, we use A(i) to denote the i-th row of A. The vector {\boldsymbol{1}}_{n} represents the vector with all entries equal to 1, and {\boldsymbol{e}}_{i} is the i-th standard unit vector in \mathbb{R}^{n}, which takes value 1 at its i-th position and 0 elsewhere. Given a set \mathcal{V}, we denote its convex hull by \mbox{Conv}\{\mathcal{V}\}. For any positive integer n, we use the notation [n] to represent the set \{1,\dots,n\}. We use S_{n}:=\{{\boldsymbol{x}}\in\mathbb{R}^{n}\ |\ \sum_{i=1}^{n}x_{i}=1,\ {\boldsymbol{x}}\geq 0\} to denote the unit simplex.

2.2 Smoothness, Strong Convexity and Stepsizes

Throughout the paper, we will assume LL-smoothness and μ\mu-strong convexity of ff.

Definition 1 (Smooth function).

We say that a function f(𝐱):nf({\boldsymbol{x}}):\mathbb{R}^{n}\to\mathbb{R} is LL-smooth over a convex set 𝒫n\mathcal{P}\subset\mathbb{R}^{n}, if for every 𝐱,𝐲𝒫{\boldsymbol{x}},{\boldsymbol{y}}\in\mathcal{P} there holds

f(𝒚)f(𝒙)+𝒚𝒙,f(𝒙)+L2𝒙𝒚2.f({\boldsymbol{y}})\leq f({\boldsymbol{x}})+\langle{\boldsymbol{y}}-{\boldsymbol{x}},\nabla f({\boldsymbol{x}})\rangle+\frac{L}{2}\|{\boldsymbol{x}}-{\boldsymbol{y}}\|^{2}.
Definition 2 (Strongly convex function).

We say that a function f(𝐱):nf({\boldsymbol{x}}):\mathbb{R}^{n}\to\mathbb{R} is μ\mu-strongly convex over a convex set 𝒫n\mathcal{P}\subset\mathbb{R}^{n}, if for every 𝐱,𝐲𝒫{\boldsymbol{x}},{\boldsymbol{y}}\in\mathcal{P} there holds

f(𝒚)f(𝒙)+𝒚𝒙,f(𝒙)+μ2𝒙𝒚2.f({\boldsymbol{y}})\geq f({\boldsymbol{x}})+\langle{\boldsymbol{y}}-{\boldsymbol{x}},\nabla f({\boldsymbol{x}})\rangle+\frac{\mu}{2}\|{\boldsymbol{x}}-{\boldsymbol{y}}\|^{2}.

The above definition, combined with the first-order optimality conditions, implies that for a \mu-strongly convex function f, if {\boldsymbol{x}}^{*}=\operatorname*{\arg\min}_{{\boldsymbol{x}}\in\mathcal{P}}f({\boldsymbol{x}}), then for any {\boldsymbol{x}}\in\mathcal{P} we have

f(𝒙)fμ2𝒙𝒙2.f({\boldsymbol{x}})-f^{*}\geq\frac{\mu}{2}\|{\boldsymbol{x}}-{\boldsymbol{x}}^{*}\|^{2}. (2.1)

This property, while weaker than strong convexity, is what our linear convergence analysis actually relies on, rather than strong convexity itself. A natural generalization of this property is known as the quadratic growth property.

There are three popular step size strategies:

  1. Simple step size:

     \delta_{k}=2/(k+1),\quad k=1,\dots. (2.2)

  2. Line-search step size:

     \delta_{k}=\operatorname*{arg\,min}_{\delta\in[0,1]}f((1-\delta){\boldsymbol{x}}_{k-1}+\delta{\boldsymbol{y}}_{k}),\quad k=1,\dots. (2.3)

  3. Short step size:

     \delta_{k}=\min\left\{1,\frac{\langle\nabla f({\boldsymbol{x}}_{k-1}),{\boldsymbol{x}}_{k-1}-{\boldsymbol{y}}_{k}\rangle}{L\|{\boldsymbol{x}}_{k-1}-{\boldsymbol{y}}_{k}\|^{2}}\right\},\quad k=1,\dots. (2.4)

For the three step size strategies described above, it can be shown that the standard Frank-Wolfe method exhibits the following convergence rates. For a detailed proof, refer to the modern surveys by [19], [23] or [11].

Theorem 2.1.

Let \{{\boldsymbol{x}}_{k}\} be the sequence generated by the standard FW method with step size policy for \{\delta_{k}\} in (2.2), (2.3), or (2.4). Then, for k\geq 1, we have

f(𝒙k)ff(𝒙k),𝒙k𝒚k+12LD2/(k+1),f({\boldsymbol{x}}_{k})-f^{*}\leq\langle\nabla f({\boldsymbol{x}}_{k}),{\boldsymbol{x}}_{k}-{\boldsymbol{y}}_{k+1}\rangle\leq 2LD^{2}/(k+1), (2.5)

where 𝐲k+1=LMO(f(𝐱k),𝒫){\boldsymbol{y}}_{k+1}=\text{LMO}(\nabla f({\boldsymbol{x}}_{k}),\mathcal{P}) and DD is the diameter of 𝒫\mathcal{P}.
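As a concrete illustration of the standard FW method and the step-size rules above, the following minimal numpy sketch (our own toy example, not taken from the paper) minimizes a strongly convex quadratic over S_{n} with the short step size (2.4) and tracks the duality gap \langle\nabla f({\boldsymbol{x}}_{k}),{\boldsymbol{x}}_{k}-{\boldsymbol{y}}_{k+1}\rangle appearing in (2.5).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
Q = A.T @ A / n + np.eye(n)               # smooth and strongly convex quadratic
b = rng.standard_normal(n)
L = np.linalg.eigvalsh(Q).max()           # smoothness constant of f(x) = 0.5 x'Qx + b'x
grad = lambda x: Q @ x + b

def lmo_simplex(c):                       # LMO over S_n: the vertex e_{i*}, i* = argmin_i c_i
    y = np.zeros_like(c)
    y[np.argmin(c)] = 1.0
    return y

x = np.ones(n) / n                        # start at the barycenter of S_n
for k in range(1, 2001):
    g = grad(x)
    y = lmo_simplex(g)
    gap = g @ (x - y)                     # FW duality gap, cf. (2.5)
    if gap <= 1e-8:
        break
    delta = min(1.0, gap / (L * np.dot(x - y, x - y)))   # short step size (2.4)
    x = (1.0 - delta) * x + delta * y
print(k, gap)                             # the gap decays roughly like O(1/k)
```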

2.3 Quantities of Polytope

The quantities reviewed in this part are well defined and investigated in [14] and they are mainly used in the extension of SFW to polytopes.

Let 𝒫\mathcal{P} be a polytope described by linear equations and inequalities, specifically 𝒫={𝒙n|A1𝒙=𝒃1,A2𝒙𝒃2}\mathcal{P}=\{{\boldsymbol{x}}\in\mathbb{R}^{n}|A_{1}{\boldsymbol{x}}={\boldsymbol{b}}_{1},A_{2}{\boldsymbol{x}}\leq{\boldsymbol{b}}_{2}\}, where A2m×nA_{2}\in\mathbb{R}^{m\times n}. Without loss of generality, we assume that all rows of A2A_{2} have been scaled to possess a unit l2l_{2} norm. We denote the set of vertices of 𝒫\mathcal{P} as 𝒱(𝒫)\mathcal{V}(\mathcal{P}) and let N=|𝒱(𝒫)|N=|\mathcal{V}(\mathcal{P})| represent the number of vertices.

Next, we introduce several geometric parameters related to 𝒫\mathcal{P} that will naturally arise in our analysis. The Euclidean diameter of 𝒫\mathcal{P} is defined as D(𝒫)=max𝒙,𝒚𝒫𝒙𝒚D(\mathcal{P})=\max_{{\boldsymbol{x}},{\boldsymbol{y}}\in\mathcal{P}}\lVert{\boldsymbol{x}}-{\boldsymbol{y}}\rVert. We define

ξ(𝒫)=min𝒗𝒱(𝒫)(min{b2(j)A2(j)𝒗j[m],A2(j)𝒗<b2(j)}).\xi(\mathcal{P})=\min_{{\boldsymbol{v}}\in\mathcal{V}(\mathcal{P})}\left(\min\left\{{b}_{2}(j)-A_{2}(j){\boldsymbol{v}}\mid j\in[m],A_{2}(j){\boldsymbol{v}}<b_{2}(j)\right\}\right).

This means that for any inequality constraint defining the polytope and for a given vertex, that vertex either satisfies the constraint with equality or is at least ξ(𝒫)\xi(\mathcal{P}) units away from satisfying it with equality. Let r(A2)r(A_{2}) denote the row rank of A2A_{2}, and let 𝔸(𝒫)\mathbb{A}(\mathcal{P}) represent the set of all r(A2)×nr(A_{2})\times n matrices with linearly independent rows selected from the rows of A2A_{2}. We then define ψ(𝒫)=maxM𝔸(𝒫)M\psi(\mathcal{P})=\max_{M\in\mathbb{A}(\mathcal{P})}\|M\|. Finally, we introduce condition number of 𝒫\mathcal{P} as

η(𝒫)=ψ(𝒫)D(𝒫)/ξ(𝒫).\eta(\mathcal{P})=\psi(\mathcal{P})D(\mathcal{P})/\xi(\mathcal{P}). (2.6)

It is important to note that \eta(\mathcal{P}) is invariant under translation, rotation and scaling of the polytope \mathcal{P}. For convenience we use \mathcal{V},D,\xi,\psi,\eta without explicitly mentioning the polytope when \mathcal{P} is clear from the context. It is worth noting that in many relevant scenarios, particularly in cases where efficient algorithms exist for linear optimization over the given polytope, estimating the parameters D,\xi,\psi is often straightforward. This is particularly true for convex domains encountered in combinatorial optimization, such as flow polytopes, matching polytopes, and matroid polytopes, among others. Furthermore, our algorithm relies primarily on the parameters \eta and D.
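For concreteness, the quantities D, \xi, \psi and hence \eta can be evaluated directly from their definitions for a small polytope. The brute-force sketch below is our own illustration (it enumerates row subsets and vertex pairs, so it is only meant for tiny instances); the example at the end uses the unit square written with unit-norm inequality rows.

```python
import numpy as np
from itertools import combinations

def polytope_constants(A2, b2, vertices):
    """Brute-force evaluation of D, xi, psi and eta = psi * D / xi (tiny instances only)."""
    V = np.asarray(vertices, dtype=float)           # one vertex per row
    # Euclidean diameter D = max_{x,y in V} ||x - y||
    D = max(np.linalg.norm(u - v) for u in V for v in V)
    # xi: over all vertices, the smallest slack of a strictly inactive constraint
    slacks = b2[None, :] - V @ A2.T                 # shape (num_vertices, m)
    xi = min(s[s > 1e-12].min() for s in slacks)
    # psi: max spectral norm over r(A2)-row submatrices with linearly independent rows
    r = np.linalg.matrix_rank(A2)
    psi = 0.0
    for rows in combinations(range(A2.shape[0]), r):
        M = A2[list(rows), :]
        if np.linalg.matrix_rank(M) == r:
            psi = max(psi, np.linalg.norm(M, 2))
    return D, xi, psi, psi * D / xi

# Example: the unit square [0,1]^2 as A2 x <= b2 with unit-norm rows.
A2 = np.vstack([np.eye(2), -np.eye(2)])
b2 = np.array([1.0, 1.0, 0.0, 0.0])
verts = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(polytope_constants(A2, b2, verts))
```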

2.4 LLOO

A major concept proposed by Garber and Hazan [14, Def. 2.5] is LLOO. Consider the problem (1.1). We say a procedure 𝒜(𝒙,d,𝒄)\mathcal{A}({\boldsymbol{x}},d,{\boldsymbol{c}}), where 𝒙𝒫{\boldsymbol{x}}\in\mathcal{P}, d>0d>0, 𝒄n{\boldsymbol{c}}\in\mathbb{R}^{n}, is an LLOO with parameter ρ1\rho\geq 1 for polytope 𝒫\mathcal{P} if 𝒜(𝒙,d,𝒄)\mathcal{A}({\boldsymbol{x}},d,{\boldsymbol{c}}) returns a feasible point 𝒑𝒫{\boldsymbol{p}}\in\mathcal{P} such that

  • (i) \langle{\boldsymbol{y}},{\boldsymbol{c}}\rangle\geq\langle{\boldsymbol{p}},{\boldsymbol{c}}\rangle for all {\boldsymbol{y}}\in\mathbb{B}({\boldsymbol{x}},d)\cap\mathcal{P}, and

  • (ii) \|{\boldsymbol{x}}-{\boldsymbol{p}}\|\leq\rho d.

Suppose the optimal solution {\boldsymbol{x}}^{*} of (1.1) is contained in \mathbb{B}({\boldsymbol{x}},d) and the LLOO \mathcal{A}({\boldsymbol{x}},d,\nabla f({\boldsymbol{x}})) returns a feasible point {\boldsymbol{p}}\in\mathcal{P}. The convexity of f implies the following.

f(𝒙)\displaystyle f({\boldsymbol{x}}^{*}) f(𝒙)+f(𝒙),𝒙𝒙\displaystyle\geq f({\boldsymbol{x}})+\langle\nabla f({\boldsymbol{x}}),{\boldsymbol{x}}^{*}-{\boldsymbol{x}}\rangle
f(𝒙)+f(𝒙),𝒑𝒙(by 𝒙𝔹(𝒙,d) and Property LLOO(i))\displaystyle\geq f({\boldsymbol{x}})+\langle\nabla f({\boldsymbol{x}}),{\boldsymbol{p}}-{\boldsymbol{x}}\rangle\quad\mbox{(by ${\boldsymbol{x}}^{*}\in\mathbb{B}({\boldsymbol{x}},d)$ and Property LLOO(i))}

That is, the LLOO naturally provides a lower bound on the optimal objective value. Such lower bounds will be used in our updating scheme for the radius d. We also note that the LLOO \mathcal{A}({\boldsymbol{x}},d,\nabla f({\boldsymbol{x}})) often returns an optimal solution over a subset \mathcal{P}_{k}, which should be constructed in (1.2) bearing in mind its solution efficiency.

Given an available LLOO procedure, a general FW framework can be developed that enjoys a linear convergence rate over a general polytope \mathcal{P}, provided f is L-smooth and \mu-strongly convex, see [14, Thm. 4.1]. It is proved there that the \ell_{1}-LMO is an LLOO over the simplex polytope S_{n}. The framework is then extended to general polytopes. As we discussed in the Introduction, the \ell_{1}-LMO is much less efficient than the original LMO over the simplex polytope S_{n}. This is one of the motivations for us to develop the simplex LMO below.

3 Simplex FW Method

This section is solely devoted to the case of simplex polytope: Problem (1.1) with 𝒫=Sn\mathcal{P}=S_{n}. We will then extend the obtained results to general polytopes in the next section. We start with the introduction of simplex ball.

3.1 Simplex Ball and Simplex-based Linear Minimization Oracle

In this subsection, we will formally define the concept of the simplex ball and present some of its useful properties. Following this, we will introduce the Simplex-based Linear Minimization Oracle (SLMO) and provide an efficient algorithm for solving it.

Definition 3 (Simplex ball).

Let S0:=Sn1n𝟏n.S_{0}:=S_{n}-\frac{1}{n}{\boldsymbol{1}}_{n}. For any 𝐱n{\boldsymbol{x}}\in\mathbb{R}^{n} and d>0d>0, we define S(𝐱,d)S({\boldsymbol{x}},d) as the simplex ball of radius dd centered at 𝐱{\boldsymbol{x}} by

S(𝒙,d):=𝒙+(nd)S0={(𝒙d𝟏n)+nd𝝀𝝀Sn}.S({\boldsymbol{x}},d):={\boldsymbol{x}}+(nd)S_{0}=\Big\{({\boldsymbol{x}}-d{\boldsymbol{1}}_{n})+nd\boldsymbol{\lambda}\mid\boldsymbol{\lambda}\in S_{n}\Big\}. (3.1)

The following properties of the simplex ball are crucial to our development. The proof is moved to the Supplement A.

Lemma 3.1.

Given 𝐱Sn{\boldsymbol{x}}\in S_{n} and d>0d>0, we have

  1. (1)

    The unit simplex is a simplex ball, i.e., Sn=S(1n𝟏n,1n)S_{n}=S(\frac{1}{n}{\boldsymbol{1}}_{n},\frac{1}{n}). Moreover, we have

    S(𝒙,d)={𝒙+d𝒓|𝒓Conv{n𝐞i𝟏n:i[n]}}.S({\boldsymbol{x}},d)=\Big\{{\boldsymbol{x}}+d{\boldsymbol{r}}|{\boldsymbol{r}}\in\rm{Conv}\{n{\boldsymbol{e}}_{i}-{\boldsymbol{1}}_{n}:i\in[n]\}\Big\}. (3.2)
  2. (2)

    The intersection of two simplex balls, if nonempty, is again a simplex ball. In particular,

    SnS(𝒙,d)=S(𝒙^,d^)where{d^=i=1nmin{d,xi}nx^i=max{xi,d}+d^d,i[n].S_{n}\cap S({\boldsymbol{x}},d)=S(\widehat{{\boldsymbol{x}}},\widehat{d})\ \ \mbox{where}\ \ \left\{\begin{array}[]{l}\widehat{d}=\frac{\sum_{i=1}^{n}\min\{d,\;x_{i}\}}{n}\\ \widehat{x}_{i}=\max\{x_{i},d\}+\widehat{d}-d,\quad i\in[n].\end{array}\right. (3.3)

    Moreover, for 𝒙1,𝒙2Sn{\boldsymbol{x}}_{1},{\boldsymbol{x}}_{2}\in S_{n} and radius d1,d2>0d_{1},d_{2}>0 such that S(𝒙1,d1)S(𝒙2,d2)S({\boldsymbol{x}}_{1},d_{1})\cap S({\boldsymbol{x}}_{2},d_{2})\neq\emptyset, it holds

    S(𝒙1,d1)S(𝒙2,d2)=S(𝒙3,d3),S({\boldsymbol{x}}_{1},d_{1})\cap S({\boldsymbol{x}}_{2},d_{2})=S({\boldsymbol{x}}_{3},d_{3}),

    where

    d3\displaystyle d_{3} =1+i=1nmin{d1x1(i),d2x2(i)}n,\displaystyle=\frac{1+\sum_{i=1}^{n}\min\{d_{1}-x_{1}(i),d_{2}-x_{2}(i)\}}{n}, (3.4)
    x3(i)\displaystyle x_{3}(i) =max{x1(i)d1,x2(i)d2}+d3,i[n]\displaystyle=\max\{x_{1}(i)-d_{1},x_{2}(i)-d_{2}\}+d_{3},\ \ i\in[n]

    Consequently, we have d3min{d1,d2}d_{3}\leq\min\{d_{1},d_{2}\}.

  3. (3)

    The linear optimization over a simplex ball has the following closed-form solution:

    𝒚:=𝒙+(nd)(𝒆i𝟏nn)argmin𝒚S(𝒙,d)𝒄,𝒚withi=argmini[n]ci.\displaystyle{\boldsymbol{y}}^{*}:={\boldsymbol{x}}+(nd)\Big({\boldsymbol{e}}_{i^{*}}-\frac{{\boldsymbol{1}}_{n}}{n}\Big)\in\operatorname*{\arg\min}_{{\boldsymbol{y}}\in S({\boldsymbol{x}},d)}\ \langle{\boldsymbol{c}},{\boldsymbol{y}}\rangle\ \ \mbox{with}\ \ i^{*}=\operatorname*{\arg\min}_{i\in[n]}c_{i}.
  4. (4)

    The diameter of the simplex ball S(𝒙,d)S({\boldsymbol{x}},d) is 2nd\sqrt{2}nd, i.e., max𝒚1,𝒚2S(𝒙,d)𝒚1𝒚2=2nd.\max_{{\boldsymbol{y}}_{1},{\boldsymbol{y}}_{2}\in S({\boldsymbol{x}},d)}\lVert{\boldsymbol{y}}_{1}-{\boldsymbol{y}}_{2}\rVert=\sqrt{2}nd.

  5. (5)

    For any point 𝒚Sn{\boldsymbol{y}}\in S_{n}, if 𝒙𝒚d\lVert{\boldsymbol{x}}-{\boldsymbol{y}}\rVert\leq d, then 𝒚S(𝒙,d){\boldsymbol{y}}\in S({\boldsymbol{x}},d). Moreover, for any point 𝒚S(𝒙,d){\boldsymbol{y}}\in S({\boldsymbol{x}},d), we have 𝒚𝒙nd\lVert{\boldsymbol{y}}-{\boldsymbol{x}}\rVert\leq nd.

We now give a formal definition of our LMO based on simplex ball.

Definition 4 (SLMO).

Given a linear objective 𝐜n{\boldsymbol{c}}\in\mathbb{R}^{n}, radius d>0d>0 and a point 𝐱Sn{\boldsymbol{x}}\in S_{n}, a solution 𝐲SLMO(𝐱,d,𝐜){\boldsymbol{y}}^{*}\in\rm{SLMO}({\boldsymbol{x}},d,{\boldsymbol{c}}) is called simplex-based linear minimization oracle if it solves the following optimization problem

min𝒚,𝒄\displaystyle\min\ \langle{\boldsymbol{y}},{\boldsymbol{c}}\rangle\quad s.t.𝐲S(𝐱,d)Sn.\displaystyle\rm{s.t.}\quad{\boldsymbol{y}}\in S({\boldsymbol{x}},d)\cap S_{n}. (3.5)

We have proved in Lemma 3.1(5) that 𝔹(𝒙,d)S(𝒙,d)𝔹(𝒙,nd)\mathbb{B}({\boldsymbol{x}},d)\subseteq S({\boldsymbol{x}},d)\subseteq\mathbb{B}({\boldsymbol{x}},nd). Therefore, for any 𝒚𝔹(𝒙,d)Sn{\boldsymbol{y}}\in\mathbb{B}({\boldsymbol{x}},d)\cap S_{n}, we must have 𝒚,𝒄𝒚,𝒄.\langle{\boldsymbol{y}},{\boldsymbol{c}}\rangle\geq\langle{\boldsymbol{y}}^{*},{\boldsymbol{c}}\rangle. This is the first property of LLOO. Moreover, since both 𝒙,𝒚S(𝒙,d){\boldsymbol{x}},{\boldsymbol{y}}^{*}\in S({\boldsymbol{x}},d), we must have 𝒙𝒚ρd\|{\boldsymbol{x}}-{\boldsymbol{y}}^{*}\|\leq\rho d with ρ=n\rho=n. This leads to the following key result.

Lemma 3.2.

Given 𝐱𝒫{\boldsymbol{x}}\in\mathcal{P}, d>0d>0 and 𝐜n{\boldsymbol{c}}\in\mathbb{R}^{n} such that S(𝐱,d)SnS({\boldsymbol{x}},d)\cap{S}_{n}\not=\emptyset, then SLMO(𝐱,d,𝐜)\rm{SLMO}({\boldsymbol{x}},d,{\boldsymbol{c}}) is an LLOO 𝒜(𝒙,d,𝒄)\mathcal{A}({\boldsymbol{x}},d,{\boldsymbol{c}}) with ρ=n\rho=n.

The implication of this result is far-reaching because the framework developed in [14] can be followed to get a linearly convergent algorithm with SLMO. An even more important result is that SLMO problem (3.5) can be solved by the following simple algorithm.

Algorithm 1 SLMO(𝒙,d,𝒄)\mbox{SLMO}({\boldsymbol{x}},d,{\boldsymbol{c}})
0:  point 𝒙Sn{\boldsymbol{x}}\in S_{n}, linear objective 𝒄n{\boldsymbol{c}}\in\mathbb{R}^{n}, radius d>0d>0.
1:  d^i=1nmin{d,xi}n\widehat{d}\leftarrow\frac{\sum_{i=1}^{n}\min\{d,x_{i}\}}{n}
2:  𝒙^𝒙min{𝒙,d𝟏n}+d^𝟏n\widehat{{\boldsymbol{x}}}\leftarrow{\boldsymbol{x}}-\min\{{\boldsymbol{x}},d{\boldsymbol{1}}_{n}\}+\widehat{d}{\boldsymbol{1}}_{n}
3:  𝒚+𝒙^d^𝟏n{\boldsymbol{y}}_{+}\leftarrow\widehat{{\boldsymbol{x}}}-\widehat{d}{\boldsymbol{1}}_{n}
4:  iargmini[n]cii^{*}\leftarrow\operatorname*{\arg\min}_{i\in[n]}c_{i}
5:  𝒚𝒚++nd^𝒆i{\boldsymbol{y}}^{*}\leftarrow{\boldsymbol{y}}_{+}+n\widehat{d}\;{\boldsymbol{e}}_{i^{*}}
5:  𝒚{\boldsymbol{y}}^{*}.

The algorithm follows two basic steps. First, it represents the constraint set as a single simplex ball S(\widehat{{\boldsymbol{x}}},\widehat{d}) via (3.3). Second, it uses the closed-form solution of linear optimization over a simplex ball (Lemma 3.1(3)) to find the optimal solution.
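For illustration, a direct numpy transcription of Alg. 1 (a sketch; variable names are ours) might look as follows; the quick check at the end confirms that the output stays in S_n.

```python
import numpy as np

def slmo(x, d, c):
    """SLMO(x, d, c): linear minimization over S(x, d) ∩ S_n, following Alg. 1."""
    n = len(x)
    d_hat = np.minimum(x, d).sum() / n        # Line 1, cf. (3.3)
    x_hat = x - np.minimum(x, d) + d_hat      # Line 2
    y = x_hat - d_hat                         # Line 3: y_+ = x_hat - d_hat * 1_n
    y[np.argmin(c)] += n * d_hat              # Lines 4-5: add n * d_hat at the best coordinate
    return y

# quick sanity check: the output sums to one and is nonnegative
rng = np.random.default_rng(0)
x = rng.dirichlet(np.ones(8)); c = rng.standard_normal(8)
y = slmo(x, 0.2, c)
assert abs(y.sum() - 1.0) < 1e-12 and (y >= -1e-12).all()
```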

Lemma 3.3.

Alg. 1 finds an optimal solution to Problem (3.5).

Proof.

First, by Lemma 3.1(2), we have S(𝒙,d)Sn=S(𝒙^,d^)S({\boldsymbol{x}},d)\cap S_{n}=S(\widehat{{\boldsymbol{x}}},\widehat{d}), where the definitions of 𝒙^\widehat{{\boldsymbol{x}}} and d^\widehat{d} are given in (3.3). Consequently, Problem (3.5) is equivalent to min𝒚S(𝒙^,d^)𝒚,𝒄\min_{{\boldsymbol{y}}\in S(\widehat{{\boldsymbol{x}}},\widehat{d})}\langle{\boldsymbol{y}},{\boldsymbol{c}}\rangle. Note that this is the same form as the problem in Lemma 3.1(3). Thus, we have

𝒚=𝒙^+d^(n𝒆i𝟏n)=max{𝒙,d𝟏n}d𝟏n+nd^𝒆i.{\boldsymbol{y}}^{*}=\widehat{{\boldsymbol{x}}}+\widehat{d}(n{\boldsymbol{e}}_{i^{*}}-{\boldsymbol{1}}_{n})=\max\{{\boldsymbol{x}},d{\boldsymbol{1}}_{n}\}-d{\boldsymbol{1}}_{n}+n\widehat{d}\;{\boldsymbol{e}}_{i^{*}}.

Consequently, Alg. 1 solves Problem (3.5). ∎

Remark 1.

(Comparison with \mbox{LMO}({\boldsymbol{c}},S_{n}) and \ell_{1}-\mbox{LMO}({\boldsymbol{x}},d,{\boldsymbol{c}})) If we treat the element-wise minimum between two vectors as a basic operation, then SLMO requires only one extra basic operation, one more vector summation, and one more vector addition compared to the original LMO over the simplex S_{n}. Therefore, its total exact complexity is 4n flops, making it nearly as efficient as \mbox{LMO}({\boldsymbol{c}},S_{n}). However, \ell_{1}-\mbox{LMO}({\boldsymbol{x}},d,{\boldsymbol{c}}) involves a sorting operation, whose overall complexity is usually O(n\log(n)). It also involves a few more vector additions. As will be illustrated in Fig. 2(c), it is far less efficient than the original \mbox{LMO}({\boldsymbol{c}},S_{n}) and SLMO. Therefore, we expect that a linearly convergent algorithm with SLMO should be efficient as well. We develop it below.

3.2 SFW: Simplex Frank-Wolfe Method

In this subsection, we present a new variant of Frank-Wolfe method called Simplex Frank-Wolfe (abbreviated as SFW), obtained by replacing the LMO with SLMO. The algorithm is formally described as follows.

Algorithm 2 Simplex Frank-Wolfe Method: SFW
0:  𝒙0Sn{\boldsymbol{x}}_{0}\in S_{n}, initial lower bound B0B_{0}.
1:  Set: d02(f(𝒙0)B0)μ.d_{0}\leftarrow\sqrt{\frac{2(f({\boldsymbol{x}}_{0})-B_{0})}{\mu}}.
2:  for k=1,k=1,\dots do
3:   Compute 𝒚kSLMO(𝒙k1,dk1,f(𝒙k1)){\boldsymbol{y}}_{k}\in\mbox{SLMO}({{\boldsymbol{x}}}_{k-1},{d}_{k-1},\nabla f({\boldsymbol{x}}_{k-1})).
4:   Compute the working lower bound: Bkwf(𝒙k1)+f(𝒙k1),𝒚k𝒙k1B_{k}^{w}\leftarrow f({\boldsymbol{x}}_{k-1})+\langle\nabla f({\boldsymbol{x}}_{k-1}),{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rangle.
5:   Update best bound Bkmax{Bk1,Bkw}B_{k}\leftarrow\max\{B_{k-1},B_{k}^{w}\}.
6:   Set 𝒙k(1δk)𝒙k1+δk𝒚k{\boldsymbol{x}}_{k}\leftarrow(1-\delta_{k}){\boldsymbol{x}}_{k-1}+\delta_{k}{\boldsymbol{y}}_{k} for some δk[0,1]\delta_{k}\in[0,1].
7:   Set: dk2(f(𝒙k)Bk)μd_{k}\leftarrow\sqrt{\frac{2(f({\boldsymbol{x}}_{k})-B_{k})}{\mu}}.
8:  end for

Before stating its convergence rate result, we make the following remarks regarding Alg. 2.

Remark 2.

(Choice of d_{k} update strategy) We could follow the linear shrinking rule for d_{k} in [14, 24]: d_{k}=\gamma d_{k-1} with properly chosen \gamma<1. The linear convergence rate would then be guaranteed by invoking the LLOO property of SLMO (Lemma 3.2). We do not take this route for the convergence analysis for the following two reasons. One is that the key property \mathbb{B}({\boldsymbol{x}},d)\subseteq S({\boldsymbol{x}},d)\subseteq\mathbb{B}({\boldsymbol{x}},nd) ensuring LLOO will have to be modified when it comes to the general polytope, as our simplex ball is not based on any norm. The corresponding relationship becomes \mathbb{B}({\boldsymbol{x}},dD/\eta)\subseteq S_{\mathcal{P}}({\boldsymbol{x}},d)\subseteq\mathbb{B}({\boldsymbol{x}},(n+1)dD), see Lemma 4.3, where S_{\mathcal{P}} is defined. At least at a technical level, the original LLOO would have to be generalized to suit this extension, and the corresponding proofs would also have to be reproduced. The proof we provide below is more direct. The other reason is that we use the best lower bound B_{k} provided by the algorithm to define d_{k}. This choice is important because we are going to incorporate other acceleration strategies into SFW, resulting in rSFW. The stopping criterion used there will also be based on the best lower bounds obtained. Therefore, the convergence analysis for SFW naturally adapts to rSFW.

At the k-th iteration, Alg. 2 first invokes SLMO to find the minimizer {\boldsymbol{y}}_{k} of the first-order approximation of the objective function within the region S({\boldsymbol{x}}_{k-1},d_{k-1})\cap S_{n}. Subsequently, using a suitable step size \delta_{k}, a convex combination of {\boldsymbol{y}}_{k} and {\boldsymbol{x}}_{k-1} is computed to update the iterate to {\boldsymbol{x}}_{k}. Finally, the algorithm updates the radius d, ensuring that the optimal solution {\boldsymbol{x}}^{*} progressively falls within a smaller neighborhood S({\boldsymbol{x}}_{k},d_{k}). For Alg. 2, we propose the following simple step size, as an alternative to the simple step size selection in the original Frank-Wolfe algorithm:

δk=μ2Ln2.\delta_{k}=\frac{\mu}{2Ln^{2}}. (3.6)
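A minimal numpy sketch of Alg. 2 with the simple step size (3.6) is given below. It is our own illustration under a toy setup: it assumes the constants \mu and L are known, and it repeats the slmo routine from the sketch after Alg. 1 in compact form so that the snippet is self-contained.

```python
import numpy as np

def slmo(x, d, c):                          # Alg. 1, in compact form
    n = len(x)
    d_hat = np.minimum(x, d).sum() / n
    y = x - np.minimum(x, d)                # = x_hat - d_hat * 1_n
    y[np.argmin(c)] += n * d_hat
    return y

def sfw(f, grad, x0, B0, mu, L, iters=500):
    """Simplex Frank-Wolfe (Alg. 2) with the simple step size (3.6)."""
    n = len(x0)
    x, B = x0.copy(), B0
    d = np.sqrt(max(2.0 * (f(x) - B) / mu, 0.0))      # Line 1
    delta = mu / (2.0 * L * n ** 2)                    # step size (3.6)
    for _ in range(iters):
        g = grad(x)
        y = slmo(x, d, g)                              # Line 3
        B = max(B, f(x) + g @ (y - x))                 # Lines 4-5: best lower bound
        x = (1.0 - delta) * x + delta * y              # Line 6
        d = np.sqrt(max(2.0 * (f(x) - B) / mu, 0.0))   # Line 7: radius update
    return x, B
```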

We have the following linear convergence rate result. The induction technique in the proof below was taken from [14, Lemma 4.3].

Theorem 3.4.

Let \{{\boldsymbol{x}}_{k}\} be the sequence generated by Alg. 2 with step size policy for \{\delta_{k}\} in (2.3), (2.4), or (3.6). Then, for k\geq 0, we have

f(𝒙k)ff(𝒙k)Bkμd022eμ4Ln2k.f({\boldsymbol{x}}_{k})-f^{*}\leq f({\boldsymbol{x}}_{k})-B_{k}\leq\frac{\mu d_{0}^{2}}{2}e^{-\frac{\mu}{4Ln^{2}}k}. (3.7)
Proof.

We first claim that 𝒙S(𝒙k,dk){\boldsymbol{x}}^{*}\in S({\boldsymbol{x}}_{k},d_{k}) and that f(𝒙k)Bkμdk22f({\boldsymbol{x}}_{k})-B_{k}\leq\frac{\mu d_{k}^{2}}{2}. We prove this by induction. First, we have

μd022=f(𝒙0)B0f(𝒙0)f(a)μ2𝒙0𝒙2,\frac{\mu d_{0}^{2}}{2}=f({\boldsymbol{x}}_{0})-B_{0}\geq f({\boldsymbol{x}}_{0})-f^{*}\stackrel{{\scriptstyle(a)}}{{\geq}}\frac{\mu}{2}\lVert{\boldsymbol{x}}_{0}-{\boldsymbol{x}}^{*}\rVert^{2},

where (a)(a) comes from (2.1). This implies that 𝒙0𝒙d0\lVert{\boldsymbol{x}}_{0}-{\boldsymbol{x}}^{*}\rVert\leq d_{0}, and by Lemma 3.1(5), we have 𝒙S(𝒙0,d0){\boldsymbol{x}}^{*}\in S({\boldsymbol{x}}_{0},d_{0}). Therefore, the claim holds for k=0k=0.

Now suppose that {\boldsymbol{x}}^{*}\in S({\boldsymbol{x}}_{t},d_{t}) and f({\boldsymbol{x}}_{t})-B_{t}\leq\frac{\mu d_{t}^{2}}{2} for all t\leq k-1. Let \gamma:=\frac{\mu}{2Ln^{2}}\leq 1. For the step size policy \delta_{k} in (2.3) (exact line search) or (3.6), we have

f(𝒙k)\displaystyle f({\boldsymbol{x}}_{k}) =\displaystyle= f(𝒙k1+δk(𝒚k𝒙k1))f(𝒙k1+γ(𝒚k𝒙k1))\displaystyle f({\boldsymbol{x}}_{k-1}+\delta_{k}({\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}))\leq f({\boldsymbol{x}}_{k-1}+\gamma({\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1})) (3.8)
\displaystyle\leq f(𝒙k1)+γf(𝒙k1),𝒚k𝒙k1+Lγ22𝒚k𝒙k12.\displaystyle f({\boldsymbol{x}}_{k-1})+\gamma\langle\nabla f({\boldsymbol{x}}_{k-1}),{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rangle+\frac{L\gamma^{2}}{2}\lVert{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rVert^{2}.

Similarly, for the step size policy (2.4) (short stepsize), we have

f(𝒙k)\displaystyle f({\boldsymbol{x}}_{k}) \displaystyle\leq f(𝒙k1)+δkf(𝒙k1),𝒚k𝒙k1+Lδk22𝒚k𝒙k12\displaystyle f({\boldsymbol{x}}_{k-1})+\delta_{k}\langle\nabla f({\boldsymbol{x}}_{k-1}),{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rangle+\frac{L\delta_{k}^{2}}{2}\lVert{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rVert^{2} (3.9)
\displaystyle\leq f(𝒙k1)+γf(𝒙k1),𝒚k𝒙k1+Lγ22𝒚k𝒙k12.\displaystyle f({\boldsymbol{x}}_{k-1})+\gamma\langle\nabla f({\boldsymbol{x}}_{k-1}),{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rangle+\frac{L\gamma^{2}}{2}\lVert{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rVert^{2}.

Combining (3.8) and (3.9), we have

f(𝒙k)\displaystyle f({\boldsymbol{x}}_{k})\leq f(𝒙k1)+γf(𝒙k1),𝒚k𝒙k1+Lγ22𝒚k𝒙k12\displaystyle f({\boldsymbol{x}}_{k-1})+\gamma\langle\nabla f({\boldsymbol{x}}_{k-1}),{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rangle+\frac{L\gamma^{2}}{2}\lVert{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rVert^{2}
(b)\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}} (1γ)f(𝒙k1)+γBkw+Lγ22𝒚k𝒙k12\displaystyle(1-\gamma)f({\boldsymbol{x}}_{k-1})+\gamma B_{k}^{w}+\frac{L\gamma^{2}}{2}\lVert{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rVert^{2}
(c)\displaystyle\stackrel{{\scriptstyle(c)}}{{\leq}} (1γ)(f(𝒙k1)Bk1)+Bk+Lγ22𝒚k𝒙k12\displaystyle(1-\gamma)(f({\boldsymbol{x}}_{k-1})-B_{k-1})+B_{k}+\frac{L\gamma^{2}}{2}\lVert{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rVert^{2}

holds for step size policy (2.3), (2.4), or (3.6), where (b)(b) comes from the definition of BkwB_{k}^{w}, and (c)(c) is due to Bkmax{Bk1,Bkw}B_{k}\geq\max\{B_{k-1},B_{k}^{w}\}. Subtracting BkB_{k} from the both sides of the above inequality, we obtain

f(𝒙k)Bk\displaystyle f({\boldsymbol{x}}_{k})-B_{k}\leq (1γ)(f(𝒙k1)Bk1)+Lγ22𝒚k𝒙k12\displaystyle(1-\gamma)(f({\boldsymbol{x}}_{k-1})-B_{k-1})+\frac{L\gamma^{2}}{2}\lVert{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rVert^{2}
(d)\displaystyle\stackrel{{\scriptstyle(d)}}{{\leq}} (1γ)μ2dk12+Lγ22n2dk12=[(1γ)μ2+Lγ2n22]dk12,\displaystyle(1-\gamma)\frac{\mu}{2}d_{k-1}^{2}+\frac{L\gamma^{2}}{2}n^{2}d_{k-1}^{2}=\left[(1-\gamma)\frac{\mu}{2}+\frac{L\gamma^{2}n^{2}}{2}\right]d_{k-1}^{2},

where (d)(d) is due to our inductive hypothesis and Lemma 3.1(5). By plugging in the value of γ\gamma, and using 1xex1-x\leq e^{-x}, we have that

f(𝒙k)Bkμ2(1μ4Ln2)dk12μ2eμ4Ln2dk12.f({\boldsymbol{x}}_{k})-B_{k}\leq\frac{\mu}{2}\left(1-\frac{\mu}{4Ln^{2}}\right)d_{k-1}^{2}\leq\frac{\mu}{2}e^{-\frac{\mu}{4Ln^{2}}}d_{k-1}^{2}. (3.10)

With the definition of dkd_{k}, we have f(𝒙k)Bk=μ2dk2f({\boldsymbol{x}}_{k})-B_{k}=\frac{\mu}{2}d_{k}^{2}. The bound in (3.10) implies

dkeμ8Ln2dk1.d_{k}\leq e^{-\frac{\mu}{8Ln^{2}}}{d}_{k-1}. (3.11)

By the inductive hypothesis, we know that 𝒙S(𝒙t,dt){\boldsymbol{x}}^{*}\in S({\boldsymbol{x}}_{t},d_{t}) holds for all tk1t\leq k-1. Thus Bt+1wB_{t+1}^{w} is a valid lower bound of ff^{*}, and consequently, BkB_{k} is also a lower bound of ff^{*}. Now by (2.1), we have

𝒙k𝒙22μ(f(𝒙k)f)2μ(f(𝒙k)Bk)=dk2.\lVert{\boldsymbol{x}}_{k}-{\boldsymbol{x}}^{*}\rVert^{2}\leq\frac{2}{\mu}(f({\boldsymbol{x}}_{k})-f^{*})\leq\frac{2}{\mu}(f({\boldsymbol{x}}_{k})-B_{k})=d_{k}^{2}.

This implies that 𝒙S(𝒙k,dk){\boldsymbol{x}}^{*}\in S({\boldsymbol{x}}_{k},d_{k}) by Lemma 3.1(5). Therefore, we have completed the proof of the claim.

We now start to prove the conclusion in Theorem 3.4. From the earlier proof, we know that BkfB_{k}\leq f^{*}, thus confirming the first part of the inequality. By the definition of dkd_{k} and the established reduction inequality (3.11), we have

f(𝒙k)Bk=μdk22μd022eμ4Ln2k.f({\boldsymbol{x}}_{k})-B_{k}=\frac{\mu d_{k}^{2}}{2}\leq\frac{\mu d_{0}^{2}}{2}e^{-\frac{\mu}{4Ln^{2}}k}.

The proof is thus completed. ∎

Remark 3.

(Iteration complexity of SFW) If we skip Lines 4–5 and replace Line 7 in Alg. 2 with d_{k}\leftarrow\sqrt{\frac{2\langle\nabla f({\boldsymbol{x}}_{k-1}),{\boldsymbol{x}}_{k-1}-{\boldsymbol{y}}_{k}\rangle}{\mu}}, the algorithm remains correct and preserves its convergence guarantee. This modification avoids computing the objective value f({\boldsymbol{x}}_{k}), making the algorithm more practical and simpler when the objective is expensive or difficult to evaluate. In this case, assuming that there is an oracle to obtain the gradient \nabla f({\boldsymbol{x}}) at each iteration, we are able to give the exact number of flops required to compute the next iterate. The SLMO part to get {\boldsymbol{y}}_{k} is 4n flops, and computing the radius d_{k} requires an additional 3n flops. For the short step size strategy, evaluating the step size \delta_{k} incurs 2n flops, and updating the iterate {\boldsymbol{x}}_{k} takes another 3n flops. Thus, SFW needs 10n flops with the simple step size, or 12n flops with the short step size, yet still achieves linear convergence on the simplex. To our knowledge, this is the lowest per-iteration cost among FW variants with linear convergence.

3.3 Refining SFW

In this subsection, we aim to further reduce the computational overhead of the proposed oracle as much as possible, while retaining the linear convergence rate. The motivation stems from an important observation: \mbox{SLMO}({\boldsymbol{x}},d,{\boldsymbol{c}}) can be split into two parts. The first part is the construction of a new simplex ball S(\widehat{{\boldsymbol{x}}},\widehat{d})=S({\boldsymbol{x}},d)\cap S_{n}. For easy reference, we call it SLMO-1; it corresponds to Lines 1-3 in Alg. 1. The second part, which finds an optimal solution over S(\widehat{{\boldsymbol{x}}},\widehat{d}), is referred to as SLMO-2 and corresponds to Lines 4-5 in Alg. 1. It is easy to see that the computation of SLMO-2 requires only one more vector addition compared to the standard LMO over S_{n}, and hence its cost is already kept to a minimum. The extra computation of SLMO comes from SLMO-1. Since the new simplex ball has already been constructed, we would like to carry out the SLMO-2 part a few more times. This is roughly to find an approximate solution {\boldsymbol{p}} to the problem

minf(𝒚)s.t.𝒚S(𝒙^,d^)\min\;f({\boldsymbol{y}})\quad\mbox{s.t.}\quad{\boldsymbol{y}}\in S(\widehat{{\boldsymbol{x}}},\widehat{d})

with starting point {\boldsymbol{x}}. We control this refinement step so that the overall computation remains O(n) and the overall convergence rate remains linear. The overall algorithm is given in Alg. 3 and is called the Refined SFW (rSFW).

Algorithm 3 rSFW: Refined Simplex Frank-Wolfe Method
0:  Radius contraction ratio ρ>1\rho>1, initial lower bound B0B_{0}.
1:  Set: d01n,𝒙0𝟏nn,J8ρ2n2Lμ,𝒙¯0𝒙0,d¯0d0d_{0}\leftarrow\frac{1}{n},{\boldsymbol{x}}_{0}\leftarrow\frac{{\boldsymbol{1}}_{n}}{n},J\leftarrow\frac{8\rho^{2}n^{2}L}{\mu},\bar{{\boldsymbol{x}}}_{0}\leftarrow{\boldsymbol{x}}_{0},\bar{d}_{0}\leftarrow d_{0}.
2:  for k=1,k=1,\dots do
3:   Set: 𝒑0𝒙k1,C0Bk1{\boldsymbol{p}}_{0}\leftarrow{\boldsymbol{x}}_{k-1},C_{0}\leftarrow B_{k-1}.
4:   (SLMO-1) construct the new Simplex ball: S(𝒙^k1,d^k1)=S(𝒙¯k1,d¯k1)SnS(\widehat{{\boldsymbol{x}}}_{k-1},\widehat{d}_{k-1})=S({\bar{{\boldsymbol{x}}}}_{k-1},\bar{d}_{k-1})\cap S_{n}.
5:   for j=1,,Jj=1,\dots,J do
6:    (SLMO-2): Compute 𝒚j=argmin𝒚S(𝒙^k1,d^k1)f(𝒑j1),𝒚{\boldsymbol{y}}_{j}={\operatorname*{\arg\min}}_{{\boldsymbol{y}}\in S(\widehat{{\boldsymbol{x}}}_{k-1},\widehat{d}_{k-1})}\langle\nabla f({\boldsymbol{p}}_{j-1}),{\boldsymbol{y}}\rangle.
7:    Compute the current lower bound: Cjwf(𝒑j1)+f(𝒑j1),𝒚j𝒑j1C_{j}^{w}\leftarrow f({\boldsymbol{p}}_{j-1})+\langle\nabla f({\boldsymbol{p}}_{j-1}),{\boldsymbol{y}}_{j}-{\boldsymbol{p}}_{j-1}\rangle.
8:    Update the best lower bound Cjmax{Cj1,Cjw}C_{j}\leftarrow\max\{C_{j-1},C_{j}^{w}\}.
9:    if f(𝒑j)Cjμ2ρ2d^k12f({\boldsymbol{p}}_{j})-C_{j}\leq\frac{\mu}{2\rho^{2}}\widehat{d}_{k-1}^{2} then
10:     Break the inner loop.
11:    end if
12:    Set 𝒑j(1δj)𝒑j1+δj𝒚j{\boldsymbol{p}}_{j}\leftarrow(1-\delta_{j}){\boldsymbol{p}}_{j-1}+\delta_{j}{\boldsymbol{y}}_{j} for some δj[0,1]\delta_{j}\in[0,1].
13:   end for
14:   Set: 𝒙k𝒑j,dkd^k1ρ,BkCj{\boldsymbol{x}}_{k}\leftarrow{\boldsymbol{p}}_{j},d_{k}\leftarrow\frac{\widehat{d}_{k-1}}{\rho},B_{k}\leftarrow C_{j}.
15:   Find (𝒙¯k,d¯k)(\bar{{\boldsymbol{x}}}_{k},\bar{d}_{k}) such that S(𝒙¯k,d¯k)=S(𝒙k,dk)S(𝒙^k1,d^k1)S(\bar{{\boldsymbol{x}}}_{k},\bar{d}_{k})=S({\boldsymbol{x}}_{k},d_{k})\cap S(\widehat{{\boldsymbol{x}}}_{k-1},\widehat{d}_{k-1}) by using (3.4).
16:  end for

Alg. 3 keeps three sequences {(𝒙k,dk)}\{({\boldsymbol{x}}_{k},d_{k})\}, {(𝒙¯k,d¯k)}\{(\bar{{\boldsymbol{x}}}_{k},\bar{d}_{k})\}, and {(𝒙^k,d^k)}\{(\widehat{{\boldsymbol{x}}}_{k},\widehat{d}_{k})\}, each associated with a simplex ball, namely S(𝒙k,dk)S({\boldsymbol{x}}_{k},d_{k}), S(𝒙¯k,d¯k)S(\bar{{\boldsymbol{x}}}_{k},\bar{d}_{k}), and S(𝒙^k,d^k)S(\widehat{{\boldsymbol{x}}}_{k},\widehat{d}_{k}). Starting with (𝒙0,d0)=(𝒙¯0,d¯0)({\boldsymbol{x}}_{0},d_{0})=(\bar{{\boldsymbol{x}}}_{0},\bar{d}_{0}), we define (𝒙^0,d^0)(\widehat{{\boldsymbol{x}}}_{0},\widehat{d}_{0}) such that S(𝒙^0,d^0)=S(𝒙¯0,d¯0)SnS(\widehat{{\boldsymbol{x}}}_{0},\widehat{d}_{0})=S(\bar{{\boldsymbol{x}}}_{0},\bar{d}_{0})\cap S_{n}. At the kkth iteration, we first compute

𝒙kargmin{f(𝒚)|𝒚S(𝒙^k1,d^k1)}anddk=d^k1/ρ.{\boldsymbol{x}}_{k}\approx\operatorname*{\arg\min}\left\{f({\boldsymbol{y}})\ |\ {\boldsymbol{y}}\in S(\widehat{{\boldsymbol{x}}}_{k-1},\widehat{d}_{k-1})\right\}\quad\mbox{and}\quad d_{k}=\widehat{d}_{k-1}/\rho.

We then define (𝒙¯k,d¯k)(\bar{{\boldsymbol{x}}}_{k},\bar{d}_{k}) by its simplex ball, which satisfies S(𝒙¯k,d¯k)=S(𝒙k,dk)S(𝒙^k1,d^k1)S(\bar{{\boldsymbol{x}}}_{k},\bar{d}_{k})=S({\boldsymbol{x}}_{k},d_{k})\cap S(\widehat{{\boldsymbol{x}}}_{k-1},\widehat{d}_{k-1}). We further define the iterate (𝒙^k,d^k)(\widehat{{\boldsymbol{x}}}_{k},\widehat{d}_{k}) by its simplex ball satisfying S(𝒙^k,d^k)=S(𝒙¯k,d¯k)SnS(\widehat{{\boldsymbol{x}}}_{k},\widehat{d}_{k})=S(\bar{{\boldsymbol{x}}}_{k},\bar{d}_{k})\cap S_{n}. They can all be efficiently computed via the formula (3.4).
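Line 15 of Alg. 3 only needs the formula (3.4); a small numpy helper for it (our own sketch; it assumes both centers lie in S_n and the intersection is nonempty) could read as follows.

```python
import numpy as np

def intersect_simplex_balls(x1, d1, x2, d2):
    """Return (x3, d3) with S(x3, d3) = S(x1, d1) ∩ S(x2, d2), cf. (3.4).
    Assumes x1, x2 ∈ S_n and a nonempty intersection."""
    n = len(x1)
    d3 = (1.0 + np.minimum(d1 - x1, d2 - x2).sum()) / n
    x3 = np.maximum(x1 - d1, x2 - d2) + d3
    return x3, d3
```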

A great advantage of Alg. 3 is its computation of {\boldsymbol{x}}_{k}. The oracle we call is SLMO-2, which ensures that the per-iteration complexity remains the same as that of the original FW algorithm. It is also important to highlight that the inner loop of our algorithm (Lines 5-13) follows the standard FW algorithm. Its primary goal is to find a solution {\boldsymbol{p}}_{j} that satisfies f({\boldsymbol{p}}_{j})-C_{j}\leq\frac{\mu}{2\rho^{2}}\widehat{d}_{k-1}^{2}. Therefore, various speedup techniques for the classical FW algorithm can be directly applied to this inner loop without interfering with the outer loop of the algorithm. These include the ‘away-step’ and ‘pairwise’ variants of FW proposed by [22], the ‘fully-corrective’ variant of FW proposed by [19], as well as the warm-start technique suggested by [11].
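To fix ideas, a compact transcription of Alg. 3 is sketched below. This is our own illustration under toy conventions: the inner loop is plain FW with short steps (2.4), the trivial lower bound B_0 = -∞ is used, and the constants \mu and L are assumed known; the intersection steps inline (3.3) and (3.4) directly.

```python
import numpy as np

def rsfw(f, grad, mu, L, n, rho=2.0, outer=30):
    """Refined Simplex FW (Alg. 3); the inner loop is plain FW with short steps."""
    x = np.ones(n) / n                     # x_0 = 1_n / n
    d = 1.0 / n                            # d_0 = 1 / n
    xb, db = x.copy(), d                   # (x_bar_0, d_bar_0) = (x_0, d_0)
    B = -np.inf                            # a (trivial) initial lower bound B_0
    J = int(np.ceil(8 * rho**2 * n**2 * L / mu))
    for _ in range(outer):
        # SLMO-1: S(x_hat, d_hat) = S(x_bar, d_bar) ∩ S_n, cf. (3.3)
        d_hat = np.minimum(xb, db).sum() / n
        x_hat = xb - np.minimum(xb, db) + d_hat
        p, C = x.copy(), B
        for _ in range(J):
            g = grad(p)
            # SLMO-2: closed-form LMO over S(x_hat, d_hat), cf. Lemma 3.1(3)
            y = x_hat - d_hat
            y[np.argmin(g)] += n * d_hat
            C = max(C, f(p) + g @ (y - p))                  # Lines 7-8: lower bound
            if f(p) - C <= 0.5 * mu / rho**2 * d_hat**2:    # Lines 9-11: early exit
                break
            delta = min(1.0, g @ (p - y) / (L * np.dot(p - y, p - y)))
            p = (1.0 - delta) * p + delta * y               # Line 12
        x, d, B = p, d_hat / rho, C                         # Line 14
        # Line 15: S(x_bar, d_bar) = S(x, d) ∩ S(x_hat, d_hat), cf. (3.4)
        db = (1.0 + np.minimum(d - x, d_hat - x_hat).sum()) / n
        xb = np.maximum(x - d, x_hat - d_hat) + db
    return x, B
```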

The following theorem summarizes the convergence result for this algorithm.

Theorem 3.5.

Let \{{\boldsymbol{x}}_{k}\} be the sequence generated by Alg. 3 with step size policy for \{\delta_{j}\} in (2.2)-(2.4). Then, for k\geq 1, we have

f(𝒙k)ff(𝒙k)Bkμ2n2ρ2k.f({\boldsymbol{x}}_{k})-f^{*}\leq f({\boldsymbol{x}}_{k})-B_{k}\leq\frac{\mu}{2n^{2}\rho^{2k}}.
Proof.

We first claim that 𝒙S(𝒙^k,d^k){\boldsymbol{x}}^{*}\in S(\widehat{{\boldsymbol{x}}}_{k},\widehat{d}_{k}) for any k0k\geq 0 and we prove this by induction. For k=0k=0, we have (𝒙¯0,d¯0)=(𝒙0,d0)=(𝟏n/n,1/n)(\bar{{\boldsymbol{x}}}_{0},\bar{d}_{0})=({\boldsymbol{x}}_{0},d_{0})=({\boldsymbol{1}}_{n}/n,1/n). By the definition of S(𝒙^0,d^0)S(\widehat{{\boldsymbol{x}}}_{0},\widehat{d}_{0}), we have

S(𝒙^0,d^0)=S(𝒙¯0,d¯0)Sn=S(𝟏n/n,1/n)Sn=Sn.S(\widehat{{\boldsymbol{x}}}_{0},\widehat{d}_{0})=S(\bar{{\boldsymbol{x}}}_{0},\bar{d}_{0})\cap S_{n}=S({\boldsymbol{1}}_{n}/n,1/n)\cap S_{n}=S_{n}.

Hence, 𝒙S(𝒙^0,d^0){\boldsymbol{x}}^{*}\in S(\widehat{{\boldsymbol{x}}}_{0},\widehat{d}_{0}). Now suppose that 𝒙S(𝒙^k1,d^k1){\boldsymbol{x}}^{*}\in S(\widehat{{\boldsymbol{x}}}_{k-1},\widehat{d}_{k-1}) for some k1k\geq 1. Note that the inner loop of Alg. 3 corresponds to the standard Frank-Wolfe algorithm. By Theorem 2.1 and Lemma 3.1(4), which implies that the diameter of S(𝒙^k1,d^k1)S(\widehat{{\boldsymbol{x}}}_{k-1},\widehat{d}_{k-1}) is 2nd^k1\sqrt{2}n\widehat{d}_{k-1}, we have

f(𝒑j)f2Lj+1(2nd^k1)2=4Ln2d^k12j+1f({\boldsymbol{p}}_{j})-f^{*}\leq\frac{2L}{j+1}\left(\sqrt{2}n\widehat{d}_{k-1}\right)^{2}=\frac{4Ln^{2}\widehat{d}_{k-1}^{2}}{j+1}

hold for all j[J]j\in[J]. In the case where the inner loop terminates at j=Jj=J, we obtain

f(𝒙k)f=f(𝒑J)ff(𝒑J)CJ2L8ρ2n2Lμ2n2d^k12=μ2ρ2d^k12=μ2dk2.f({\boldsymbol{x}}_{k})-f^{*}=f({\boldsymbol{p}}_{J})-f^{*}\leq f({\boldsymbol{p}}_{J})-C_{J}\leq\frac{2L}{\frac{8\rho^{2}n^{2}L}{\mu}}2n^{2}\widehat{d}_{k-1}^{2}=\frac{\mu}{2\rho^{2}}\widehat{d}_{k-1}^{2}=\frac{\mu}{2}d_{k}^{2}.

Similarly, if the inner loop is interrupted due to lines 9-11 of the algorithm, we still have

f(𝒙k)ff(𝒑j)Cjμ2ρ2d^k12=μ2dk2.f({\boldsymbol{x}}_{k})-f^{*}\leq f({\boldsymbol{p}}_{j})-C_{j}\leq\frac{\mu}{2\rho^{2}}\widehat{d}_{k-1}^{2}=\frac{\mu}{2}d_{k}^{2}.

Using the fact that f(𝒙k)fμ2𝒙k𝒙2f({\boldsymbol{x}}_{k})-f^{*}\geq\frac{\mu}{2}\lVert{\boldsymbol{x}}_{k}-{\boldsymbol{x}}^{*}\rVert^{2}, we have 𝒙k𝒙2dk2\lVert{\boldsymbol{x}}_{k}-{\boldsymbol{x}}^{*}\rVert^{2}\leq d_{k}^{2}, which implies via Lemma 3.1(5) that 𝒙S(𝒙k,dk){\boldsymbol{x}}^{*}\in S({{\boldsymbol{x}}}_{k},{d}_{k}). This implies

𝒙\displaystyle{\boldsymbol{x}}^{*} \displaystyle\in S(𝒙k,dk)S(𝒙^k1,d^k1)(as𝒙S(𝒙^k1,d^k1)by induction)\displaystyle S({{\boldsymbol{x}}}_{k},{d}_{k})\cap S(\widehat{{\boldsymbol{x}}}_{k-1},\widehat{d}_{k-1})\qquad(\mbox{as}\ {\boldsymbol{x}}^{*}\in S(\widehat{{\boldsymbol{x}}}_{k-1},\widehat{d}_{k-1})\ \mbox{by induction})
=\displaystyle= S(𝒙k,dk)S(𝒙^k1,d^k1)Sn(as𝒙Sn)\displaystyle S({{\boldsymbol{x}}}_{k},{d}_{k})\cap S(\widehat{{\boldsymbol{x}}}_{k-1},\widehat{d}_{k-1})\cap S_{n}\qquad(\mbox{as}\ {\boldsymbol{x}}^{*}\in S_{n})
=\displaystyle= S(𝒙¯k,d¯k)Sn=S(𝒙^k,d^k).\displaystyle S(\bar{{\boldsymbol{x}}}_{k},\bar{d}_{k})\cap S_{n}=S(\widehat{{\boldsymbol{x}}}_{k},\widehat{d}_{k}).

We now start to prove the conclusion in Theorem 3.5. Since d0=1nd_{0}=\frac{1}{n} and d^kdk=d^k1ρdk1ρ\widehat{d}_{k}\leq d_{k}=\frac{\widehat{d}_{k-1}}{\rho}\leq\frac{d_{k-1}}{\rho}, we have d^k1nρk\widehat{d}_{k}\leq\frac{1}{n\rho^{k}} and thus

f(𝒙k)ff(𝒙k)Bkμ2dk2μ2n2ρ2k.f({\boldsymbol{x}}_{k})-f^{*}\leq f({\boldsymbol{x}}_{k})-B_{k}\leq\frac{\mu}{2}{d}_{k}^{2}\leq\frac{\mu}{2n^{2}\rho^{2k}}.

We complete the proof. ∎

Remark 4.

(Warm-start Strategy) For rSFW and rSFWP in the upcoming Section 4, when using the simple step size, the initial steps of each inner loop may perform poorly, causing the iterate {\boldsymbol{p}}_{j} to drift far away from the optimal solution. We found that the following heuristic warm-start strategy is effective in practice. Let J_{k}\ll J denote the actual number of inner loop iterations during the k-th outer loop iteration. When initiating the (k+1)-th outer loop, instead of starting the inner loop from j=1, begin from either j=\frac{J_{k}}{\rho^{\prime}} or j=\sqrt{\frac{\bar{d}_{k}}{\bar{d}_{k-1}}}\,J_{k}. Here \rho^{\prime}>1 is a hyperparameter, with \rho^{\prime}=2 serving as a reasonable default value.

Remark 5.

(Extension to Quadratic Growth Condition) Although the two main Thms. 3.4 and 3.5 are established under the strong convexity of f(\cdot), we would like to point out that the assumption can be weakened to the quadratic growth condition:

(f(𝒙)f(𝒙))1/2cdist(𝒙,X),𝒙𝒫,(f({\boldsymbol{x}})-f({\boldsymbol{x}}^{*}))^{1/2}\geq c~\mbox{dist}({\boldsymbol{x}},X^{*}),\qquad\forall\ {\boldsymbol{x}}\in\mathcal{P},

where c>0 and X^{*} is the solution set of Problem (1.1). This includes the well-known case f({\boldsymbol{x}})=g(A{\boldsymbol{x}})+\langle{\boldsymbol{b}},{\boldsymbol{x}}\rangle with g strongly convex, A\in\mathbb{R}^{m\times n} and {\boldsymbol{b}}\in\mathbb{R}^{n}; for a detailed investigation of this class of functions with FW methods, see [1]. In this paper, we did not pursue such an extension, as our main purpose is to introduce SLMO under the strong convexity setting for the sake of simplicity.

4 Generalization to Arbitrary Polytopes

This section extends the previous results for the unit simplex to arbitrary polytopes. This generalization allows our findings to be applied to a much wider range of optimization problems. Important properties relating the standard simplex S_{n} and general polytopes have been established in [14]. Our extension heavily relies on some of those results. This section is patterned after Section 3, with some details omitted to avoid repetition. We first define the simplex ball for a general polytope and the corresponding SLMO. We then describe the Simplex Frank-Wolfe method for a general polytope, followed by its refined version.

4.1 Simplex Ball and SLMO for Arbitrary Polytopes

Consider Problem (1.1) with 𝒫=Conv(𝒱)\mathcal{P}=\mbox{Conv}(\mathcal{V}) and 𝒱={𝒗1,,𝒗N}\mathcal{V}=\{{\boldsymbol{v}}_{1},\dots,{\boldsymbol{v}}_{N}\}. Therefore, any given 𝒙𝒫{\boldsymbol{x}}\in\mathcal{P} can be represented as a convex combination of those atoms 𝒗i{\boldsymbol{v}}_{i}. However, the convex combination may not be unique. We define a set-valued mapping \mathcal{M} from 𝒫\mathcal{P} to the following set:

(𝒙):={𝝀SN|𝒙=i=1Nλi𝒗i}.\mathcal{M}({\boldsymbol{x}}):=\left\{\boldsymbol{\lambda}\in S_{N}\ \left|\ {\boldsymbol{x}}=\sum_{i=1}^{N}\lambda_{i}{\boldsymbol{v}}_{i}\right.\right\}.

Recall that for 𝝀N\boldsymbol{\lambda}\in\mathbb{R}^{N} and d>0d>0, S(𝝀,d)S(\boldsymbol{\lambda},d) is the simplex ball defined in (3.2). The idea of defining a similar Simplex ball over the polytope can be summarized as follows:

𝒙𝒫𝝀x(𝒙)S(𝝀x,d)S𝒫(𝒙,d):={V𝝀|𝝀S(𝝀x,d)},{\boldsymbol{x}}\in\mathcal{P}\ \Longrightarrow\ \boldsymbol{\lambda}_{x}\in\mathcal{M}({\boldsymbol{x}})\ \Longrightarrow\ S(\boldsymbol{\lambda}_{x},d)\ \Longrightarrow\ S_{\mathcal{P}}({\boldsymbol{x}},d):=\left\{V\boldsymbol{\lambda}\ |\ \boldsymbol{\lambda}\in S(\boldsymbol{\lambda}_{x},d)\right\}, (4.1)

where VV consists of the columns 𝒗i{\boldsymbol{v}}_{i}, i[N]i\in[N]. At first glance, the Simplex ball S𝒫(𝒙,d)S_{\mathcal{P}}({\boldsymbol{x}},d) appears to depend on the particular choice of 𝝀x(𝒙)\boldsymbol{\lambda}_{x}\in\mathcal{M}({\boldsymbol{x}}). The following result dispels this dependence.

Lemma 4.1.

Given 𝐱𝒫{\boldsymbol{x}}\in\mathcal{P} and d>0d>0, let 𝛌x,𝛌x(𝐱)\boldsymbol{\lambda}_{x},\boldsymbol{\lambda}_{x}^{\prime}\in\mathcal{M}({\boldsymbol{x}}). Then for any 𝛌S(𝛌x,d)\boldsymbol{\lambda}\in S(\boldsymbol{\lambda}_{x},d), there exists 𝛌S(𝛌x,d)\boldsymbol{\lambda}^{\prime}\in S(\boldsymbol{\lambda}_{x}^{\prime},d) such that

i=1Nλ(i)𝒗i=i=1Nλ(i)𝒗i.\sum_{i=1}^{N}\lambda(i){\boldsymbol{v}}_{i}=\sum_{i=1}^{N}\lambda^{\prime}(i){\boldsymbol{v}}_{i}.
Proof.

By definition of (𝒙)\mathcal{M}({\boldsymbol{x}}), we know

i=1Nλx(i)𝒗i=i=1Nλx(i)𝒗i.\sum_{i=1}^{N}\lambda_{x}(i){\boldsymbol{v}}_{i}=\sum_{i=1}^{N}\lambda_{x}^{\prime}(i){\boldsymbol{v}}_{i}. (4.2)

It follows from 𝝀x=𝝀x𝝀x+𝝀x\boldsymbol{\lambda}_{x}^{\prime}=\boldsymbol{\lambda}_{x}-\boldsymbol{\lambda}_{x}+\boldsymbol{\lambda}_{x}^{\prime} and the definition of Simplex ball (3.1) that

S(𝝀x,d)=S(𝝀x,d)𝝀x+𝝀x.S(\boldsymbol{\lambda}_{x}^{\prime},d)=S(\boldsymbol{\lambda}_{x},d)-\boldsymbol{\lambda}_{x}+\boldsymbol{\lambda}_{x}^{\prime}. (4.3)

Let 𝝀:=𝝀𝝀x+𝝀x\boldsymbol{\lambda}^{\prime}:=\boldsymbol{\lambda}-\boldsymbol{\lambda}_{x}+\boldsymbol{\lambda}_{x}^{\prime}. Since 𝝀S(𝝀x,d)\boldsymbol{\lambda}\in S(\boldsymbol{\lambda}_{x},d), the identity (4.3) implies 𝝀S(𝝀x,d)\boldsymbol{\lambda}^{\prime}\in S(\boldsymbol{\lambda}_{x}^{\prime},d). Moreover, we have

i=1Nλ(i)𝒗i=i=1N(λ(i)λx(i)+λx(i))𝒗i=i=1Nλ(i)𝒗i,\sum_{i=1}^{N}\lambda^{\prime}(i){\boldsymbol{v}}_{i}=\sum_{i=1}^{N}(\lambda(i)-\lambda_{x}(i)+\lambda_{x}^{\prime}(i)){\boldsymbol{v}}_{i}=\sum_{i=1}^{N}\lambda(i){\boldsymbol{v}}_{i},

where the last equality uses (4.2). This completes the proof. ∎

Lemma 4.1 ensures that the Simplex ball S𝒫(𝒙,d)S_{\mathcal{P}}({\boldsymbol{x}},d) in (4.1) is independent of the choice of 𝝀x(𝒙)\boldsymbol{\lambda}_{x}\in\mathcal{M}({\boldsymbol{x}}); hence it is well defined. Given a linear objective 𝒄n{\boldsymbol{c}}\in\mathbb{R}^{n}, we extend it to 𝒄extN{\boldsymbol{c}}_{ext}\in\mathbb{R}^{N} such that cext(i)=𝒗i,𝒄c_{ext}(i)=\langle{\boldsymbol{v}}_{i},{\boldsymbol{c}}\rangle for all i[N]i\in[N]. Consequently, the following equivalence holds:

min𝒚𝒫𝒚,𝒄=min𝝀SN𝝀,𝒄ext.\min_{{\boldsymbol{y}}\in\mathcal{P}}\langle{\boldsymbol{y}},{\boldsymbol{c}}\rangle=\min_{\boldsymbol{\lambda}\in S_{N}}\langle\boldsymbol{\lambda},{\boldsymbol{c}}_{ext}\rangle.

Leveraging this equivalence, we define the generalized SLMO for 𝒫\mathcal{P} as follows.

Definition 5 (SLMO𝒫\mbox{SLMO}_{\mathcal{P}}: SLMO over 𝒫\mathcal{P}).

Given a linear objective 𝐜n{\boldsymbol{c}}\in\mathbb{R}^{n}, radius d>0d>0, a point 𝐱𝒫{\boldsymbol{x}}\in\mathcal{P}, and its corresponding 𝛌x(𝐱)\boldsymbol{\lambda}_{x}\in\mathcal{M}({\boldsymbol{x}}), a solution 𝐲SLMO𝒫(𝐱,d,𝐜,𝛌x){\boldsymbol{y}}^{*}\in\mbox{SLMO}_{\mathcal{P}}({\boldsymbol{x}},d,{\boldsymbol{c}},\boldsymbol{\lambda}_{x}) is referred to as a generalized simplex-based linear minimization oracle if

𝒚=i=1Nλi𝒗i,{\boldsymbol{y}}^{*}=\sum_{i=1}^{N}\lambda_{i}^{*}{\boldsymbol{v}}_{i},

where 𝛌\boldsymbol{\lambda}^{*} is an optimal solution to the following optimization problem

min𝝀,𝒄ext\displaystyle\min\quad\langle\boldsymbol{\lambda},{\boldsymbol{c}}_{ext}\rangle (4.4)
s.t. 𝝀S(𝝀x,d)SN.\displaystyle\mbox{s.t. }\quad\boldsymbol{\lambda}\in S(\boldsymbol{\lambda}_{x},d)\cap S_{N}.

We note that (4.4) can be efficiently solved by Alg. 1. Consequently, SLMO𝒫\mbox{SLMO}_{\mathcal{P}} can also be efficiently solved provided an element in (𝒙)\mathcal{M}({\boldsymbol{x}}) can be cheaply obtained. The detailed steps are outlined in Alg. 4.

Algorithm 4 SLMO𝒫(𝒙,d,𝒄,𝝀x)\mbox{SLMO}_{\mathcal{P}}({\boldsymbol{x}},d,{\boldsymbol{c}},\boldsymbol{\lambda}_{x})
0:  point 𝒙𝒫{\boldsymbol{x}}\in\mathcal{P} with 𝝀x(𝒙)\boldsymbol{\lambda}_{x}\in\mathcal{M}({\boldsymbol{x}}), linear objective 𝒄n{\boldsymbol{c}}\in\mathbb{R}^{n}, radius d>0d>0.
1:  d^i=1Nmin{λx(i),d}n+1\widehat{d}\leftarrow\frac{\sum_{i=1}^{N}\min\{\lambda_{x}(i),d\}}{n+1}
2:  𝝀^𝝀xmin{𝝀x,d𝟏N}+d^𝟏N\widehat{\boldsymbol{\lambda}}\leftarrow\boldsymbol{\lambda}_{x}-\min\{\boldsymbol{\lambda}_{x},d{\boldsymbol{1}}_{N}\}+\widehat{d}{\boldsymbol{1}}_{N}
3:  𝒚+i=1N(λ^id^)𝒗i{\boldsymbol{y}}_{+}\leftarrow\sum_{i=1}^{N}(\widehat{\lambda}_{i}-\widehat{d}){\boldsymbol{v}}_{i}
4:  𝒗iargmin𝒗𝒫𝒗,𝒄{\boldsymbol{v}}_{i^{*}}\leftarrow\operatorname*{\arg\min}_{{\boldsymbol{v}}\in\mathcal{P}}\langle{\boldsymbol{v}},{\boldsymbol{c}}\rangle
5:  𝒚𝒚++(n+1)d^𝒗i{\boldsymbol{y}}^{*}\leftarrow{\boldsymbol{y}}_{+}+(n+1)\widehat{d}{\boldsymbol{v}}_{i^{*}}
Output: 𝒚{\boldsymbol{y}}^{*}.

One can observe that SLMO𝒫\mbox{SLMO}_{\mathcal{P}}-2—corresponding to Lines 4-5 in Alg. 4 and serving as the oracle in our subsequent Alg. 6—requires only one more vector addition and one extra scalar-vector multiplication compared to the standard LMO.
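For concreteness, a minimal NumPy sketch of Alg. 4 is given below; the names VV (an n×Nn\times N matrix whose columns are the vertices) and lmo_index (a routine returning i=argmini𝒗i,𝒄i^{*}=\operatorname*{\arg\min}_{i}\langle{\boldsymbol{v}}_{i},{\boldsymbol{c}}\rangle) are ours. In practice one would exploit the fact that, under the representation of Remark 6, 𝝀x\boldsymbol{\lambda}_{x} has at most n+1n+1 nonzero entries, so the dense matrix VV never needs to be formed.

```python
import numpy as np

def slmo_P(V, lam_x, c, d, lmo_index):
    """Sketch of SLMO_P (Alg. 4).  V is n-by-N with the vertices v_i as columns,
    lam_x is some representation of x in M(x), c is the linear objective,
    d > 0 is the radius, and lmo_index(c) returns argmin_i <v_i, c>."""
    n, N = V.shape
    clipped = np.minimum(lam_x, d)                    # min{lambda_x, d 1_N}
    d_hat = clipped.sum() / (n + 1)                   # Line 1 of Alg. 4
    lam_hat = lam_x - clipped + d_hat                 # Line 2
    y_plus = V @ (lam_hat - d_hat)                    # Line 3
    # Lines 4-5 form the SLMO_P-2 phase, the only part repeated in Alg. 6:
    i_star = lmo_index(c)                             # Line 4: standard LMO over P
    return y_plus + (n + 1) * d_hat * V[:, i_star]    # Line 5
```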

We summarize the optimality of Alg. 4 in the following result, which is a direct consequence of Lemma 3.3.

Lemma 4.2.

Algorithm SLMO𝒫(𝐱,d,𝐜,𝛌x)\mbox{SLMO}_{\mathcal{P}}({\boldsymbol{x}},d,{\boldsymbol{c}},\boldsymbol{\lambda}_{x}) returns an optimal solution 𝐲{\boldsymbol{y}}^{*} for the problem:

𝒚argmin{𝒄,𝒚|𝒚S𝒫(𝒙,d)𝒫}.{\boldsymbol{y}}^{*}\in\operatorname*{\arg\min}\left\{\langle{\boldsymbol{c}},{\boldsymbol{y}}\rangle\ |\ {\boldsymbol{y}}\in S_{\mathcal{P}}({\boldsymbol{x}},d)\cap\mathcal{P}\right\}.
Remark 6.

(Carathéodory’s Representation Assumption) By Carathéodory’s
Representation Theorem [28, Thm. 17.1], for any point 𝒙𝒫{\boldsymbol{x}}\in\mathcal{P}, there exists 𝝀x(𝒙)\boldsymbol{\lambda}_{x}\in\mathcal{M}({\boldsymbol{x}}) such that |+(𝝀x)|n+1|\mathcal{I}_{+}(\boldsymbol{\lambda}_{x})|\leq n+1 where +(𝝀x):={i[N]λx(i)>0}\mathcal{I}_{+}(\boldsymbol{\lambda}_{x}):=\{i\in[N]\mid\lambda_{x}(i)>0\}. As demonstrated in the illustrative examples in Supplement C.1, this representation can be easily implemented for common types of 𝒫\mathcal{P}. In this representation, the running time of SLMO𝒫\mbox{SLMO}_{\mathcal{P}} does not explicitly depend on the number of vertices NN, but rather on the natural dimension of 𝒫\mathcal{P}, that is, nn. For the analysis in the subsequent sections, we assume without loss of generality that the selected 𝝀x\boldsymbol{\lambda}_{x} always satisfies |+(𝝀x)|n+1|\mathcal{I}_{+}(\boldsymbol{\lambda}_{x})|\leq n+1.
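As an illustration, the representation is immediate for the unit simplex (take 𝝀x=𝒙\boldsymbol{\lambda}_{x}={\boldsymbol{x}}), and one possible construction for the unit 1\ell_{1}-ball, whose 2n2n vertices are ±𝒆i\pm{\boldsymbol{e}}_{i}, is sketched below. This construction is ours and need not coincide with the one in Supplement C.1.

```python
import numpy as np

def caratheodory_l1_ball(x):
    """One possible sparse convex-combination representation of x in the unit
    l1-ball over its 2n vertices {+e_i, -e_i}.  Coordinate i of lam weights +e_i
    and coordinate n+i weights -e_i; the support has at most n+1 nonzeros."""
    n = x.size
    assert np.abs(x).sum() <= 1 + 1e-12
    lam = np.zeros(2 * n)
    lam[:n] = np.maximum(x, 0.0)        # weight on +e_i
    lam[n:] = np.maximum(-x, 0.0)       # weight on -e_i
    slack = 1.0 - np.abs(x).sum()       # remaining mass if x is in the interior
    lam[0] += slack / 2.0               # split the slack between +e_1 and -e_1
    lam[n] += slack / 2.0               # (these contributions cancel in V @ lam)
    return lam                          # lam >= 0, lam.sum() == 1, V @ lam == x
```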

The following lemma demonstrates some useful properties of S𝒫S_{\mathcal{P}} and SLMO𝒫\mbox{SLMO}_{\mathcal{P}}, which can be regarded as a generalization of Lemma 3.1(5) and is crucial for proving the convergence of our algorithm in the next subsection. The detailed proof can be found in Supplement B.

Lemma 4.3.

Given 𝐱𝒫{\boldsymbol{x}}\in\mathcal{P}, d>0d>0 and 𝐲SLMO𝒫(𝐱,d,𝐜,𝛌x){\boldsymbol{y}}^{*}\in\mbox{SLMO}_{\mathcal{P}}({\boldsymbol{x}},d,{\boldsymbol{c}},\boldsymbol{\lambda}_{x}), for any point 𝐲𝒫{\boldsymbol{y}}\in\mathcal{P} satisfying 𝐱𝐲dDη\|{\boldsymbol{x}}-{\boldsymbol{y}}\|\leq\frac{dD}{\eta}, it follows that 𝐲S𝒫(𝐱,d){\boldsymbol{y}}\in S_{\mathcal{P}}({\boldsymbol{x}},d) and 𝐜,𝐲𝐜,𝐲\langle{\boldsymbol{c}},{\boldsymbol{y}}^{*}\rangle\leq\langle{\boldsymbol{c}},{\boldsymbol{y}}\rangle. Furthermore, we have 𝐱𝐲(n+1)dD\|{\boldsymbol{x}}-{\boldsymbol{y}}^{*}\|\leq(n+1)dD.

We would also like to note that, although similar to LLOO, SLMO𝒫(𝒙,d,𝒄,𝝀x)\mbox{SLMO}_{\mathcal{P}}({\boldsymbol{x}},d,{\boldsymbol{c}},\boldsymbol{\lambda}_{x}) is not exactly an LLOO, because Lemma 4.3 only proves that 𝔹(𝒙,(D/η)d)S𝒫(𝒙,d)\mathbb{B}({\boldsymbol{x}},(D/\eta)d)\subseteq S_{\mathcal{P}}({\boldsymbol{x}},d), rather than 𝔹(𝒙,d)S𝒫(𝒙,d)\mathbb{B}({\boldsymbol{x}},d)\subseteq S_{\mathcal{P}}({\boldsymbol{x}},d), which would be sufficient for the first property of LLOO. We note that D/η=ξ/ψD/\eta=\xi/\psi; therefore, the condition ξ/ψ1\xi/\psi\geq 1 would be enough for SLMO𝒫(𝒙,d,𝒄,𝝀x)\mbox{SLMO}_{\mathcal{P}}({\boldsymbol{x}},d,{\boldsymbol{c}},\boldsymbol{\lambda}_{x}) to be an LLOO.

4.2 SFW𝒫\mbox{SFW}_{\mathcal{P}}: Simplex Frank-Wolfe for Arbitrary Polytopes

In this subsection, we extend the SFW to the polytope case. The generalized SFW algorithm is presented as follows.

Algorithm 5 SFW𝒫\mbox{SFW}_{\mathcal{P}}: Simplex Frank-Wolfe Method for Polytope 𝒫\mathcal{P}
0:  𝒙0𝒫{\boldsymbol{x}}_{0}\in\mathcal{P}, initial lower bound B0B_{0}, condition number η\eta and diameter DD of 𝒫\mathcal{P}.
1:  Set: d02(f(𝒙0)B0)μ,𝝀0(𝒙0).d_{0}\leftarrow\sqrt{\frac{2(f({\boldsymbol{x}}_{0})-B_{0})}{\mu}},\boldsymbol{\lambda}_{0}\in\mathcal{M}({\boldsymbol{x}}_{0}).
2:  for k=1,k=1,\dots do
3:   Compute 𝒚kSLMO𝒫(𝒙k1,ηDdk1,f(𝒙k1),𝝀k1){\boldsymbol{y}}_{k}\in\mbox{SLMO}_{\mathcal{P}}({{\boldsymbol{x}}}_{k-1},\frac{\eta}{D}{d}_{k-1},\nabla f({\boldsymbol{x}}_{k-1}),\boldsymbol{\lambda}_{k-1}).
4:   Compute the working lower bound: Bkwf(𝒙k1)+f(𝒙k1),𝒚k𝒙k1B_{k}^{w}\leftarrow f({\boldsymbol{x}}_{k-1})+\langle\nabla f({\boldsymbol{x}}_{k-1}),{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rangle.
5:   Update best bound Bkmax{Bk1,Bkw}B_{k}\leftarrow\max\{B_{k-1},B_{k}^{w}\}.
6:   Set 𝒙k(1δk)𝒙k1+δk𝒚k{\boldsymbol{x}}_{k}\leftarrow(1-\delta_{k}){\boldsymbol{x}}_{k-1}+\delta_{k}{\boldsymbol{y}}_{k} for some δk[0,1]\delta_{k}\in[0,1].
7:   Set: dk2(f(𝒙k)Bk)μ,𝝀k(𝒙k)d_{k}\leftarrow\sqrt{\frac{2(f({\boldsymbol{x}}_{k})-B_{k})}{\mu}},\boldsymbol{\lambda}_{k}\in\mathcal{M}({\boldsymbol{x}}_{k}).
8:  end for
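Before discussing its convergence, we give a compact Python sketch of this outer loop. Here slmo is an abstract callable following the argument list of SLMO𝒫(𝒙,d,𝒄,𝝀x)\mbox{SLMO}_{\mathcal{P}}({\boldsymbol{x}},d,{\boldsymbol{c}},\boldsymbol{\lambda}_{x}), member_rep returns some element of (𝒙)\mathcal{M}({\boldsymbol{x}}), and the exact line search is replaced by a coarse grid search purely for illustration; all names are ours.

```python
import numpy as np

def sfw_P(f, grad_f, slmo, member_rep, x0, B0, mu, eta, D, iters=100):
    """Sketch of SFW_P (Alg. 5); slmo(x, d, c, lam) mirrors SLMO_P(x, d, c, lambda_x)."""
    x, B = x0.copy(), B0
    d = np.sqrt(max(2.0 * (f(x) - B) / mu, 0.0))     # Line 1
    lam = member_rep(x)                              # some element of M(x)
    for _ in range(iters):
        g = grad_f(x)
        y = slmo(x, (eta / D) * d, g, lam)           # Line 3: SLMO over the simplex ball
        B = max(B, f(x) + g @ (y - x))               # Lines 4-5: update the lower bound
        # Line 6: step size; a coarse grid stands in for exact line search here.
        grid = np.linspace(0.0, 1.0, 101)
        delta = min(grid, key=lambda t: f((1 - t) * x + t * y))
        x = (1 - delta) * x + delta * y
        d = np.sqrt(max(2.0 * (f(x) - B) / mu, 0.0)) # Line 7
        lam = member_rep(x)
    return x, B
```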

Notice that when 𝒫\mathcal{P} degenerates to SnS_{n}, Alg. 5 differs only slightly from its simplex counterpart at line 7, since η=D=2\eta=D=\sqrt{2} for SnS_{n}. The convergence of Alg. 5 stated below is proved in Supplement B.

Theorem 4.4.

Let {𝐱k}\{{\boldsymbol{x}}_{k}\} be the sequence generated by Alg. 5 with the step size policy for {δk}\{\delta_{k}\} given in (2.3), (2.4), or the simple step size

δk=μ2L(n+1)2η2.\delta_{k}=\frac{\mu}{2L(n+1)^{2}\eta^{2}}. (4.5)

Then, for k0k\geq 0, we have

f(𝒙k)ff(𝒙k)Bkμd022eμ4Lη2(n+1)2k.f({\boldsymbol{x}}_{k})-f^{*}\leq f({\boldsymbol{x}}_{k})-B_{k}\leq\frac{\mu d_{0}^{2}}{2}e^{-\frac{\mu}{4L\eta^{2}(n+1)^{2}}k}. (4.6)

4.3 rSFW𝒫\mbox{rSFW}_{\mathcal{P}}: Refining SFW𝒫\mbox{SFW}_{\mathcal{P}}

As with the motivation for rSFW for the Simplex case, once we constructed the Simplex ball for 𝒫\mathcal{P}, we may compute an approximate solution:

{\boldsymbol{p}}_{k}\approx\operatorname*{\arg\min}\;f({\boldsymbol{p}})\quad\mbox{s.t.}\quad{\boldsymbol{p}}\in S_{\mathcal{P}}({\boldsymbol{x}}_{k-1},d_{k-1})\cap\mathcal{P},

with the initial point 𝒑0=𝒙k1{\boldsymbol{p}}_{0}={\boldsymbol{x}}_{k-1}. The motivation is based on a similar observation: SFW𝒫\mbox{SFW}_{\mathcal{P}} can be split into two independent parts, with the first part, constructing the Simplex ball, accounting for the major computation. Hence, once such a ball is constructed, we run a few more cheap SLMO steps over it. Once again, other methods such as Away-step FW and Pairwise FW can be used for computing 𝒑k{\boldsymbol{p}}_{k}. The generalized rSFW algorithm is presented as follows.

Algorithm 6 rSFW𝒫\mbox{rSFW}_{\mathcal{P}}: Refined Simplex Frank-Wolfe Method for Polytope 𝒫\mathcal{P}
0:  𝒙0𝒫{\boldsymbol{x}}_{0}\in\mathcal{P}, 𝝀0(𝒙0)\boldsymbol{\lambda}_{0}\in\mathcal{M}({\boldsymbol{x}}_{0}), radius contraction ratio ρ>1\rho>1, initial lower bound B0B_{0}, condition number η\eta and diameter DD of 𝒫\mathcal{P}.
1:  Set: d0ηD2(f(𝒙0)B0)μ,J4ρ2(n+1)2η2Lμd_{0}\leftarrow\frac{\eta}{D}\sqrt{\frac{2(f({\boldsymbol{x}}_{0})-B_{0})}{\mu}},J\leftarrow\frac{4\rho^{2}(n+1)^{2}\eta^{2}L}{\mu}.
2:  for k=1,k=1,\dots do
3:   Set: 𝒑0𝒙k1,C0Bk1{\boldsymbol{p}}_{0}\leftarrow{\boldsymbol{x}}_{k-1},C_{0}\leftarrow B_{k-1}.
4:   (SLMO𝒫\mbox{SLMO}_{\mathcal{P}}-1) Compute 𝝀^k1\widehat{\boldsymbol{\lambda}}_{k-1} and d^k1\widehat{d}_{k-1} such that S(𝝀^k1,d^k1)=S(𝝀k1,dk1)SNS(\widehat{\boldsymbol{\lambda}}_{k-1},\widehat{d}_{k-1})=S({{\boldsymbol{\lambda}}}_{k-1},{d}_{k-1})\cap S_{N}.
5:   for j=1,,Jj=1,\dots,J do
6:    (SLMO𝒫\mbox{SLMO}_{\mathcal{P}}-2) Compute 𝒚jSLMO𝒫(𝒙k1,dk1,f(𝒑j1),𝝀k1){\boldsymbol{y}}_{j}\in\mbox{SLMO}_{\mathcal{P}}({\boldsymbol{x}}_{k-1},d_{k-1},\nabla f({\boldsymbol{p}}_{j-1}),\boldsymbol{\lambda}_{k-1}).
7:    Set: Cjwf(𝒑j1)+f(𝒑j1),𝒚j𝒑j1C_{j}^{w}\leftarrow f({\boldsymbol{p}}_{j-1})+\langle\nabla f({\boldsymbol{p}}_{j-1}),{\boldsymbol{y}}_{j}-{\boldsymbol{p}}_{j-1}\rangle.
8:    Update best bound Cjmax{Cj1,Cjw}C_{j}\leftarrow\max\{C_{j-1},C_{j}^{w}\}.
9:    if f(𝒑j)Cjμ2ρ2η2dk12D2f({\boldsymbol{p}}_{j})-C_{j}\leq\frac{\mu}{2\rho^{2}\eta^{2}}{d}_{k-1}^{2}D^{2} then
10:     Break out of the inner loop.
11:    end if
12:    Set 𝒑j(1δj)𝒑j1+δj𝒚j{\boldsymbol{p}}_{j}\leftarrow(1-\delta_{j}){\boldsymbol{p}}_{j-1}+\delta_{j}{\boldsymbol{y}}_{j} for some δj[0,1]\delta_{j}\in[0,1].
13:   end for
14:   Set: 𝒙k𝒑j{\boldsymbol{x}}_{k}\leftarrow{\boldsymbol{p}}_{j}, dkdk1ρd_{k}\leftarrow\frac{{d}_{k-1}}{\rho}, BkCjB_{k}\leftarrow C_{j} and 𝝀k(𝒙k)\boldsymbol{\lambda}_{k}\in\mathcal{M}({\boldsymbol{x}}_{k}).
15:  end for

Similar to Theorem 3.5, we can provide the following convergence analysis for Alg. 6, which is proven in Supplement B.

Theorem 4.5.

Let {𝐱k}\{{\boldsymbol{x}}_{k}\} be the sequence generated by Alg. 6 with the step size policy for {δj}\{\delta_{j}\} in (2.2)-(2.4). Then, for k1k\geq 1, we have

f(𝒙k)ff(𝒙k)Bk(f(𝒙0)B0)ρ2k.f({\boldsymbol{x}}_{k})-f^{*}\leq f({\boldsymbol{x}}_{k})-B_{k}\leq(f({\boldsymbol{x}}_{0})-B_{0})\rho^{-2k}.
Remark 7.

(Adaptive Lower Bound Update) We estimate the lower bound of ff^{*} by f(𝒙k1)+f(𝒙k1),𝒚k𝒙k1f({\boldsymbol{x}}_{k-1})+\langle\nabla f({\boldsymbol{x}}_{k-1}),{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rangle for the Simplex Frank-Wolfe method and its refined version. In fact, when the objective function exhibits specific structural properties, we can derive an additional lower bound BkoB_{k}^{o} and update the best bound BkB_{k} as Bkmax{Bk1,Bkw,Bko}B_{k}\leftarrow\max\{B_{k-1},B_{k}^{w},B_{k}^{o}\}. For instance, when the objective function has a min-max structure, we can construct a min-max lower bound BkoB_{k}^{o} for ff^{*}; see [11] for a detailed analysis. Moreover, in certain application scenarios, there may be exact information about the optimal value ff^{*}, such as in linear regression or machine learning tasks where it is known a priori that the optimal value of the loss function is 0. In such cases, it is straightforward to set BkofB_{k}^{o}\leftarrow f^{*}.

Remark 8.

(Robustness to Parameter Estimation) Our algorithms rely on parameters L,μ,η,DL,\mu,\eta,D. In practice, using overestimates L,η,DL^{\prime},\eta^{\prime},D^{\prime} and an underestimate μ\mu^{\prime} such that LηDμLηDμ=O(1)\frac{L^{\prime}\eta^{\prime}D^{\prime}\mu}{L\eta D\mu^{\prime}}=O(1) only increases the bounds by a constant factor. Moreover, as shown in Supplement C.2, both η\eta and DD can be efficiently estimated for common 𝒫\mathcal{P}. For LL and μ\mu, one can use the backtracking strategy from [26] to estimate their local values and compute adaptive short step sizes; see Subsection 5.4 for details.

5 Numerical Experiments

In this section, we present numerical experiments to evaluate the efficiency, convergence, and adaptability of the proposed methods. All tests were performed using MATLAB R2022b on a Windows laptop equipped with a 14-core Intel(R) Core(TM) 2.30GHz CPU and 16GB of RAM.

We address four tasks. (T1) We first assess the computational efficiency of SLMO and SLMO-2 across four representative polytopes, consolidating their role as the workhorse of our SFW methods. (T2) We illustrate the linear convergence behavior of SFW and rSFW using two numerical experiments. (T3) We show that our methods can be enhanced with a backtracking strategy to eliminate the need for predefined values of the parameters LL and μ\mu. (T4) We demonstrate how integrating the away-step variants of the Frank-Wolfe method (AFW and PFW) into the rSFW framework significantly enhances its performance, outperforming the original AFW and PFW methods. These four tasks are addressed in the following four subsections.

5.1 Efficiency of SLMO and SLMO-2

In this subsection, we evaluate the performance of the proposed SLMO and SLMO-2 through comparative experiments on four common polytopes 𝒫\mathcal{P}: (a) Unit simplex; (b) Hypercube; (c) 1\ell_{1}-ball; and (d) Flow polytope, derived from the video co-localization problem in [21].

Table 1: Description of projection and five LMO variants used in the numerical comparison. These six methods share the same randomly generated parameters: 𝒄𝒩(𝟎,In),𝒙𝒫{\boldsymbol{c}}\sim\mathcal{N}({\boldsymbol{0}},I_{n}),{\boldsymbol{x}}\in\mathcal{P} and d𝒰[0,1]d\in\mathcal{U}_{[0,1]}.
Algorithm Formulation Description
Projection argmin𝒚𝒫𝒚𝒛2\operatorname*{\arg\min}_{{\boldsymbol{y}}\in\mathcal{P}}\|{\boldsymbol{y}}-{\boldsymbol{z}}\|^{2} The projection onto the polytope 𝒫\mathcal{P}, and 𝒛𝒩(𝟎,In){\boldsymbol{z}}\sim\mathcal{N}({\boldsymbol{0}},I_{n}) is a randomly generated point. We implement projections onto the Simplex and 1\ell_{1}-ball using the method from [8, Fig. 2], while the projection onto the hypercube is straightforward. Although a closed-form solution exists for projection onto the flow polytope [29, Thm. 20], its computational complexity of O(m3n+n2)O(m^{3}n+n^{2}) makes it significantly more expensive than other LMO variants.
LMO argmin𝒚𝒫𝒚,𝒄\operatorname*{\arg\min}_{{\boldsymbol{y}}\in\mathcal{P}}\langle{\boldsymbol{y}},{\boldsymbol{c}}\rangle The standard linear minimization oracle.
1\ell_{1}-LMO \operatorname*{\arg\min}_{{\boldsymbol{y}}\in\{V\boldsymbol{\lambda}\,\mid\,\boldsymbol{\lambda}\in B_{1}(\boldsymbol{\lambda}_{x},d)\cap S_{N}\}}\langle{\boldsymbol{y}},{\boldsymbol{c}}\rangle The 1\ell_{1}-norm constrained LMO, Alg. 3 and Alg. 4 in [14]. Here, VV consists of columns of 𝒗𝒱(𝒫){\boldsymbol{v}}\in\mathcal{V}(\mathcal{P}), and the computation of 𝝀x(𝒙)\boldsymbol{\lambda}_{x}\in\mathcal{M}({\boldsymbol{x}}) is included in the timing.
NEP \operatorname*{\arg\min}_{{\boldsymbol{y}}\in\mathcal{V}(\mathcal{P})}\langle{\boldsymbol{y}},{\boldsymbol{c}}\rangle+\lambda\lVert{\boldsymbol{y}}-{\boldsymbol{x}}\rVert^{2} Nearest extreme point oracle in [15]. Here, λ𝒰[0,10000]\lambda\sim\mathcal{U}_{[0,10000]} is a randomly generated positive number.
SLMOP \operatorname*{\arg\min}_{{\boldsymbol{y}}\in\{V\boldsymbol{\lambda}\,\mid\,\boldsymbol{\lambda}\in S(\boldsymbol{\lambda}_{x},d)\cap S_{N}\}}\langle{\boldsymbol{y}},{\boldsymbol{c}}\rangle Our proposed Simplex Linear Minimization Oracle (Alg. 1 and Alg. 4). Here, the computation of 𝝀x(𝒙)\boldsymbol{\lambda}_{x}\in\mathcal{M}({\boldsymbol{x}}) is included in the timing.
SLMOP-2 \operatorname*{\arg\min}_{{\boldsymbol{y}}\in\{V\boldsymbol{\lambda}\,\mid\,\boldsymbol{\lambda}\in S(\widehat{\boldsymbol{\lambda}}_{x},\widehat{d})\}}\langle{\boldsymbol{y}},{\boldsymbol{c}}\rangle The latter phase of SLMO, consisting of Lines 4-5 of Alg. 1 and Alg. 4. Here, S(𝝀^x,d^)=S(𝝀x,d)SNS(\widehat{\boldsymbol{\lambda}}_{x},\widehat{d})=S(\boldsymbol{\lambda}_{x},d)\cap S_{N} is precomputed and not included in the timing.
[Figure 2: four panels (a)-(d)]
Figure 2: Comparison of solving Projection, LMO, 1\ell_{1}-LMO, NEP, SLMO, and SLMO-2 over the following polytopes: (a) Unit Simplex; (b) Unit Hypercube; (c) Unit 1\ell_{1}-ball; and (d) Flow polytope. All the results are averaged over 20 i.i.d runs. We omit the projection onto the flow polytope due to its prohibitively high computational cost.

We consider six different methods, including projection (a key subproblem in projection/proximal based methods) and five variants of LMO, as detailed in Table 1. These six methods share the same randomly generated 𝒄𝒩(𝟎,In),𝒙𝒫{\boldsymbol{c}}\sim\mathcal{N}({\boldsymbol{0}},I_{n}),{\boldsymbol{x}}\in\mathcal{P} and d𝒰[0,1]d\in\mathcal{U}_{[0,1]}. Figure 2 illustrates the relationship between running time and dimensionality for the six methods. Additionally, Table 2 reports the proportion of time spent on LMO calls within the SLMO-2 algorithm. We draw the following observations from these results:

  • 1\ell_{1}-LMO: Across all four polytopes, the 1\ell_{1}-LMO method incurs the highest computational overhead, surpassing even that of the projection-based methods.

  • SLMOP-2 vs. SLMOP: The overhead of SLMO-2 is significantly lower than that of SLMO. In most cases, as shown in Table 2, its overhead closely matches that of the LMO itself. This indicates that our proposed rSFW and rSFWP achieve a per-iteration cost comparable to that of the standard Frank-Wolfe algorithm.

  • NEP: While NEP demonstrates very low runtime overhead, it is important to note that the corresponding Frank-Wolfe variant, NEP-FW, converges only sublinearly as shown in Subsection 5.2.

  • 1\ell_{1}-LMO and SLMOP on the Flow Polytope: Both methods exhibit rising overhead with increasing dimension, mainly due to the cost of computing the Carathéodory representation 𝝀x(𝒙)\boldsymbol{\lambda}_{x}\in\mathcal{M}({\boldsymbol{x}}), which dominates the runtime.

Table 2: Time overhead of LMO calls as a percentage of total computation time when using the SLMO-2 algorithm across four different polytopes.
Simplex Hypercube 1\ell_{1}-ball Flow polytope
(n=107n=10^{7}) (n=107n=10^{7}) (n=107n=10^{7}) (n=3×104n=3\times 10^{4})
Time(LMO)Time(SLMO-2)\frac{\mbox{Time(LMO)}}{\mbox{Time(SLMO-2)}} 98.8%98.8\% 46.0%46.0\% 52.6%52.6\% 99.7%99.7\%

5.2 Linear Convergence of SFW and rSFW

Table 3: Description of Frank-Wolfe variants used in the numerical comparison.
Algorithm Description
FW (simple/Ada) Frank-Wolfe with simple step size δk=2/(k+1)\delta_{k}=2/(k+1). The ‘Ada’ variant employs a backtracking step (5.1) prior to updating the iterate, in order to estimate the local parameters LL and μ\mu, thereby enabling an adaptive short step size. We set τ1=2\tau_{1}=2 and τ2=0.9\tau_{2}=0.9 in Alg. 7, and apply the same configuration to the subsequent ‘Ada’ variants.
SFW/SFWP (line-search) Simplex Frank-Wolfe with exact line-search (Alg. 2 and Alg. 5). For the 1\ell_{1}-constrained least squares problem, we set μ=2λmin(AA)\mu=2\lambda_{min}(A^{\prime}A), D=2D=2 and η=n\eta=\sqrt{n}. Although Supplement C.2 estimates ηn\eta\leq n for the 1\ell_{1}-ball, this setting does not hinder the algorithm's linear convergence and demonstrates strong practical performance. For the video co-localization task, we set μ=λmin(A)\mu=\lambda_{min}(A) and η=D=66\eta=D=\sqrt{66}; see Supplement C.2 for details. For the Simplex-constrained least squares problem, we set μ=2λmin(AA)\mu=2\lambda_{min}(A^{\prime}A).
NEP-FW (simple) Frank-Wolfe with Nearest Extreme Point Oracle, with theoretical step size 2/(k+1)2/(k+1) [15, Alg. 1]. We omitted Line 5, as it showed no noticeable effect on performance.
rSFW/rSFWP (simple) Refined Simplex Frank-Wolfe with simple step size δj=2/(j+1)\delta_{j}=2/(j+1) (Alg. 3 and Alg. 6). We utilize the warm-start strategy with ρ=2\rho^{\prime}=2 as mentioned in Remark 4. The parameters μ,D\mu,D and η\eta are set the same as in SFW/SFWP. Additionally, we set ρ=1.01,L=2λmax(AA)\rho=1.01,L=2\lambda_{max}(A^{\prime}A) for the 1\ell_{1}/Simplex-constrained least squares problems, and ρ=1.01,L=λmax(A)\rho=1.01,L=\lambda_{max}(A) for the video co-localization problem.
PFW (line-search) Pairwise Frank-Wolfe with exact line-search [22, Alg. 2].
AFW (line-search) Away-steps Frank-Wolfe with exact line-search [22, Alg. 1].
rSFW-P (line-search) The Refined Simplex Frank-Wolfe framework enhanced with Pairwise technique. Specifically, we incorporate (5.3) after Line 6 in Alg. 3 and replace Line 12 with (5.4).
rSFW-A (line-search) The Refined Simplex Frank-Wolfe framework enhanced with Away-steps technique. Specifically, we incorporate (5.2) after Line 6 in Alg. 3 and replace Line 12 with (5.4).

We demonstrate the linear convergence of our proposed methods, SFW and rSFW, through two numerical experiments. These methods are compared against the standard Frank-Wolfe (FW) algorithm, its variant NEP-FW (as the code for NEP-FW is not publicly available, we implemented it ourselves), and two well-known variants, Away-step FW (AFW) and Pairwise FW (PFW), whose implementations are available at https://github.com/Simon-Lacoste-Julien/linearFW; all methods are summarized in Table 3.

To evaluate algorithmic performance, we adopt the Frank-Wolfe gap defined by f(𝒙k),𝒙k𝒚k+1,\langle\nabla f({\boldsymbol{x}}_{k}),{\boldsymbol{x}}_{k}-{\boldsymbol{y}}_{k+1}\rangle, where 𝒙k{\boldsymbol{x}}_{k} is the kk-th iterate and 𝒚k+1{\boldsymbol{y}}_{k+1} denotes the solution returned by the respective LMO variant at that iteration. This FW gap provides a valid upper bound on the primal gap, i.e., f(𝒙k)ff(𝒙k),𝒙k𝒚k+1f({\boldsymbol{x}}_{k})-f^{*}\leq\langle\nabla f({\boldsymbol{x}}_{k}),{\boldsymbol{x}}_{k}-{\boldsymbol{y}}_{k+1}\rangle, and can thus be used as a practical stopping criterion. (For NEP-FW, however, this inequality does not hold in general for 𝒚k+1=NEP(𝒙k)=LMO(f(𝒙k)λk𝒙k,𝒫){\boldsymbol{y}}_{k+1}=\text{NEP}({\boldsymbol{x}}_{k})=\text{LMO}(\nabla f({\boldsymbol{x}}_{k})-\lambda_{k}{\boldsymbol{x}}_{k},\mathcal{P}). Therefore, to ensure a consistent and fair comparison, we use the standard FW gap to evaluate NEP-FW as well.)
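In code, this stopping test amounts to a single inner product per iteration; the snippet below (names ours) illustrates it. The gap can then be compared against a prescribed tolerance to decide when to stop.

```python
def fw_gap(grad_x, x, y_next):
    """Frank-Wolfe gap <grad f(x_k), x_k - y_{k+1}>, an upper bound on f(x_k) - f*."""
    return grad_x @ (x - y_next)
```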

[Figure 3: four panels (a)-(d)]
Figure 3: FW gap vs time/iterations.

The first experiment involves an 1\ell_{1}-constrained least squares regression, that is, min𝒙11A𝒙𝒃22\min_{\|{\boldsymbol{x}}\|_{1}\leq 1}\lVert A{\boldsymbol{x}}-{\boldsymbol{b}}\rVert_{2}^{2}, where Am×nA\in\mathbb{R}^{m\times n} with m=400,n=100m=400,n=100, and the entries of AA are drawn from a standard Gaussian distribution. We set 𝒃=A𝒙{\boldsymbol{b}}=A{\boldsymbol{x}}^{*}, where 𝒙{\boldsymbol{x}}^{*} is constructed by first generating a random vector with sparsity parameter s=0.7s=0.7, followed by normalization to lie on the boundary of the 1\ell_{1}-ball. Thus, the optimal value of this problem is 0. We use the same initial point 𝒙0=𝟎n{\boldsymbol{x}}_{0}={\boldsymbol{0}}_{n} for all methods.
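A sketch of this data generation in Python with NumPy is given below; interpreting the sparsity parameter s=0.7s=0.7 as the expected fraction of nonzero entries of 𝒙{\boldsymbol{x}}^{*} is our own reading of the description above, and the routine name is ours.

```python
import numpy as np

def make_l1_instance(m=400, n=100, s=0.7, seed=0):
    """Generate A, b for min_{||x||_1 <= 1} ||Ax - b||_2^2 with optimal value 0."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n))
    x_star = rng.standard_normal(n)
    x_star *= (rng.random(n) < s)     # keep roughly a fraction s of the entries
    x_star /= np.abs(x_star).sum()    # normalize onto the boundary of the l1-ball
    b = A @ x_star
    return A, b, x_star
```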

The second experiment involves a convex quadratic problem over the flow polytope, derived from the video co-localization task introduced by [21]. The problem is formulated as \min_{{\boldsymbol{x}}\in\mathcal{F}_{s,t}}\frac{1}{2}{\boldsymbol{x}}^{\prime}A{\boldsymbol{x}}+{\boldsymbol{b}}^{\prime}{\boldsymbol{x}}, where An×nA\in\mathbb{R}^{n\times n} is a positive definite matrix, 𝒃n{\boldsymbol{b}}\in\mathbb{R}^{n}, and s,t\mathcal{F}_{s,t} represents the s-t flow polytope. We used the same dataset and initial point as in [22, 15]. The problem has a dimension of n=660n=660.

The results are presented in Figure 3. We make some comments below.

  • Linear convergence of SFWP and rSFWP: Both \mbox{SFW}_{\mathcal{P}} and \mbox{rSFW}_{\mathcal{P}} show linear convergence, confirming our theoretical guarantees.

  • Superior efficiency of rSFWP: In both experiments, our proposed \mbox{rSFW}_{\mathcal{P}} method significantly outperforms all other algorithms, including the well-established AFW and PFW, in terms of running time.

  • Limitations of NEP-FW: Although NEP performs well in iteration complexity, its NEP-FW variant converges sublinearly and is slightly slower than standard FW.

  • Time inefficiency in video co-localization: In the video co-localization task, while \mbox{SFW}_{\mathcal{P}}, AFW, and PFW converge quickly by iteration count, their runtime is slower due to overhead: the Carathéodory computation for \mbox{SFW}_{\mathcal{P}}, and the growing active sets for AFW and PFW.

5.3 SFW/rSFW with Backtracking

We further show that our methods can be enhanced with the backtracking technique proposed by [26], thereby eliminating the need to manually specify the parameters LL and μ\mu. While [26, Alg. 2] does not specify how to estimate the strong convexity constant μ\mu, we outline our approach to estimating both LL and μ\mu as well as determining an adaptive step size; see Supplement D for the detailed algorithm. As an example, consider the kk-th iteration of SFWP. Before updating 𝒙k{\boldsymbol{x}}_{k}, we perform the following backtracking step:

δk,Lk,μkBacktracking-Routine(𝒙k1,𝒚k𝒙k1,Lk1,μk1,1).\delta_{k},\ L_{k},\ \mu_{k}\leftarrow\text{Backtracking-Routine}({\boldsymbol{x}}_{k-1},{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1},L_{k-1},\mu_{k-1},1). (5.1)
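The detailed routine is deferred to Supplement D. As a rough illustration only, the sketch below shows, in the spirit of the backtracking test of [26], how a local estimate of LL and a short step could be obtained along the direction 𝒚k𝒙k1{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}; the routine name is ours, and the update of μ\mu is left as a placeholder since its estimation is specified only in Supplement D.

```python
def backtracking_step(f, grad_x, x, direction, L_prev, mu_prev,
                      delta_max=1.0, tau1=2.0, tau2=0.9):
    """Sketch of a backtracking step: estimate a local Lipschitz constant L and a
    short step size delta along `direction`, in the spirit of [26]."""
    d_norm2 = direction @ direction
    if d_norm2 == 0.0:
        return 0.0, L_prev, mu_prev
    L = tau2 * L_prev                              # first try an optimistic decrease of L
    g_dot_d = grad_x @ direction
    fx = f(x)
    delta = min(delta_max, max(0.0, -g_dot_d / (L * d_norm2)))
    # increase L until the quadratic upper bound holds at the trial point
    while f(x + delta * direction) > fx + delta * g_dot_d + 0.5 * L * delta**2 * d_norm2:
        L *= tau1
        delta = min(delta_max, max(0.0, -g_dot_d / (L * d_norm2)))
    mu = mu_prev                                   # placeholder: mu estimation is in Supplement D
    return delta, L, mu
```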

We focus on the 1\ell_{1}-constrained logistic regression problem with the form:

min𝒙1β1mi=1mln(1+exp(bi𝐚i,𝒙))+λ2𝒙2,\min_{\lVert{\boldsymbol{x}}\rVert_{1}\leq\beta}\frac{1}{m}\sum_{i=1}^{m}\text{ln}(1+\text{exp}(-b_{i}\langle\mathbf{a}_{i},{\boldsymbol{x}}\rangle))+\frac{\lambda}{2}\lVert{\boldsymbol{x}}\rVert^{2},

where A=[𝐚1,,𝐚m]m×nA=[\mathbf{a}_{1},\dots,\mathbf{a}_{m}]\in\mathbb{R}^{m\times n} and 𝒃m{\boldsymbol{b}}\in\mathbb{R}^{m}. We use the dataset Madelon [17], which has m=4400,n=500m=4400,n=500, and full density (i.e., density=1\text{density}=1). We set β=1\beta=1 and λ=1/n\lambda=1/n. We compare our methods against AFW, PFW, and the standard FW, all of which are equipped with the backtracking technique. The results are presented in Figure 4. It can be observed that our two methods achieve the best performance in terms of running time. Although AFW and PFW perform well in terms of iteration count, their overall efficiency is hindered by the increasing cost of maintaining a growing active set and computing the away direction.

[Figure 4: two panels (a)-(b)]
Figure 4: FW gap vs time/iterations on the 1\ell_{1}-constrained logistic regression problem.

5.4 rSFW Framework Combined with the Away Step Technique

In this subsection, we demonstrate that the well-known linearly convergent variants of the standard Frank-Wolfe method, AFW and PFW [22], can be seamlessly integrated into the inner loop of the rSFW framework. This straightforward combination leads to a significant performance improvement over the original AFW and PFW methods.

We focus on the simplex-constrained least squares problem min𝒙SnA𝒙𝒃22\min_{{\boldsymbol{x}}\in S_{n}}\lVert A{\boldsymbol{x}}-{\boldsymbol{b}}\rVert_{2}^{2}, where Am×n,m=800,n=200A\in\mathbb{R}^{m\times n},m=800,n=200, with standard Gaussian entries. We set 𝒃=A𝒙{\boldsymbol{b}}=A{\boldsymbol{x}}^{*}, where 𝒙{\boldsymbol{x}}^{*} is constructed by first generating a random nonnegative vector with sparsity parameter d=0.6d=0.6 and then normalizing it so that its components sum to 1. Thus, 0 is the optimal value of this problem. We use the same initial point 𝒙0=𝟏n/n{\boldsymbol{x}}_{0}={\boldsymbol{1}}_{n}/n for all methods.

In comparison to the experiments in the previous subsection, we introduce four additional algorithms: AFW and PFW, along with their respective versions integrated into the rSFW framework, denoted as rSFW-A and rSFW-P. Details of these methods are provided in Table 3.

We briefly explain here how the direction-correction 𝒈{\boldsymbol{g}} is computed within the general framework (1.2) for both the rSFW-A and rSFW-P algorithms. Based on Alg. 3, during the kk-th outer loop and the jj-th inner loop, let S(k,j)𝒱(S(𝒙^k1,d^k1))S^{(k,j)}\subset\mathcal{V}(S(\widehat{{\boldsymbol{x}}}_{k-1},\widehat{d}_{k-1})) denote the active set corresponding to the point 𝒑j1{\boldsymbol{p}}_{j-1}. Thus, 𝒑j1{\boldsymbol{p}}_{j-1} can be represented as 𝒑j1=𝒗S(k,j)α𝒗𝒗{\boldsymbol{p}}_{j-1}=\sum_{{\boldsymbol{v}}\in S^{(k,j)}}\alpha_{{\boldsymbol{v}}}{\boldsymbol{v}} where α𝒗>0\alpha_{{\boldsymbol{v}}}>0. Let 𝒗j=argmax𝒗S(k,j)f(𝒑j1),𝒗{\boldsymbol{v}}_{j}={\arg\max}_{{\boldsymbol{v}}\in S^{(k,j)}}\langle\nabla f({\boldsymbol{p}}_{j-1}),{\boldsymbol{v}}\rangle. For the rSFW-A method, the direction-correction is computed as:

{\boldsymbol{g}}_{j}=\begin{cases}\frac{1}{1-\alpha_{{\boldsymbol{v}}_{j}}}{\boldsymbol{p}}_{j-1}-{\boldsymbol{y}}_{j}-\frac{\alpha_{{\boldsymbol{v}}_{j}}}{1-\alpha_{{\boldsymbol{v}}_{j}}}{\boldsymbol{v}}_{j}&\text{if }\Delta_{j}<0,\\ {\boldsymbol{0}}&\text{if }\Delta_{j}\geq 0,\end{cases} (5.2)

where \Delta_{j}:=\langle-\nabla f({\boldsymbol{p}}_{j-1}),{\boldsymbol{y}}_{j}-{\boldsymbol{p}}_{j-1}\rangle-\langle-\nabla f({\boldsymbol{p}}_{j-1}),{\boldsymbol{p}}_{j-1}-{\boldsymbol{v}}_{j}\rangle.

For the rSFW-P method, the direction-correction is computed as:

𝒈j=𝒑j1(1α𝒗j)𝒚jα𝒗j𝒗j.{\boldsymbol{g}}_{j}={\boldsymbol{p}}_{j-1}-(1-\alpha_{{\boldsymbol{v}}_{j}}){\boldsymbol{y}}_{j}-\alpha_{{\boldsymbol{v}}_{j}}{\boldsymbol{v}}_{j}. (5.3)

Finally, we update the point 𝒑j{\boldsymbol{p}}_{j} using the iteration:

𝒑j(1δj)𝒑j1+δj(𝒚j+𝒈j),{\boldsymbol{p}}_{j}\leftarrow(1-\delta_{j}){\boldsymbol{p}}_{j-1}+\delta_{j}({\boldsymbol{y}}_{j}+{\boldsymbol{g}}_{j}), (5.4)

which replaces the original iteration.
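The pieces (5.2)-(5.4) translate directly into a few lines of Python; the sketch below (function names are ours, and the gradient is evaluated at 𝒑j1{\boldsymbol{p}}_{j-1}) is meant only to make the bookkeeping explicit.

```python
import numpy as np

def pairwise_correction(p_prev, y_j, v_j, alpha_vj):
    """Direction-correction (5.3) for rSFW-P."""
    return p_prev - (1.0 - alpha_vj) * y_j - alpha_vj * v_j

def away_correction(p_prev, y_j, v_j, alpha_vj, grad_p):
    """Direction-correction (5.2) for rSFW-A; grad_p is the gradient at p_{j-1}."""
    fw_progress = -grad_p @ (y_j - p_prev)      # progress along the FW direction
    away_progress = -grad_p @ (p_prev - v_j)    # progress along the away direction
    if fw_progress - away_progress < 0.0 and alpha_vj < 1.0:   # Delta_j < 0
        return p_prev / (1.0 - alpha_vj) - y_j - alpha_vj / (1.0 - alpha_vj) * v_j
    return np.zeros_like(p_prev)

def corrected_update(p_prev, y_j, g_j, delta_j):
    """Update (5.4): p_j = (1 - delta_j) * p_{j-1} + delta_j * (y_j + g_j)."""
    return (1.0 - delta_j) * p_prev + delta_j * (y_j + g_j)
```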

[Figure 5: two panels (a)-(b)]
Figure 5: FW gap vs time/iterations on the Simplex-constrained least squares problem with (m,n)=(800,200)(m,n)=(800,200).

The results are given in Figure 5. rSFW-P demonstrates superior performance compared to all other algorithms, excelling in both the number of iterations and running time, achieving nearly twice the efficiency of PFW. Additionally, both framework-based acceleration algorithms exhibit substantial performance improvements over the standalone rSFW framework.

6 Conclusion

In this paper, we introduced a novel oracle, SLMO, which leverages the advantageous geometric properties of the unit simplex. This design enables SLMO to be implemented with the same computational complexity as the standard linear optimization oracle, preserving the efficiency of the Frank-Wolfe framework. Building on this oracle, we proposed two new variants of the classical Frank-Wolfe algorithm: the Simplex Frank-Wolfe (SFW) and refined Simplex Frank-Wolfe (rSFW) algorithms. Both methods achieve linear convergence for smooth and strongly convex optimization problems over polytopes. The linear convergence rates of these methods depend only on the condition number of the objective function, the polytope's geometric quantities, and the problem's dimension, demonstrating their scalability and robustness in various settings.

The purpose of this paper is to develop the basic framework for the new SFW methods and to demonstrate that they are highly competitive. We do so in the simplest setting, namely ff being strongly convex and smooth. We made no attempt to weaken this assumption beyond pointing out that the obtained results should also hold under the quadratic growth condition. An immediate question would be to extend the methods to the general convex or even nonconvex setting. Furthermore, the LLOO proposed in [14] is an elegant framework, yet it has been largely omitted from recent surveys on FW methods. To the best of our knowledge, SFW is the first LLOO instance that has been extensively tested and compared with other popular FW methods. An intriguing question is whether there exist alternative LLOO approaches that simultaneously satisfy the following criteria: (i) adhering to the LLOO framework, (ii) enabling fast and accurate computation, and (iii) incurring significantly lower overhead compared to projection-based methods. We leave these topics to our future research.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (Grant No. 12171271) and by Hong Kong RGC General Research Fund PolyU/15303124.

Appendix

Appendix A Proof of Lemma 3.1

Proof.

(1) By the definition of the simplex ball S(𝒙,d)S({\boldsymbol{x}},d), we have

S(𝟏n/n,1/n)\displaystyle S({\boldsymbol{1}}_{n}/n,1/n) =1n𝟏n+n×1nS0=1n𝟏n+S0=Sn.\displaystyle=\frac{1}{n}{\boldsymbol{1}}_{n}+n\times\frac{1}{n}S_{0}=\frac{1}{n}{\boldsymbol{1}}_{n}+S_{0}=S_{n}.

Furthermore, we have

Conv{n𝒆i𝟏n:i[n]}=nConv{𝒆i:i[n]}𝟏n=nSn𝟏n=nS0.\mbox{Conv}\left\{n{\boldsymbol{e}}_{i}-{\boldsymbol{1}}_{n}:\ i\in[n]\right\}=n\mbox{Conv}\left\{{\boldsymbol{e}}_{i}:\ i\in[n]\right\}-{\boldsymbol{1}}_{n}=nS_{n}-{\boldsymbol{1}}_{n}=nS_{0}.

Consequently, the characterization (3.2) holds.

(2) On one hand, for every 𝒚S(𝒙,d)Sn{\boldsymbol{y}}\in S({\boldsymbol{x}},d)\cap S_{n}, there exists 𝝀Sn\boldsymbol{\lambda}\in S_{n} such that yi=xid+ndλi,i[n]y_{i}=x_{i}-d+nd\lambda_{i},\forall i\in[n]. Define for each i[n]i\in[n],

λ^i={ndλij=1nmin{d,xj}if xid,xid+ndλij=1nmin{d,xj}if xi<d.\widehat{\lambda}_{i}=\left\{\begin{array}[]{ll}\frac{nd\lambda_{i}}{\sum_{j=1}^{n}\min\{d,x_{j}\}}&\text{if }x_{i}\geq d,\\[8.61108pt] \frac{x_{i}-d+nd\lambda_{i}}{\sum_{j=1}^{n}\min\{d,x_{j}\}}&\text{if }x_{i}<d.\end{array}\right.

Since 𝒚Sn{\boldsymbol{y}}\in S_{n} and 𝝀Sn\boldsymbol{\lambda}\in S_{n}, we have ndλi0nd\lambda_{i}\geq 0 and xid+ndλi=yi0x_{i}-d+nd\lambda_{i}=y_{i}\geq 0, which shows that λ^i0\widehat{\lambda}_{i}\geq 0 for each i[n]i\in[n]. Moreover, let :={i[n]xi<d}{\mathcal{I}}_{-}:=\{i\in[n]\mid x_{i}<d\} be an index set. Then we have

i=1nλ^i=i(xid)+ndi=1nmin{d,xi}=ixi+(n||)dixi+(n||)d=1.\sum_{i=1}^{n}\widehat{\lambda}_{i}=\frac{\sum_{i\in{\mathcal{I}}_{-}}(x_{i}-d)+nd}{\sum_{i=1}^{n}\min\{d,x_{i}\}}=\frac{\sum_{i\in{\mathcal{I}}_{-}}x_{i}+(n-|{\mathcal{I}}_{-}|)d}{\sum_{i\in{\mathcal{I}}_{-}}x_{i}+(n-|{\mathcal{I}}_{-}|)d}=1.

As above, we have verified that 𝝀^Sn\widehat{\boldsymbol{\lambda}}\in S_{n}. We now show that 𝒚=(𝒙^d^𝟏n)+nd^𝝀^{\boldsymbol{y}}=(\widehat{{\boldsymbol{x}}}-\widehat{d}{\boldsymbol{1}}_{n})+n\widehat{d}\widehat{\boldsymbol{\lambda}}, where 𝒙^\widehat{{\boldsymbol{x}}} and d^\widehat{d} are defined in (3.3). This follows from the fact that, for each i[n]i\in[n],

x^id^+nd^λ^i\displaystyle\widehat{x}_{i}-\widehat{d}+n\widehat{d}\;\widehat{\lambda}_{i}
=\displaystyle= max{xi,d}+d^dd^+nj=1nmin{d,xj}n×min{xid,0}+ndλij=1nmin{d,xj}\displaystyle\max\{x_{i},d\}+\widehat{d}-d-\widehat{d}+n\frac{\sum_{j=1}^{n}\min\{d,x_{j}\}}{n}\times\frac{\min\{x_{i}-d,0\}+nd\lambda_{i}}{\sum_{j=1}^{n}\min\{d,x_{j}\}}
=\displaystyle= max{xi,d}d+min{xid,0}+ndλi=xid+ndλi=yi.\displaystyle\max\{x_{i},d\}-d+\min\{x_{i}-d,0\}+nd\lambda_{i}=\;x_{i}-d+nd\lambda_{i}=y_{i}.

Thus, we have 𝒚S(𝒙^,d^){\boldsymbol{y}}\in S(\widehat{{\boldsymbol{x}}},\widehat{d}), leading to SnS(𝒙,d)S(𝒙^,d^)S_{n}\cap S({\boldsymbol{x}},d)\subset S(\widehat{{\boldsymbol{x}}},\widehat{d}).

On the other hand, for every 𝒚S(𝒙^,d^){\boldsymbol{y}}\in S(\widehat{{\boldsymbol{x}}},\widehat{d}), there exists 𝝀^Sn\widehat{\boldsymbol{\lambda}}\in S_{n} such that yi=x^id^+nd^λ^iy_{i}=\widehat{x}_{i}-\widehat{d}+n\widehat{d}\;\widehat{\lambda}_{i}. Due to the fact that for i[n]i\in[n], yix^id^=max{xi,d}d0y_{i}\geq\widehat{x}_{i}-\widehat{d}=\max\{x_{i},d\}-d\geq 0 and

i=1nyi=i=1nmax{xi,d}+nd^nd=i=1nmax{xi,d}+i=1nmin{xi,d}nd=i=1nxi=1,\sum_{i=1}^{n}y_{i}=\sum_{i=1}^{n}\max\{x_{i},d\}+n\widehat{d}-nd=\sum_{i=1}^{n}\max\{x_{i},d\}+\sum_{i=1}^{n}\min\{x_{i},d\}-nd=\sum_{i=1}^{n}x_{i}=1,

we have 𝒚Sn{\boldsymbol{y}}\in S_{n}. We now turn to show that 𝒚S(𝒙,d){\boldsymbol{y}}\in S({\boldsymbol{x}},d).

Let λi:=max{xi,d}xi+j=1nmin{xj,d}λ^ind\lambda_{i}:=\frac{\max\{x_{i},d\}-x_{i}+\sum_{j=1}^{n}\min\{x_{j},d\}\widehat{\lambda}_{i}}{nd}. It is not difficult to verify that λi0\lambda_{i}\geq 0 and i=1nλi=1\sum_{i=1}^{n}\lambda_{i}=1. Moreover, we have

xid+ndλi=xid+max{xi,d}xi+j=1nmin{xj,d}λ^i\displaystyle x_{i}-d+nd\lambda_{i}=x_{i}-d+\max\{x_{i},d\}-x_{i}+\sum_{j=1}^{n}\min\{x_{j},d\}\widehat{\lambda}_{i}
=\displaystyle= (max{xi,d}+d^d)d^+nd^λ^i=x^id^+nd^λ^i=yi,\displaystyle(\max\{x_{i},d\}+\widehat{d}-d)-\widehat{d}+n\widehat{d}\;\widehat{\lambda}_{i}=\widehat{x}_{i}-\widehat{d}+n\widehat{d}\;\widehat{\lambda}_{i}=y_{i},

implying 𝒚S(𝒙,d){\boldsymbol{y}}\in S({\boldsymbol{x}},d). Thus S(𝒙^,d^)SnS(𝒙,d)S(\widehat{{\boldsymbol{x}}},\widehat{d})\subset S_{n}\cap S({\boldsymbol{x}},d). This finishes the proof for (3.3).

We now proceed to prove (3.4). Following the definition of S(𝒙,d)S({\boldsymbol{x}},d), we have

S(𝒙,d)=(nd)Sn+(𝒙d𝟏n).S({\boldsymbol{x}},d)=(nd)S_{n}+({\boldsymbol{x}}-d{\boldsymbol{1}}_{n}).

Translating to (𝒙1,d1)({\boldsymbol{x}}_{1},d_{1}) and (𝒙2,d2)({\boldsymbol{x}}_{2},d_{2}), we have

S(𝒙1,d1)\displaystyle S({\boldsymbol{x}}_{1},d_{1}) =(nd1)Sn+(𝒙1d1𝟏n),\displaystyle=(nd_{1})S_{n}+({\boldsymbol{x}}_{1}-d_{1}{\boldsymbol{1}}_{n}),
S(𝒙2,d2)\displaystyle S({\boldsymbol{x}}_{2},d_{2}) =(nd2)Sn+(𝒙2d2𝟏n)\displaystyle=(nd_{2})S_{n}+({\boldsymbol{x}}_{2}-d_{2}{\boldsymbol{1}}_{n})
=(nd1)[n×d2nd1Sn+(𝒙2(𝒙1d1𝟏n)nd1d2nd1𝟏n)]+(𝒙1d1𝟏n)\displaystyle=(nd_{1})\left[n\times\frac{d_{2}}{nd_{1}}S_{n}+\left(\frac{{\boldsymbol{x}}_{2}-({\boldsymbol{x}}_{1}-d_{1}{\boldsymbol{1}}_{n})}{nd_{1}}-\frac{d_{2}}{nd_{1}}{\boldsymbol{1}}_{n}\right)\right]+({\boldsymbol{x}}_{1}-d_{1}{\boldsymbol{1}}_{n})
=nd1S(𝒙2(𝒙1d1𝟏n)nd1,d2nd1)+(𝒙1d1𝟏n).\displaystyle=nd_{1}S\left(\frac{{\boldsymbol{x}}_{2}-({\boldsymbol{x}}_{1}-d_{1}{\boldsymbol{1}}_{n})}{nd_{1}},\frac{d_{2}}{nd_{1}}\right)+({\boldsymbol{x}}_{1}-d_{1}{\boldsymbol{1}}_{n}).

Therefore,

S({\boldsymbol{x}}_{1},d_{1})\cap S({\boldsymbol{x}}_{2},d_{2})=({\boldsymbol{x}}_{1}-d_{1}{\boldsymbol{1}}_{n})+(nd_{1})\left[S_{n}\cap S\left(\frac{{\boldsymbol{x}}_{2}-({\boldsymbol{x}}_{1}-d_{1}{\boldsymbol{1}}_{n})}{nd_{1}},\frac{d_{2}}{nd_{1}}\right)\right].

Using (3.3), we have

SnS(𝒙2(𝒙1d1𝟏n)nd1,d2nd1)=S(𝒙^,d^),S_{n}\cap S\left(\frac{{\boldsymbol{x}}_{2}-({\boldsymbol{x}}_{1}-d_{1}{\boldsymbol{1}}_{n})}{nd_{1}},\frac{d_{2}}{nd_{1}}\right)=S(\widehat{{\boldsymbol{x}}},\widehat{d}),

where

d^\displaystyle\widehat{d} =1ni=1nmin{d2nd1,x2(i)x1(i)+d1nd1}=i=1nmin{d2,x2(i)x1(i)+d1}n2d1\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\min\left\{\frac{d_{2}}{nd_{1}},\frac{x_{2}(i)-x_{1}(i)+d_{1}}{nd_{1}}\right\}=\frac{\sum_{i=1}^{n}\min\{d_{2},x_{2}(i)-x_{1}(i)+d_{1}\}}{n^{2}d_{1}}
=(a)1+i=1nmin{d1x1(i),d2x2(i)}n2d1\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\frac{1+\sum_{i=1}^{n}\min\{d_{1}-x_{1}(i),d_{2}-x_{2}(i)\}}{n^{2}d_{1}}
x^i\displaystyle\widehat{x}_{i} =max{d2nd1,x2(i)x1(i)+d1nd1}+(d^d2nd1),i[n].\displaystyle=\max\left\{\frac{d_{2}}{nd_{1}},\frac{x_{2}(i)-x_{1}(i)+d_{1}}{nd_{1}}\right\}+\left(\widehat{d}-\frac{d_{2}}{nd_{1}}\right),\quad\forall i\in[n].

Here, (a)(a) follows from the fact 𝒙2Sn{\boldsymbol{x}}_{2}\in S_{n}. We then have

S(𝒙1,d1)S(𝒙2,d2)\displaystyle S({\boldsymbol{x}}_{1},d_{1})\cap S({\boldsymbol{x}}_{2},d_{2}) =\displaystyle= (nd1)S(𝒙^,d^)+(𝒙1d1𝟏n)\displaystyle(nd_{1})S(\widehat{{\boldsymbol{x}}},\widehat{d})+({\boldsymbol{x}}_{1}-d_{1}{\boldsymbol{1}}_{n})
=\displaystyle= (nd1)[(nd^)Sn+(𝒙^d^𝟏n)]+(𝒙1d1𝟏n)\displaystyle(nd_{1})\left[(n\widehat{d})S_{n}+(\widehat{{\boldsymbol{x}}}-\widehat{d}{\boldsymbol{1}}_{n})\right]+({\boldsymbol{x}}_{1}-d_{1}{\boldsymbol{1}}_{n})
=\displaystyle= n(nd1d^)Sn+[(𝒙1+nd1𝒙^d1𝟏n)(nd1d^)𝟏n]\displaystyle n(nd_{1}\widehat{d})S_{n}+\left[({\boldsymbol{x}}_{1}+nd_{1}\widehat{{\boldsymbol{x}}}-d_{1}{\boldsymbol{1}}_{n})-(nd_{1}\widehat{d}){\boldsymbol{1}}_{n}\right]
=\displaystyle= S(𝒙1+nd1𝒙^d1𝟏n,nd1d^)=S(𝒙3,d3),\displaystyle S\Big({\boldsymbol{x}}_{1}+nd_{1}\widehat{{\boldsymbol{x}}}-d_{1}{\boldsymbol{1}}_{n},\ nd_{1}\widehat{d}\Big)=S({\boldsymbol{x}}_{3},d_{3}),

where

d3\displaystyle d_{3} =nd1d^=1+i=1nmin{d1x1(i),d2x2(i)}n,\displaystyle=nd_{1}\widehat{d}=\frac{1+\sum_{i=1}^{n}\min\{d_{1}-x_{1}(i),d_{2}-x_{2}(i)\}}{n},
x3(i)\displaystyle x_{3}(i) =nd1x^(i)+(x1(i)d1)\displaystyle=nd_{1}\widehat{x}(i)+(x_{1}{(i)}-d_{1})
=max{d2,x2(i)x1(i)+d1}+(nd1d^d2)+(x1(i)d1)\displaystyle=\max\{d_{2},x_{2}(i)-x_{1}(i)+d_{1}\}+(nd_{1}\widehat{d}-d_{2})+(x_{1}{(i)}-d_{1})
=max{d2,x2(i)x1(i)+d1}+(x1(i)d1d2)+d3\displaystyle=\max\{d_{2},x_{2}(i)-x_{1}(i)+d_{1}\}+(x_{1}{(i)}-d_{1}-d_{2})+d_{3}
=max{x1(i)d1,x2(i)d2}+d3.\displaystyle=\max\{x_{1}(i)-d_{1},x_{2}(i)-d_{2}\}+d_{3}.

Thus, we have proven (3.4).

(3) Since the optimal solution 𝒚{\boldsymbol{y}}^{*} lies at an extreme point of S(𝒙,d)S({\boldsymbol{x}},d), we have

𝒚\displaystyle{\boldsymbol{y}}^{*} =argmin𝒚=𝒙+d(n𝒆i𝟏n),i[n]𝒄,𝒙+d(n𝒆i𝟏n)\displaystyle={\operatorname*{\arg\min}}_{{\boldsymbol{y}}={\boldsymbol{x}}+d(n{\boldsymbol{e}}_{i}-{\boldsymbol{1}}_{n}),i\in[n]}\langle{\boldsymbol{c}},{\boldsymbol{x}}+d(n{\boldsymbol{e}}_{i}-{\boldsymbol{1}}_{n})\rangle
=argmin𝒚=𝒙+d(n𝒆i𝟏n),i[n]𝒄,ei=𝒙+d(n𝒆i𝟏n),\displaystyle={\operatorname*{\arg\min}}_{{\boldsymbol{y}}={\boldsymbol{x}}+d(n{\boldsymbol{e}}_{i}-{\boldsymbol{1}}_{n}),i\in[n]}\langle{\boldsymbol{c}},e_{i}\rangle={\boldsymbol{x}}+d(n{\boldsymbol{e}}_{i^{*}}-{\boldsymbol{1}}_{n}),

where i=argmini[n]cii^{*}=\operatorname*{\arg\min}_{i\in[n]}c_{i}.

(4) By (3.1), for any points 𝒚1,𝒚2S(𝒙,d){\boldsymbol{y}}_{1},{\boldsymbol{y}}_{2}\in S({\boldsymbol{x}},d), there exist 𝝀1,𝝀2Sn\boldsymbol{\lambda}_{1},\boldsymbol{\lambda}_{2}\in S_{n} such that 𝒚i=(𝒙d𝟏n)+nd𝝀i{\boldsymbol{y}}_{i}=({\boldsymbol{x}}-d{\boldsymbol{1}}_{n})+nd\boldsymbol{\lambda}_{i} for i[2]i\in[2]. Thus, we have

max𝒚1,𝒚2S(𝒙,d)𝒚1𝒚2=ndmax𝝀1,𝝀2Sn𝝀1𝝀2=2nd.\max_{{\boldsymbol{y}}_{1},{\boldsymbol{y}}_{2}\in S({\boldsymbol{x}},d)}\lVert{\boldsymbol{y}}_{1}-{\boldsymbol{y}}_{2}\rVert=nd\cdot\max_{\boldsymbol{\lambda}_{1},\boldsymbol{\lambda}_{2}\in S_{n}}\lVert\boldsymbol{\lambda}_{1}-\boldsymbol{\lambda}_{2}\rVert=\sqrt{2}nd.

(5) Let 𝝀=1n(𝟏n+𝒚𝒙d)\boldsymbol{\lambda}=\frac{1}{n}({\boldsymbol{1}}_{n}+\frac{{\boldsymbol{y}}-{\boldsymbol{x}}}{d}). Since 𝒚𝒙d\lVert{\boldsymbol{y}}-{\boldsymbol{x}}\rVert\leq d, we have λi0\lambda_{i}\geq 0. Moreover, since 𝒙,𝒚Sn{\boldsymbol{x}},{\boldsymbol{y}}\in S_{n}, we have i=1nλi=1\sum_{i=1}^{n}\lambda_{i}=1, which implies 𝝀Sn\boldsymbol{\lambda}\in S_{n}. Thus by (3.1), 𝒚=(𝒙d𝟏n)+nd𝝀S(𝒙,d){\boldsymbol{y}}=({\boldsymbol{x}}-d{\boldsymbol{1}}_{n})+nd\boldsymbol{\lambda}\in S({\boldsymbol{x}},d). We now turn to the last part of Lemma 3.1(5). This simply follows from

max𝒚S(𝒙,d)𝒚𝒙=ndmax𝝀Sn𝝀𝟏nn=n(n1)dnd.\max_{{\boldsymbol{y}}\in S({\boldsymbol{x}},d)}\lVert{\boldsymbol{y}}-{\boldsymbol{x}}\rVert=nd\cdot\max_{\boldsymbol{\lambda}\in S_{n}}\lVert\boldsymbol{\lambda}-\frac{{\boldsymbol{1}}_{n}}{n}\rVert=\sqrt{n(n-1)}d\leq nd.

The proof is completed. ∎

Appendix B Proofs for Section 4

B.1 A Useful Bound

The proof of Lemma 4.3 relies on the following lemma, whose proof used some key technical results established in [14]. In particular, for given 𝒙,𝒚𝒫{\boldsymbol{x}},{\boldsymbol{y}}\in\mathcal{P} and 𝝀(𝒙)\boldsymbol{\lambda}\in\mathcal{M}({{\boldsymbol{x}}}), there must exist 𝒛𝒫{\boldsymbol{z}}\in\mathcal{P} and γ[0,1]\gamma\in[0,1] such that

{\boldsymbol{y}}=\gamma{\boldsymbol{x}}+(1-\gamma){\boldsymbol{z}}=\gamma\sum_{j=1}^{N}\lambda_{j}{\boldsymbol{v}}_{j}+(1-\gamma){\boldsymbol{z}}=\sum_{j=1}^{N}\Big(\lambda_{j}-\lambda_{j}(1-\gamma)\Big){\boldsymbol{v}}_{j}+(1-\gamma){\boldsymbol{z}}.

Let Δj:=λj(1γ)[0,λj]\Delta_{j}:=\lambda_{j}(1-\gamma)\in[0,\lambda_{j}]. We then have 1γ=j=1NΔj=:Δ.1-\gamma=\sum_{j=1}^{N}\Delta_{j}=:\Delta. Put another way, the point 𝒚{\boldsymbol{y}} can always be represented by

𝒚=j=1N(λjΔj)𝒗j+Δ𝒛,{\boldsymbol{y}}=\sum_{j=1}^{N}(\lambda_{j}-\Delta_{j}){\boldsymbol{v}}_{j}+\Delta{\boldsymbol{z}}, (B.1)

for some 𝒛𝒫{\boldsymbol{z}}\in\mathcal{P}, Δj[0,λj]\Delta_{j}\in[0,\lambda_{j}]. Since 𝒫\mathcal{P} is compact, there must exist a representation of (B.1) with the smallest Δ\Delta among all such representations. An important fact established in [14, Lemma 5.3] is that this minimal value of Δ\Delta can be bounded. We refine this bound below for the largest Δj\Delta_{j} in Δ\Delta.

Lemma B.1.

Let 𝐱,𝐲𝒫{\boldsymbol{x}},{\boldsymbol{y}}\in\mathcal{P} with 𝛌(𝐱)\boldsymbol{\lambda}\in\mathcal{M}({\boldsymbol{x}}), and let 𝐲{\boldsymbol{y}} be represented as in (B.1) with Δ\Delta having been minimized. Then it holds that

maxi[N]{Δi}ψξ𝒙𝒚.\max_{i\in[N]}\{\Delta_{i}\}\leq\frac{\psi}{\xi}\lVert{\boldsymbol{x}}-{\boldsymbol{y}}\rVert.
Proof.

The claim is trivial for the case i=1NΔi=0\sum_{i=1}^{N}\Delta_{i}=0. Now we suppose that i=1NΔi>0\sum_{i=1}^{N}\Delta_{i}>0 (i.e., at least one Δi>0\Delta_{i}>0). The following index sets C(𝒛)C({\boldsymbol{z}}) and C0(𝒛)C_{0}({\boldsymbol{z}}) are defined in [14]; we briefly describe them and use some established results relating to them. Denote the index set C(𝒛):={j[m]A2(j)𝒛=b2(j)}C({\boldsymbol{z}}):=\{j\in[m]\mid A_{2}(j){\boldsymbol{z}}=b_{2}(j)\}. By [14, Lemma 5.3], we have C(𝒛)C({\boldsymbol{z}})\neq\emptyset since at least one Δi>0\Delta_{i}>0. Let C0(𝒛)C(𝒛)C_{0}({\boldsymbol{z}})\subseteq C({\boldsymbol{z}}) be such that the set {A2(j)}jC0(𝒛)\{A_{2}(j)\}_{j\in C_{0}({\boldsymbol{z}})} forms a basis for the set {A2(j)}jC(𝒛)\{A_{2}(j)\}_{j\in C({\boldsymbol{z}})}. Denote by A2,z|C0(𝒛)|×nA_{2,{z}}\in\mathbb{R}^{|C_{0}({\boldsymbol{z}})|\times n} the matrix whose rows are {A2(j)}jC0(𝒛)\{A_{2}(j)\}_{j\in C_{0}({\boldsymbol{z}})}. By definition we have A2,zψ\|A_{2,{z}}\|\leq\psi. Then we obtain

𝒙𝒚2\displaystyle\|{\boldsymbol{x}}-{\boldsymbol{y}}\|^{2} =i[N]:Δi>0Δi(𝒗i𝒛)2\displaystyle=\left\|\sum_{i\in[N]:\Delta_{i}>0}\Delta_{i}({\boldsymbol{v}}_{i}-{\boldsymbol{z}})\right\|^{2}
1ψ2jC0(𝒛)(i[N]:Δi>0Δi(b2(j)A2(j)𝒗i))2\displaystyle\geq\frac{1}{\psi^{2}}\sum_{j\in C_{0}({\boldsymbol{z}})}\left(\sum_{i\in[N]:\Delta_{i}>0}\Delta_{i}(b_{2}(j)-A_{2}(j){\boldsymbol{v}}_{i})\right)^{2}
(a)1ψ2jC0(𝒛)i[N]:Δi>0Δi2(b2(j)A2(j)𝒗i)2\displaystyle\stackrel{{\scriptstyle(a)}}{{\geq}}\frac{1}{\psi^{2}}\sum_{j\in C_{0}({\boldsymbol{z}})}\sum_{i\in[N]:\Delta_{i}>0}\Delta_{i}^{2}(b_{2}(j)-A_{2}(j){\boldsymbol{v}}_{i})^{2}
=1ψ2i[N]:Δi>0jC0(𝒛)Δi2(b2(j)A2(j)𝒗i)2,\displaystyle=\frac{1}{\psi^{2}}\sum_{i\in[N]:\Delta_{i}>0}\sum_{j\in C_{0}({\boldsymbol{z}})}\Delta_{i}^{2}(b_{2}(j)-A_{2}(j){\boldsymbol{v}}_{i})^{2},

where the first inequality is established in the proof of [14, Lemma 5.5], and (a)(a) follows from the fact that b2(j)A2(j)𝒗i0b_{2}(j)-A_{2}(j){\boldsymbol{v}}_{i}\geq 0 for any i[N]i\in[N] and any jC0(𝒛)j\in C_{0}({\boldsymbol{z}}). Combining [14, Lemma 5.3] and [14, Lemma 5.4], we obtain that for every i[N]i\in[N] with Δi>0\Delta_{i}>0 there exists jC0(𝒛)j\in C_{0}({\boldsymbol{z}}) such that b2(j)A2(j)𝒗iξb_{2}(j)-A_{2}(j){\boldsymbol{v}}_{i}\geq\xi. Hence,

𝒙𝒚2ξ2ψ2i[N]:Δi>0Δi2ξ2ψ2maxi[N]{Δi2}.\|{\boldsymbol{x}}-{\boldsymbol{y}}\|^{2}\geq\frac{\xi^{2}}{\psi^{2}}\sum_{i\in[N]:\Delta_{i}>0}\Delta_{i}^{2}\geq\frac{\xi^{2}}{\psi^{2}}\max_{i\in[N]}\{\Delta_{i}^{2}\}.

Thus we conclude that maxi[N]{Δi}ψξ𝒙𝒚\max_{i\in[N]}\{\Delta_{i}\}\leq\frac{\psi}{\xi}\lVert{\boldsymbol{x}}-{\boldsymbol{y}}\rVert. ∎

B.2 Proof of Lemma 4.3

Proof.

We begin by proving the first part. Write 𝒙=i=1Nλi𝒗i{\boldsymbol{x}}=\sum_{i=1}^{N}\lambda_{i}{\boldsymbol{v}}_{i} for 𝝀SN\boldsymbol{\lambda}\in S_{N} and express 𝒚=i=1N(λiΔi)𝒗i+(i=1NΔi)𝒛{\boldsymbol{y}}=\sum_{i=1}^{N}(\lambda_{i}-\Delta_{i}){\boldsymbol{v}}_{i}+(\sum_{i=1}^{N}\Delta_{i}){\boldsymbol{z}}, where Δi[0,λi],i[N]\Delta_{i}\in[0,\lambda_{i}],\forall i\in[N] and 𝒛𝒫{\boldsymbol{z}}\in\mathcal{P}. Here, the sum Δ=i=1NΔi\Delta=\sum_{i=1}^{N}\Delta_{i} is minimized (as in Lemma B.1). We then have

maxi[N]{Δi}ψξ𝒙𝒚ψξ×dDη=d,\max_{i\in[N]}\{\Delta_{i}\}\leq\frac{\psi}{\xi}\lVert{\boldsymbol{x}}-{\boldsymbol{y}}\rVert\leq\frac{\psi}{\xi}\times\frac{dD}{\eta}=d,

where the first inequality uses Lemma B.1, the second inequality uses the assumption 𝒙𝒚(dD)/η\|{\boldsymbol{x}}-{\boldsymbol{y}}\|\leq(dD)/\eta, and the last equality follows from the definition of η\eta in (2.6). Express 𝒛{\boldsymbol{z}} as 𝒛=i=1Nλi𝒗i{\boldsymbol{z}}=\sum_{i=1}^{N}\lambda_{i}^{\prime}{\boldsymbol{v}}_{i}, where 𝝀SN\boldsymbol{\lambda}^{\prime}\in S_{N}. We can then rewrite 𝒚{\boldsymbol{y}} as follows:

𝒚=i=1N(λiΔi+Δλi)𝒗i=i=1N((λid)+dΔi+Δλi)𝒗i.{\boldsymbol{y}}=\sum_{i=1}^{N}(\lambda_{i}-\Delta_{i}+\Delta\lambda_{i}^{\prime}){\boldsymbol{v}}_{i}=\sum_{i=1}^{N}\left((\lambda_{i}-d)+d-\Delta_{i}+\Delta\lambda_{i}^{\prime}\right){\boldsymbol{v}}_{i}.

Since maxi[N]{Δi}d\max_{i\in[N]}\{\Delta_{i}\}\leq d, we have dΔi+Δλi0d-\Delta_{i}+\Delta\lambda_{i}^{\prime}\geq 0 for all i[N]i\in[N]. Moreover, the sum \sum_{i=1}^{N}(d-\Delta_{i}+\Delta\lambda_{i}^{\prime})=Nd, which implies that \frac{(d-\Delta_{i}+\Delta\lambda_{i}^{\prime})_{i}}{Nd}\in S_{N}. By the definition in (3.1), we have 𝝀y:=(λiΔi+Δλi)S(𝝀,d)\boldsymbol{\lambda}_{y}:=(\lambda_{i}-\Delta_{i}+\Delta\lambda_{i}^{\prime})\in S(\boldsymbol{\lambda},d), thus 𝒚S𝒫(𝒙,d){\boldsymbol{y}}\in S_{\mathcal{P}}({\boldsymbol{x}},d). Furthermore, we have

𝒚,𝒄=𝝀y,𝒄ext𝝀,𝒄ext=𝒚,𝒄.\langle{\boldsymbol{y}},{\boldsymbol{c}}\rangle=\langle\boldsymbol{\lambda}_{y},{\boldsymbol{c}}_{ext}\rangle\geq\langle\boldsymbol{\lambda}^{*},{\boldsymbol{c}}_{ext}\rangle=\langle{\boldsymbol{y}}^{*},{\boldsymbol{c}}\rangle.

We now turn to prove the second part. Referring to Algorithm 4, we note that

𝝀+d𝟏N=max{𝝀x,d𝟏N}d𝟏N=𝝀xmin{𝝀x,d𝟏N}\boldsymbol{\lambda}_{+}-d{\boldsymbol{1}}_{N}=\max\{\boldsymbol{\lambda}_{x},d{\boldsymbol{1}}_{N}\}-d{\boldsymbol{1}}_{N}=\boldsymbol{\lambda}_{x}-\min\{\boldsymbol{\lambda}_{x},d{\boldsymbol{1}}_{N}\}

and

Nd^=i=1Nmin{λx(i),d}.N\widehat{d}=\sum_{i=1}^{N}\min\{\lambda_{x}(i),d\}.

Denote +(𝝀x):={i[N]λx(i)>0}{\mathcal{I}}_{+}(\boldsymbol{\lambda}_{x}):=\{i\in[N]\mid\lambda_{x}(i)>0\}, δi:=min{λx(i),d}\delta_{i}:=\min\{\lambda_{x}(i),d\}, and δ:=i+(𝝀x)δi\delta:=\sum_{i\in{\mathcal{I}}_{+}(\boldsymbol{\lambda}_{x})}\delta_{i}. Then the optimal solution 𝒚{\boldsymbol{y}}^{*} produced by Algorithm 4 satisfies

{\boldsymbol{y}}^{*}=V\big(\boldsymbol{\lambda}_{+}-d{\boldsymbol{1}}_{N}+N\widehat{d}\,{\boldsymbol{e}}_{i^{*}}\big)=\sum_{i\in{\mathcal{I}}_{+}(\boldsymbol{\lambda}_{x})}(\lambda_{x}(i)-\delta_{i}){\boldsymbol{v}}_{i}+\delta{\boldsymbol{v}}_{i^{*}}.

Thus we have that

𝒙𝒚\displaystyle\lVert{\boldsymbol{x}}-{\boldsymbol{y}}^{*}\rVert =i+(𝝀x)min{λx(i),d}(𝒗i𝒗i)\displaystyle=\lVert\sum_{i\in{\mathcal{I}}_{+}(\boldsymbol{\lambda}_{x})}\min\{\lambda_{x}(i),d\}({\boldsymbol{v}}_{i}-{\boldsymbol{v}}_{i^{*}})\rVert
i+(𝝀x)min{λx(i),d}𝒗i𝒗i|+(𝝀x)|dD(n+1)dD.\displaystyle\leq\sum_{i\in{\mathcal{I}}_{+}(\boldsymbol{\lambda}_{x})}\min\{\lambda_{x}(i),d\}\lVert{\boldsymbol{v}}_{i}-{\boldsymbol{v}}_{i^{*}}\rVert\leq|{\mathcal{I}}_{+}(\boldsymbol{\lambda}_{x})|dD\leq(n+1)dD.

B.3 Proof of Theorem 4.4

Proof.

The proof follows the framework of the proof of Theorem 3.4 and Lemma 4.3. We first claim that 𝒙S𝒫(𝒙k,ηDdk){\boldsymbol{x}}^{*}\in S_{\mathcal{P}}({\boldsymbol{x}}_{k},\frac{\eta}{D}d_{k}) and that f(𝒙k)Bkμdk22f({\boldsymbol{x}}_{k})-B_{k}\leq\frac{\mu d_{k}^{2}}{2}. We prove this by induction. First, we have

μd022=f(𝒙0)B0f(𝒙0)f(a)μ2𝒙0𝒙2,\frac{\mu d_{0}^{2}}{2}=f({\boldsymbol{x}}_{0})-B_{0}\geq f({\boldsymbol{x}}_{0})-f^{*}\stackrel{{\scriptstyle(a)}}{{\geq}}\frac{\mu}{2}\lVert{\boldsymbol{x}}_{0}-{\boldsymbol{x}}^{*}\rVert^{2},

where (a)(a) comes from (2.1). This implies that 𝒙0𝒙d0\lVert{\boldsymbol{x}}_{0}-{\boldsymbol{x}}^{*}\rVert\leq d_{0}, and by Lemma 4.3, we have 𝒙S𝒫(𝒙0,ηDd0){\boldsymbol{x}}^{*}\in S_{\mathcal{P}}({\boldsymbol{x}}_{0},\frac{\eta}{D}d_{0}). Therefore, the claim holds for k=0k=0.

Now suppose that 𝒙S𝒫(𝒙t,ηDdt){\boldsymbol{x}}^{*}\in S_{\mathcal{P}}({\boldsymbol{x}}_{t},\frac{\eta}{D}d_{t}) and f(𝒙t)Btμdt22f({\boldsymbol{x}}_{t})-B_{t}\leq\frac{\mu d_{t}^{2}}{2} for all tk1t\leq k-1. Let γ:=μ2L(n+1)2η2\gamma:=\frac{\mu}{2L(n+1)^{2}\eta^{2}}. In the same manner as the proof of Theorem 3.4, for the step size policies (2.3), (2.4), or (4.5), we have

f(𝒙k)Bk\displaystyle f({\boldsymbol{x}}_{k})-B_{k}\leq (1γ)(f(𝒙k1)Bk1)+Lγ22𝒚k𝒙k12\displaystyle(1-\gamma)(f({\boldsymbol{x}}_{k-1})-B_{k-1})+\frac{L\gamma^{2}}{2}\lVert{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rVert^{2}
(d)\displaystyle\stackrel{{\scriptstyle(d)}}{{\leq}} (1γ)μ2dk12+Lγ22(n+1)2η2dk12\displaystyle(1-\gamma)\frac{\mu}{2}d_{k-1}^{2}+\frac{L\gamma^{2}}{2}(n+1)^{2}\eta^{2}d_{k-1}^{2}
=\displaystyle= [(1γ)μ2+Lγ2(n+1)2η22]dk12,\displaystyle\left[(1-\gamma)\frac{\mu}{2}+\frac{L\gamma^{2}(n+1)^{2}\eta^{2}}{2}\right]d_{k-1}^{2},

where (d)(d) is due to our inductive hypothesis and Lemma 4.3. By plugging in the value of γ\gamma, and using 1xex1-x\leq e^{-x}, we have that

f(𝒙k)Bkμ2(1μ4L(n+1)2η2)dk12μ2eμ4L(n+1)2η2dk12.f({\boldsymbol{x}}_{k})-B_{k}\leq\frac{\mu}{2}(1-\frac{\mu}{4L(n+1)^{2}\eta^{2}})d_{k-1}^{2}\leq\frac{\mu}{2}e^{-\frac{\mu}{4L(n+1)^{2}\eta^{2}}}d_{k-1}^{2}.

Combining the above inequality with the fact that f(𝒙k)Bk=μ2(2(f(𝒙k)Bk)μ)2f({\boldsymbol{x}}_{k})-B_{k}=\frac{\mu}{2}(\sqrt{\frac{2(f({\boldsymbol{x}}_{k})-B_{k})}{\mu}})^{2}, and by the definition of dkd_{k}, we conclude that f(𝒙k)Bkμdk22f({\boldsymbol{x}}_{k})-B_{k}\leq\frac{\mu d_{k}^{2}}{2}. By the inductive hypothesis, we know that 𝒙S𝒫(𝒙t,ηDdt){\boldsymbol{x}}^{*}\in S_{\mathcal{P}}({\boldsymbol{x}}_{t},\frac{\eta}{D}d_{t}) holds for all tk1t\leq k-1. Thus Bt+1wB_{t+1}^{w} is a valid lower bound on ff^{*}, and consequently, BkB_{k} is also a lower bound on ff^{*}. Now by (2.1), we have

𝒙k𝒙2(2/μ)(f(𝒙k)f)(2/μ)(f(𝒙k)Bk)dk2.\lVert{\boldsymbol{x}}_{k}-{\boldsymbol{x}}^{*}\rVert^{2}\leq(2/\mu)(f({\boldsymbol{x}}_{k})-f^{*})\leq(2/\mu)(f({\boldsymbol{x}}_{k})-B_{k})\leq d_{k}^{2}. (B.2)

This implies that 𝒙S𝒫(𝒙k,ηDdk){\boldsymbol{x}}^{*}\in S_{\mathcal{P}}({\boldsymbol{x}}_{k},\frac{\eta}{D}d_{k}) by Lemma 4.3. Therefore, we have completed the proof of the claim.

We now prove the conclusion of Theorem 4.4. The claim established above gives $B_{k}\leq f^{*}$, which confirms the first part of the inequality. By the definition of $d_{k}$ and the claim, we have

f(𝒙k)Bk12μdk2μd022eμ4L(n+1)2η2k.f({\boldsymbol{x}}_{k})-B_{k}\leq\frac{1}{2}\mu d_{k}^{2}\leq\frac{\mu d_{0}^{2}}{2}e^{-\frac{\mu}{4L(n+1)^{2}\eta^{2}}k}.

The proof is thus completed. ∎

B.4 Proof of Theorem 4.5

The proof of Theorem 4.5 relies on the following lemma.

Lemma B.2.

For any 𝛌1,𝛌2S(𝛌x,d)\boldsymbol{\lambda}_{1},\boldsymbol{\lambda}_{2}\in S(\boldsymbol{\lambda}_{x},d), define the two corresponding points:

𝒚j=i=1Nλj(i)𝒗iS𝒫(𝒙,d),j=1,2.{\boldsymbol{y}}_{j}=\sum_{i=1}^{N}\lambda_{j}(i){\boldsymbol{v}}_{i}\in S_{\mathcal{P}}({\boldsymbol{x}},d),\quad j=1,2.

We must have 𝐲1𝐲2(n+1)dD.\|{\boldsymbol{y}}_{1}-{\boldsymbol{y}}_{2}\|\leq(n+1)dD.

Proof.

We note that S𝒫(𝒙,d)S_{\mathcal{P}}({\boldsymbol{x}},d) is compact. Let 𝒱(S)\mathcal{V}(S) denote the set of its vertices. Obviously, we have

𝒚1𝒚2max𝒖1,𝒖2𝒱(S)𝒖1𝒖2.\|{\boldsymbol{y}}_{1}-{\boldsymbol{y}}_{2}\|\leq\max_{{\boldsymbol{u}}_{1},{\boldsymbol{u}}_{2}\in\mathcal{V}(S)}\|{\boldsymbol{u}}_{1}-{\boldsymbol{u}}_{2}\|.

Let ${\boldsymbol{u}}_{1}\in\mathcal{V}(S)$ be a vertex of $S_{\mathcal{P}}({\boldsymbol{x}},d)$. By the separation theorem [28], there exists a vector ${\boldsymbol{c}}_{1}\in\mathbb{R}^{n}$ such that

𝒄1,𝒖1<𝒄1,𝒖for all𝒖𝒱(S){𝒖1}.\langle{\boldsymbol{c}}_{1},\;{\boldsymbol{u}}_{1}\rangle<\langle{\boldsymbol{c}}_{1},\;{\boldsymbol{u}}\rangle\quad\mbox{for all}\ {\boldsymbol{u}}\in\mathcal{V}(S)\setminus\{{\boldsymbol{u}}_{1}\}.

In other words, 𝒖1{\boldsymbol{u}}_{1} is the unique solution of the following problem:

min𝒄1,𝒖s.t.𝒖S𝒫(𝒙,d)𝒫.\min\;\langle{\boldsymbol{c}}_{1},\;{\boldsymbol{u}}\rangle\quad\mbox{s.t.}\quad{\boldsymbol{u}}\in S_{\mathcal{P}}({\boldsymbol{x}},d)\cap\mathcal{P}.

It follows from Lemma 4.2 that ${\boldsymbol{u}}_{1}$ can be represented as

𝒖1=j=1N(λx(j)δj)𝒗j+δ𝒛1for some𝒛1𝒫,{\boldsymbol{u}}_{1}=\sum_{j=1}^{N}(\lambda_{x}(j)-\delta_{j}){\boldsymbol{v}}_{j}+\delta{\boldsymbol{z}}_{1}\quad\mbox{for some}\ {\boldsymbol{z}}_{1}\in\mathcal{P},

where $\delta_{j}:=\min\{\lambda_{x}(j),d\}$ and $\delta:=\sum_{j=1}^{N}\delta_{j}$, both of which are independent of ${\boldsymbol{z}}_{1}$. Similarly, ${\boldsymbol{u}}_{2}$ admits the representation:

𝒖2=j=1N(λx(j)δj)𝒗j+δ𝒛2for some𝒛2𝒫.{\boldsymbol{u}}_{2}=\sum_{j=1}^{N}(\lambda_{x}(j)-\delta_{j}){\boldsymbol{v}}_{j}+\delta{\boldsymbol{z}}_{2}\quad\mbox{for some}\ {\boldsymbol{z}}_{2}\in\mathcal{P}.

Therefore, we have

𝒚1𝒚2\displaystyle\|{\boldsymbol{y}}_{1}-{\boldsymbol{y}}_{2}\| \displaystyle\leq max𝒖1,𝒖2𝒱(S)𝒖1𝒖2\displaystyle\max_{{\boldsymbol{u}}_{1},{\boldsymbol{u}}_{2}\in\mathcal{V}(S)}\|{\boldsymbol{u}}_{1}-{\boldsymbol{u}}_{2}\|
\displaystyle\leq δmax𝒛1,𝒛2𝒫𝒛1𝒛2δD\displaystyle\delta\max_{{\boldsymbol{z}}_{1},{\boldsymbol{z}}_{2}\in\mathcal{P}}\|{\boldsymbol{z}}_{1}-{\boldsymbol{z}}_{2}\|\leq\delta D
=\displaystyle= Di+(𝝀x)δiD|+(𝝀x)|d(n+1)dD,\displaystyle D\sum_{i\in\mathcal{I}_{+}(\boldsymbol{\lambda}_{x})}\delta_{i}\leq D|\mathcal{I}_{+}(\boldsymbol{\lambda}_{x})|d\leq(n+1)dD,

where $\mathcal{I}_{+}(\boldsymbol{\lambda}_{x}):=\left\{i\ |\ \lambda_{x}(i)>0\right\}$ and we have used $|\mathcal{I}_{+}(\boldsymbol{\lambda}_{x})|\leq(n+1)$. ∎

Proof of Theorem 4.5.

We first claim that 𝒙S𝒫(𝒙k,dk){\boldsymbol{x}}^{*}\in S_{\mathcal{P}}({{\boldsymbol{x}}}_{k},{d}_{k}) for any k0k\geq 0 and prove this by induction. From the proof of Theorem 4.4, this is true for k=0k=0. Now suppose that 𝒙S𝒫(𝒙k1,dk1){\boldsymbol{x}}^{*}\in S_{\mathcal{P}}({{\boldsymbol{x}}}_{k-1},{d}_{k-1}) for some k1k\geq 1. Note that the inner loop of Alg. 6 corresponds to the standard Frank-Wolfe algorithm. By Theorem 2.1 and Lemma B.2, we have

f({\boldsymbol{p}}_{j})-f^{*}\leq\frac{2L}{j+1}\left((n+1){d}_{k-1}D\right)^{2}=\frac{2L(n+1)^{2}{d}_{k-1}^{2}D^{2}}{j+1}

holds for all $j\in[J]$. In the case where the inner loop terminates at $j=J$, we obtain

f(𝒙k)f=f(𝒑J)ff(𝒑J)CJμdk2D22η2.f({\boldsymbol{x}}_{k})-f^{*}=f({\boldsymbol{p}}_{J})-f^{*}\leq f({\boldsymbol{p}}_{J})-C_{J}\leq\frac{\mu d_{k}^{2}D^{2}}{2\eta^{2}}.

Similarly, if the inner loop is interrupted due to lines 9-11 of the algorithm, we still have f(𝒙k)ff(𝒑j)Cjμ2ρ2η2dk12D2=μdk2D22η2f({\boldsymbol{x}}_{k})-f^{*}\leq f({\boldsymbol{p}}_{j})-C_{j}\leq\frac{\mu}{2\rho^{2}\eta^{2}}{d}_{k-1}^{2}D^{2}=\frac{\mu d_{k}^{2}D^{2}}{2\eta^{2}}. Using the fact that f(𝒙k)fμ2𝒙k𝒙2f({\boldsymbol{x}}_{k})-f^{*}\geq\frac{\mu}{2}\lVert{\boldsymbol{x}}_{k}-{\boldsymbol{x}}^{*}\rVert^{2}, we have 𝒙k𝒙2dk2D2η2\lVert{\boldsymbol{x}}_{k}-{\boldsymbol{x}}^{*}\rVert^{2}\leq\frac{d_{k}^{2}D^{2}}{\eta^{2}}, which implies via Lemma 4.3 that 𝒙S𝒫(𝒙k,dk){\boldsymbol{x}}^{*}\in S_{\mathcal{P}}({{\boldsymbol{x}}}_{k},{d}_{k}).

We now prove the conclusion of Theorem 4.5. Since $d_{0}=\frac{\eta}{D}\sqrt{\frac{2(f({\boldsymbol{x}}_{0})-B_{0})}{\mu}}$ and $d_{k}=\frac{d_{k-1}}{\rho}$, we have

f({\boldsymbol{x}}_{k})-f^{*}\leq f({\boldsymbol{x}}_{k})-B_{k}\leq\frac{\mu d_{k}^{2}D^{2}}{2\eta^{2}}\leq(f({\boldsymbol{x}}_{0})-B_{0})\rho^{-2k}.

This completes the proof. ∎

Appendix C Properties of Some Common Polytopes

C.1 Carathéodory Representation Examples

Hypercube: When 𝒫\mathcal{P} is a hypercube Bn:={𝒙nxi[0,1],i[n]}B_{n}:=\{{\boldsymbol{x}}\in\mathbb{R}^{n}\mid x_{i}\in[0,1],\forall i\in[n]\}, any point 𝒙Bn{\boldsymbol{x}}\in B_{n} can be naturally represented as

𝒙=i=1n1(xjixji+1)𝒗i+xjn𝟏n+(1xj1)𝟎n,{\boldsymbol{x}}=\sum_{i=1}^{n-1}(x_{j_{i}}-x_{j_{i+1}}){\boldsymbol{v}}_{i}+x_{j_{n}}{\boldsymbol{1}}_{n}+(1-x_{j_{1}}){\boldsymbol{0}}_{n},

where $j_{1},\dots,j_{n}$ is a permutation of $[n]$ such that $x_{j_{1}}\geq\dots\geq x_{j_{n}}$ and ${\boldsymbol{v}}_{i}$ is the vector whose components indexed by $j_{1},\dots,j_{i}$ equal $1$ and whose remaining components equal $0$.
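For concreteness, a minimal NumPy sketch of this construction is given below; it also checks that the returned weights form a convex combination reconstructing ${\boldsymbol{x}}$. The function name and interface are our own and are not part of the algorithms in this paper.

```python
import numpy as np

def hypercube_caratheodory(x):
    """Carathéodory representation of x in [0,1]^n as a convex combination
    of at most n+1 hypercube vertices.  Returns (weights, vertices) with
    x = weights @ vertices, weights >= 0 and weights summing to 1."""
    n = len(x)
    order = np.argsort(-x)          # j_1, ..., j_n with x[j_1] >= ... >= x[j_n]
    xs = x[order]
    weights, vertices = [], []
    for i in range(n - 1):          # vertex with 1's at positions j_1, ..., j_{i+1}
        v = np.zeros(n)
        v[order[:i + 1]] = 1.0
        weights.append(xs[i] - xs[i + 1])
        vertices.append(v)
    weights.append(xs[-1]);      vertices.append(np.ones(n))   # all-ones vertex
    weights.append(1.0 - xs[0]); vertices.append(np.zeros(n))  # origin
    return np.array(weights), np.array(vertices)

x = np.random.rand(6)
w, V = hypercube_caratheodory(x)
assert np.allclose(w @ V, x) and np.isclose(w.sum(), 1.0) and (w >= -1e-12).all()
```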

1\ell_{1}-ball: When 𝒫\mathcal{P} is a 1\ell_{1}-ball Ln:={𝒙ni=1n|xi|1}L_{n}:=\{{\boldsymbol{x}}\in\mathbb{R}^{n}\mid\sum_{i=1}^{n}|x_{i}|\leq 1\}, any point 𝒙Ln{\boldsymbol{x}}\in L_{n} can be naturally represented as

𝒙=i=1n1|xi|(sgn(xi)𝒆i)+(|xn|+sx)(sgn(xn)𝒆n)+sx(sgn(xn)𝒆n),{\boldsymbol{x}}=\sum_{i=1}^{n-1}|x_{i}|(\text{sgn}(x_{i}){\boldsymbol{e}}_{i})+\left(|x_{n}|+s_{x}\right)(\text{sgn}(x_{n}){\boldsymbol{e}}_{n})+s_{x}(-\text{sgn}(x_{n}){\boldsymbol{e}}_{n}),

where $s_{x}=(1-\sum_{i=1}^{n}|x_{i}|)/2$ and $\text{sgn}(x)=1$ if $x\geq 0$ and $-1$ otherwise.
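Analogously, the representation above can be sketched in a few lines of NumPy; again, the function name and interface are our own.

```python
import numpy as np

def l1_ball_caratheodory(x):
    """Carathéodory representation of x with ||x||_1 <= 1 as a convex
    combination of the 2n vertices ±e_i of the l1-ball."""
    n = len(x)
    s = (1.0 - np.abs(x).sum()) / 2.0        # the slack s_x
    sgn = np.where(x >= 0, 1.0, -1.0)        # sgn(0) := 1, as in the text
    weights, vertices = [], []
    for i in range(n - 1):
        weights.append(abs(x[i])); vertices.append(sgn[i] * np.eye(n)[i])
    weights.append(abs(x[-1]) + s); vertices.append(sgn[-1] * np.eye(n)[-1])
    weights.append(s);              vertices.append(-sgn[-1] * np.eye(n)[-1])
    return np.array(weights), np.array(vertices)

x = np.random.randn(6); x *= 0.8 / np.abs(x).sum()   # a point with ||x||_1 = 0.8 < 1
w, V = l1_ball_caratheodory(x)
assert np.allclose(w @ V, x) and np.isclose(w.sum(), 1.0) and (w >= -1e-12).all()
```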

Flow polytope: Let $G$ be a directed acyclic graph (DAG) with vertex set $V$ and edge set $E$ such that $|E|=n$. Let $s,t$ be two vertices in $V$, referred to as the source and target, respectively. The $s$-$t$ flow polytope, here denoted by $\mathcal{F}_{s,t}$, is the set of all unit $s$-$t$ flows in $G$. For any point ${\boldsymbol{x}}\in\mathcal{F}_{s,t}$ and $i\in[n]$, the entry $x_{i}$ represents the amount of flow through edge $i$; the flow vector ${\boldsymbol{x}}$ satisfies the flow conservation constraints at each vertex, ensuring that the flow entering any vertex (except $s$ and $t$) equals the flow leaving it. The extreme points of $\mathcal{F}_{s,t}$ are the extreme unit flows. To find the Carathéodory representation of a given flow ${\boldsymbol{x}}\in\mathcal{F}_{s,t}$, we proceed recursively as follows.

Starting with the flow 𝒙{\boldsymbol{x}}, we repeatedly perform the following steps until 𝒙=𝟎n{\boldsymbol{x}}={\boldsymbol{0}}_{n}:

1. Remove all edges with zero flow from the graph.

2. Identify the edge $i$ corresponding to the smallest non-zero flow in ${\boldsymbol{x}}$, i.e., $i\leftarrow\operatorname*{\arg\min}_{x_{i}>0}x_{i}$.

3. Find the extreme unit flow ${\boldsymbol{v}}$ in the reduced graph that includes edge $i$.

4. Subtract $x_{i}{\boldsymbol{v}}$ from the current flow, i.e., ${\boldsymbol{x}}\leftarrow{\boldsymbol{x}}-x_{i}{\boldsymbol{v}}$.

Since each operation eliminates at least one non-zero entry of the current flow, the loop terminates after at most $n$ steps. As a result, we obtain a Carathéodory representation of ${\boldsymbol{x}}$. This algorithm can be implemented in $O(n^{2})$ time when the graph is represented using sparsely structured adjacency matrices.
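A minimal Python sketch of this greedy decomposition is given below. It assumes the DAG is supplied as an edge list and that ${\boldsymbol{x}}$ is a feasible unit $s$-$t$ flow; the path-extraction logic used for steps 2-3 is one straightforward choice among several, and the names and interface are our own.

```python
import numpy as np

def flow_caratheodory(x, edges, s, t, tol=1e-12):
    """Greedy decomposition of a unit s-t flow x into extreme unit flows
    (indicator vectors of s-t paths), following the four steps above.
    `edges` is a list of (tail, head) pairs; x[i] is the flow on edges[i].
    Returns (weights, unit_flows) with x = weights @ unit_flows."""
    x = np.array(x, dtype=float)
    n = len(edges)
    weights, flows = [], []
    while x.max() > tol:
        active = [i for i in range(n) if x[i] > tol]     # step 1: drop zero-flow edges
        i_min = min(active, key=lambda i: x[i])          # step 2: smallest positive flow
        # step 3: grow an s-t path through edge i_min using only active edges;
        # flow conservation guarantees such a path exists, and every active
        # edge carries at least x[i_min] units of flow.
        path = [i_min]
        u = edges[i_min][0]
        while u != s:                                    # extend backwards to the source
            j = next(i for i in active if edges[i][1] == u)
            path.append(j); u = edges[j][0]
        u = edges[i_min][1]
        while u != t:                                    # extend forwards to the target
            j = next(i for i in active if edges[i][0] == u)
            path.append(j); u = edges[j][1]
        v = np.zeros(n); v[path] = 1.0                   # the extreme unit flow
        weights.append(x[i_min]); flows.append(v)
        x = x - x[i_min] * v                             # step 4: subtract and repeat
    return np.array(weights), np.array(flows)

# Example on a diamond graph with s = 0, t = 3:
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
w, F = flow_caratheodory([0.3, 0.7, 0.3, 0.7], edges, s=0, t=3)
assert np.allclose(w @ F, [0.3, 0.7, 0.3, 0.7]) and np.isclose(w.sum(), 1.0)
```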

C.2 Quantities of Some Common Polytopes

Hypercube: The diameter of BnB_{n} is given by

D(Bn)=max𝒙,𝒚Bn𝒙𝒚=𝟏n𝟎n=n.D(B_{n})=\max_{{\boldsymbol{x}},{\boldsymbol{y}}\in B_{n}}\|{\boldsymbol{x}}-{\boldsymbol{y}}\|=\|{\boldsymbol{1}}_{n}-{\boldsymbol{0}}_{n}\|=\sqrt{n}.

Since BnB_{n} can be represented as

Bn={𝒙n(InIn)𝒙(𝟏n𝟎n)},B_{n}=\left\{{\boldsymbol{x}}\in\mathbb{R}^{n}\mid\left(\begin{array}[]{c}I_{n}\\ -I_{n}\end{array}\right){\boldsymbol{x}}\leq\left(\begin{array}[]{c}{\boldsymbol{1}}_{n}\\ {\boldsymbol{0}}_{n}\end{array}\right)\right\},

it follows from the definition in Subsection 2.3 that ξ(Bn)=1\xi(B_{n})=1 and ψ(Bn)=1\psi(B_{n})=1. Thus, the quantity of BnB_{n} is

η(Bn)=ψ(Bn)D(Bn)/ξ(Bn)=n.\eta(B_{n})={\psi(B_{n})D(B_{n})}/{\xi(B_{n})}=\sqrt{n}.

1\ell_{1}-ball: The diameter of LnL_{n} is given by

D(Ln)=max𝒙,𝒚Ln𝒙𝒚=𝒆1(𝒆1)=2.D(L_{n})=\max_{{\boldsymbol{x}},{\boldsymbol{y}}\in L_{n}}\|{\boldsymbol{x}}-{\boldsymbol{y}}\|=\|{\boldsymbol{e}}_{1}-(-{\boldsymbol{e}}_{1})\|=2.

Note that LnL_{n} can be described by the linear inequalities system Ln={𝒙nA2𝒙1n𝟏2n}L_{n}=\{{\boldsymbol{x}}\in\mathbb{R}^{n}\mid A_{2}{\boldsymbol{x}}\leq\frac{1}{\sqrt{n}}{\boldsymbol{1}}_{2^{n}}\}, where A22n×nA_{2}\in\mathbb{R}^{2^{n}\times n} is a matrix whose entries are either ±1n\pm\frac{1}{\sqrt{n}} and whose rows all have unit 2\ell_{2} norm. Following the definition in Subsection 2.3, we have ξ(Ln)=2n\xi(L_{n})=\frac{2}{\sqrt{n}} and

1n=maxM𝔸(Ln)1nMFψ(Ln)=maxM𝔸(Ln)MmaxM𝔸(Ln)MF=n,\frac{1}{\sqrt{n}}=\max_{M\in\mathbb{A}(L_{n})}\frac{1}{\sqrt{n}}\|M\|_{F}\leq\psi(L_{n})=\max_{M\in\mathbb{A}(L_{n})}\|M\|\leq\max_{M\in\mathbb{A}(L_{n})}\|M\|_{F}=\sqrt{n},

where F\|\cdot\|_{F} denotes the Frobenius norm. Thus, the quantity of LnL_{n} can be estimated as

η(Ln)=ψ(Ln)D(Ln)/ξ(Ln)[1,n].\eta(L_{n})={\psi(L_{n})D(L_{n})}/{\xi(L_{n})}\in[1,n].

Flow Polytope: For every two extreme unit flows 𝒙1,𝒙2𝒱(s,t){\boldsymbol{x}}_{1},{\boldsymbol{x}}_{2}\in\mathcal{V}(\mathcal{F}_{s,t}), since 𝒱(s,t){0,1}n\mathcal{V}(\mathcal{F}_{s,t})\subseteq\{0,1\}^{n}, we have 𝒙1𝒙2n\|{\boldsymbol{x}}_{1}-{\boldsymbol{x}}_{2}\|\leq\sqrt{n}. Thus, the diameter of s,t\mathcal{F}_{s,t} can be estimated as

D(s,t)=max𝒙1,𝒙2𝒱(s,t)𝒙1𝒙2n.D(\mathcal{F}_{s,t})=\max_{{\boldsymbol{x}}_{1},{\boldsymbol{x}}_{2}\in\mathcal{V}(\mathcal{F}_{s,t})}\|{\boldsymbol{x}}_{1}-{\boldsymbol{x}}_{2}\|\leq\sqrt{n}.

When representing s,t\mathcal{F}_{s,t} using a system of linear equations and inequalities, the inequality constraints are given by In𝒙𝟎n-I_{n}{\boldsymbol{x}}\leq{\boldsymbol{0}}_{n}. Thus, by definition, we have ξ(s,t)=ψ(s,t)=1\xi(\mathcal{F}_{s,t})=\psi(\mathcal{F}_{s,t})=1, leading to

η(s,t)=D(s,t)n.\eta(\mathcal{F}_{s,t})=D(\mathcal{F}_{s,t})\leq\sqrt{n}.

It is worth noting that in the numerical experiment in Subsection 5.2, we set $\eta=D=\sqrt{66}$ while the dimension is $n=660$. This choice stems from the specific characteristics of the dataset used in [22, 15]. Specifically, we observe that each extreme unit flow contains exactly $33$ entries equal to $1$, with the remaining entries being $0$. Consequently, we can estimate $\eta=D\leq\sqrt{66}$.

Appendix D Backtracking Details

The routine for estimating the local parameters $L,\mu$ and the step size $\delta$ is given below. For rSFW with the simple step-size rule, one may use only the estimates of $L$ and $\mu$ without applying the corresponding step size.

Algorithm 7 Backtracking-Routine$({\boldsymbol{x}},{\boldsymbol{d}},L,\mu,\delta_{\text{max}})$
Require: iterate ${\boldsymbol{x}}\in\mathcal{P}$, update direction ${\boldsymbol{d}}$, previous estimates $L$ and $\mu$, maximum step size $\delta_{\text{max}}$.
1:  Choose $\tau_{1}>1$, $\tau_{2}\leq 1$.
2:  $L\leftarrow\tau_{2}L$, $\mu\leftarrow\mu/\tau_{2}$
3:  $\delta\leftarrow\min\left\{\frac{\langle-\nabla f({\boldsymbol{x}}),{\boldsymbol{d}}\rangle}{L\lVert{\boldsymbol{d}}\rVert^{2}},\delta_{\text{max}}\right\}$
4:  while $f({\boldsymbol{x}}+\delta{\boldsymbol{d}})>f({\boldsymbol{x}})+\delta\langle\nabla f({\boldsymbol{x}}),{\boldsymbol{d}}\rangle+\frac{\delta^{2}L}{2}\lVert{\boldsymbol{d}}\rVert^{2}$ do
5:   $L\leftarrow\tau_{1}L$
6:   $\mu\leftarrow\min\left\{\frac{2(f({\boldsymbol{x}}+\delta{\boldsymbol{d}})-f({\boldsymbol{x}})-\delta\langle\nabla f({\boldsymbol{x}}),{\boldsymbol{d}}\rangle)}{\delta^{2}\lVert{\boldsymbol{d}}\rVert^{2}},\mu\right\}$
7:   $\delta\leftarrow\min\left\{\frac{\langle-\nabla f({\boldsymbol{x}}),{\boldsymbol{d}}\rangle}{L\lVert{\boldsymbol{d}}\rVert^{2}},\delta_{\text{max}}\right\}$
8:  end while
Output: $\delta$, $L$, $\mu$.
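For reference, a minimal Python rendering of this routine is sketched below. The signature (objective $f$ and gradient grad_f passed as callables) and the default values of $\tau_{1},\tau_{2}$ are our own choices, and ${\boldsymbol{d}}$ is assumed to be a descent direction so that $\langle-\nabla f({\boldsymbol{x}}),{\boldsymbol{d}}\rangle\geq 0$.

```python
import numpy as np

def backtracking_routine(f, grad_f, x, d, L, mu, delta_max, tau1=2.0, tau2=0.9):
    """Sketch of the backtracking routine: adaptively estimates the local
    smoothness constant L, a local curvature estimate mu, and a step size
    delta along the update direction d."""
    L, mu = tau2 * L, mu / tau2                      # optimistic rescaling (line 2)
    fx, slope, d_sq = f(x), grad_f(x) @ d, d @ d     # slope = <grad f(x), d> <= 0
    delta = min(-slope / (L * d_sq), delta_max)      # line 3
    # line 4: backtrack until the quadratic upper bound along d holds
    while f(x + delta * d) > fx + delta * slope + 0.5 * delta**2 * L * d_sq:
        L = tau1 * L                                 # line 5: increase smoothness estimate
        curv = 2.0 * (f(x + delta * d) - fx - delta * slope) / (delta**2 * d_sq)
        mu = min(curv, mu)                           # line 6: shrink curvature estimate
        delta = min(-slope / (L * d_sq), delta_max)  # line 7: recompute step size
    return delta, L, mu
```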

References

  • [1] Beck, A., Teboulle, M.: A conditional gradient method with linear rate of convergence for solving convex linear systems. Mathematical Methods of Operations Research 59, 235–247 (2004)
  • [2] Gärtner, B., Jaggi, M.: Optimization for machine learning. Lecture Notes CS-439, ETH, Spring 2023 (2023)
  • [3] Bomze, I.M., Rinaldi, F., Zeffiro, D.: Frank–wolfe and friends: a journey into projection-free first-order optimization methods. 4OR 19(3), 313–345 (2021)
  • [4] Braun, G., Carderera, A., Combettes, C.W., Hassani, H., Karbasi, A., Mokhtari, A., Pokutta, S.: Conditional gradient methods. arXiv preprint arXiv:2211.14103 (2022)
  • [5] Chandrasekaran, V., Recht, B., Parrilo, P.A., Willsky, A.S.: The convex geometry of linear inverse problems. Foundations of Computational mathematics 12(6), 805–849 (2012)
  • [6] Clarkson, K.L.: Coresets, sparse greedy approximation, and the frank-wolfe algorithm. ACM Transactions on Algorithms (TALG) 6(4), 1–30 (2010)
  • [7] Combettes, C.W., Pokutta, S.: Complexity of linear minimization and projection on some sets. Operations Research Letters 49(4), 565–571 (2021)
  • [8] Condat, L.: Fast projection onto the simplex and the $\ell_{1}$ ball. Mathematical Programming 158(1), 575–585 (2016)
  • [9] Damla Ahipasaoglu, S., Sun, P., Todd, M.J.: Linear convergence of a modified frank–wolfe algorithm for computing minimum-volume enclosing ellipsoids. Optimisation Methods and Software 23(1), 5–19 (2008)
  • [10] Frank, M., Wolfe, P., et al.: An algorithm for quadratic programming. Naval research logistics quarterly 3(1-2), 95–110 (1956)
  • [11] Freund, R.M., Grigas, P.: New analysis and results for the frank–wolfe method. Mathematical Programming 155(1), 199–230 (2016)
  • [12] Garber, D.: Faster projection-free convex optimization over the spectrahedron. Advances in Neural Information Processing Systems 29 (2016)
  • [13] Garber, D., Hazan, E.: Playing non-linear games with linear oracles. In: 2013 IEEE 54th annual symposium on foundations of computer science, pp. 420–428. IEEE (2013)
  • [14] Garber, D., Hazan, E.: A linearly convergent variant of the conditional gradient algorithm under strong convexity, with applications to online and stochastic optimization. SIAM Journal on Optimization 26(3), 1493–1528 (2016)
  • [15] Garber, D., Wolf, N.: Frank-wolfe with a nearest extreme point oracle. In: Conference on Learning Theory, pp. 2103–2132. PMLR (2021)
  • [16] Guélat, J., Marcotte, P.: Some comments on wolfe’s ‘away step’. Mathematical Programming 35(1), 110–119 (1986)
  • [17] Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L.A.: Feature extraction: foundations and applications, vol. 207. Springer (2008)
  • [18] Hazan, E.: Sparse approximate solutions to semidefinite programs. In: Latin American symposium on theoretical informatics, pp. 306–316. Springer (2008)
  • [19] Jaggi, M.: Revisiting frank-wolfe: Projection-free sparse convex optimization. In: International conference on machine learning, pp. 427–435. PMLR (2013)
  • [20] Jaggi, M., Sulovský, M., et al.: A simple algorithm for nuclear norm regularized problems. In: Proceedings of the 27th international conference on machine learning (ICML-10), pp. 471–478 (2010)
  • [21] Joulin, A., Tang, K., Fei-Fei, L.: Efficient image and video co-localization with frank-wolfe algorithm. In: D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (eds.) Computer Vision – ECCV 2014, pp. 253–268. Springer International Publishing, Cham (2014)
  • [22] Lacoste-Julien, S., Jaggi, M.: On the global linear convergence of frank-wolfe optimization variants. Advances in neural information processing systems 28 (2015)
  • [23] Lan, G.: The complexity of large-scale convex programming under a linear optimization oracle. arXiv preprint arXiv:1309.5550 (2013)
  • [24] Lan, G.: First-order and stochastic optimization methods for machine learning, vol. 1. Springer (2020)
  • [25] Levitin, E.S., Polyak, B.T.: Constrained minimization methods. USSR Computational mathematics and mathematical physics 6(5), 1–50 (1966)
  • [26] Pedregosa, F., Negiar, G., Askari, A., Jaggi, M.: Linearly convergent frank-wolfe with backtracking line-search. In: International conference on artificial intelligence and statistics, pp. 1–10. PMLR (2020)
  • [27] Pokutta, S.: The frank-wolfe algorithm: a short introduction. Jahresbericht der Deutschen Mathematiker-Vereinigung 126(1), 3–35 (2024)
  • [28] Rockafellar, R.T.: Convex analysis, vol. 11. Princeton university press (1997)
  • [29] Végh, L.A.: Strongly polynomial algorithm for a class of minimum-cost flow problems with separable convex objectives. In: Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pp. 27–40 (2012)