Simplex Frank-Wolfe: Linear Convergence and Its Numerical Efficiency for Convex Optimization over Polytopes

Haoning Wang Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China Houduo Qi Corresponding author: houduo.qi@polyu.edu.hk Department of Data Science and Artificial Intelligence, and Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong Liping Zhang Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
Abstract

We investigate variants of the Frank-Wolfe (FW) algorithm for smooth and strongly convex optimization over polyhedral sets, with the goal of designing algorithms that achieve linear convergence while keeping the per-iteration complexity as low as possible. Starting from the simple yet fundamental unit simplex, and based on geometrically intuitive motivations, we introduce a novel oracle called the Simplex Linear Minimization Oracle (SLMO), which can be implemented with the same complexity as the standard FW oracle. We then present two FW variants based on SLMO: the Simplex Frank-Wolfe (SFW) method and the refined Simplex Frank-Wolfe (rSFW) method. Both variants achieve a linear convergence rate for all three common step-size rules. Finally, we generalize the entire framework from the unit simplex to arbitrary polytopes. Furthermore, the refinement step in rSFW can accommodate existing FW strategies such as the well-known away-steps and pairwise steps, leading to outstanding numerical performance. We emphasize that the oracle used in our rSFW method requires only one more vector addition compared to the standard LMO, resulting in the lowest per-iteration computational overhead among all known Frank-Wolfe variants with linear convergence.

keywords: Frank-Wolfe algorithm, conditional gradient methods, linear convergence, convex optimization, first-order methods, linear programming

1 Introduction

Over the past decades, Frank-Wolfe (FW) algorithms [10] (a.k.a. conditional gradients [25]) have been extensively investigated due to their lower per-iteration complexity compared to projected or proximal gradient-based methods, in particular for large-scale machine learning applications and sparse optimization. This topic has been comprehensively covered in several recent publications including [3, 4, 27], [24, Chapter 7] and [2, Chapter 10], to name just a few. The key step in FW algorithms is the Linear Minimization Oracle (LMO). We refer to [23] for (worst-case) complexity analysis of general LMOs. One of the most often cited examples is the LMO over the unit simplex S_{n}:=\left\{{\boldsymbol{x}}\in\mathbb{R}^{n}\ |\ \sum x_{i}=1,\ {\boldsymbol{x}}\geq 0\right\}. Projection onto S_{n} is much more expensive than the LMO over S_{n}. Research efforts have focused on developing LMOs that lead to linear convergence while keeping the computation of each LMO call as low as possible. Therefore, the total computational complexity of a typical FW-type algorithm can be calculated as follows.

\mbox{Total Computation}=(\#\mbox{Iterations})\times\mbox{(Computation of LMOs per iteration)}.
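To make the cost gap concrete, here is a minimal numpy sketch (our own illustration, not part of the paper): the LMO over S_{n} only needs the index of the smallest component of the linear objective, whereas the standard Euclidean projection onto S_{n} requires a full sort of the input vector.

```python
import numpy as np

def lmo_simplex(c):
    """LMO over the unit simplex: argmin_{y in S_n} <c, y> is the vertex e_{i*}
    with i* = argmin_i c_i; a single O(n) scan."""
    y = np.zeros_like(c)
    y[np.argmin(c)] = 1.0
    return y

def project_simplex(x):
    """Euclidean projection onto S_n via the classical sort-based algorithm;
    the sort makes it O(n log n)."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - 1.0) / np.arange(1, len(x) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(x - theta, 0.0)

c = np.random.randn(1_000_000)
y = lmo_simplex(c)        # one pass over c
p = project_simplex(c)    # requires sorting c
```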

Note that some existing algorithms may require more than one LMO per iteration. The purpose of this paper is to propose a new LMO whose computational cost is arguably the cheapest among all existing algorithms. Furthermore, it guarantees a linear convergence rate comparable to the known ones for convex optimization over a polytope:

minf(𝒙)s.t.𝒙𝒫=Conv(𝒱),\min\ f({\boldsymbol{x}})\qquad\mbox{s.t.}\quad{\boldsymbol{x}}\in\mathcal{P}=\mbox{Conv}(\mathcal{V}), (1.1)

where 𝒱n\mathcal{V}\subseteq\mathbb{R}^{n} is a finite set of vectors that we call atoms. For the moment, we only assume f:𝒞f:\mathcal{C}\mapsto\mathbb{R} is convex and differentiable for the convenience of discussion below. Here, 𝒞\mathcal{C} is an open set containing 𝒫\mathcal{P}. Later, we will enforce strong convexity as well as other properties for our analysis.

1.1 Related Work

There are a large number of publications that directly or remotely motivated this work. We are only able to list a few of them below with some critical analysis. Given an LMO, the original FW algorithm [10] reads as follows:

\left\{\begin{array}[]{l}{\boldsymbol{y}}_{k}=\mbox{LMO}(\nabla f({\boldsymbol{x}}_{k-1}),\mathcal{P})\in\operatorname*{\arg\min}\left\{\langle\nabla f({\boldsymbol{x}}_{k-1}),{\boldsymbol{y}}\rangle\ |\ {\boldsymbol{y}}\in\mathcal{P}\right\},\\[0.86108pt] {\boldsymbol{x}}_{k}=(1-\delta_{k}){\boldsymbol{x}}_{k-1}+\delta_{k}{\boldsymbol{y}}_{k},\ \ \delta_{k}\in(0,1],\end{array}\right.

where δk\delta_{k} is a steplength often satisfying certain conditions [6, 18, 19]. One of the key advantages of the FW method over the well-known projected gradient method is its lower cost per iteration in many common scenarios, such as the simplex [6], flow polytope [7, 21], spectrahedron [18, 12], and nuclear norm ball [20]. This efficiency makes the FW method particularly advantageous for large-scale problems. Numerous studies [19, 23, 11] have demonstrated that the convergence rate of the FW method is 𝒪(1k)\mathcal{O}(\frac{1}{k}) and that this rate is generally not improvable, except for some special cases, e.g., when the optimal solution lies in the interior of the constraint set [16]. In fact, there exist examples for which the convergence rate of the FW method does not improve even when the objective function is strongly convex, see [19, 23].

Therefore, modifications to the original FW method must be made in order to achieve a linear convergence rate. Significant advances have been made along this line of research, and there now exist a large number of FW variants that enjoy a linear convergence rate, see [4, Chapter 3]. The well-known ones include the FW method with away-steps (AFW) and the pairwise FW (PFW) [22, 9, 13]. Most of those modified methods can be cast in the following framework:

\left\{\begin{array}[]{l}{\boldsymbol{y}}_{k}=\mbox{LMO}(\nabla f({\boldsymbol{x}}_{k-1}),\mathcal{P}_{k})\in\operatorname*{\arg\min}\left\{\langle\nabla f({\boldsymbol{x}}_{k-1}),{\boldsymbol{y}}\rangle\ |\ {\boldsymbol{y}}\in\mathcal{P}_{k}\right\},\\[0.86108pt] {\boldsymbol{g}}_{k}=\mbox{direction-correction satisfying certain conditions},\\[0.86108pt] {\boldsymbol{x}}_{k}=(1-\delta_{k}){\boldsymbol{x}}_{k-1}+\delta_{k}({\boldsymbol{y}}_{k}+{\boldsymbol{g}}_{k}),\ \ \delta_{k}\in(0,1],\end{array}\right. (1.2)

where \mathcal{P}_{k}\subseteq\mathcal{P} is a well-constructed convex subset of \mathcal{P} at the current iterate {\boldsymbol{x}}_{k-1}. This framework has the flexibility to accommodate further technical strategies. For example, one may mix {\boldsymbol{y}}_{k} and {\boldsymbol{g}}_{k} through certain combinations with some line-search strategies. Both AFW and PFW make use of such flexibility. One major concern is that the computation of the LMO over \mathcal{P}_{k} may be significantly higher than that over \mathcal{P}. This is the case when \mathcal{P} is a simplex-like polytope, including S_{n}.

In a significant development aiming to address this issue, Garber and Hazan [14] proposed the methodology of LLOO (Local Linear Optimization Oracle), where 𝒫k\mathcal{P}_{k} is the intersection of SnS_{n} and a 1\ell_{1}-ball:

𝒫k=SnB1(𝒙k1,dk),withB1(𝒙,d)={𝒚|𝒙𝒚1d}.\mathcal{P}_{k}=S_{n}\cap B_{1}({\boldsymbol{x}}_{k-1},d_{k}),\quad\mbox{with}\quad B_{1}({\boldsymbol{x}},d)=\left\{{\boldsymbol{y}}\ |\ \|{\boldsymbol{x}}-{\boldsymbol{y}}\|_{1}\leq d\right\}.

A key result is that the LMO over this \mathcal{P}_{k} is an LLOO, which is referred to as the \ell_{1}-LMO. Hence, linear convergence follows when the radius is exponentially reduced at each iteration under the strongly convex setting. We note that the framework of LLOO can be cast as a special case of the Shrinking Conditional Gradient Method (sCndG) of Lan [23, Eq. 3.34] and [24, Alg. 7.2], where an arbitrary norm is used. The LLOO framework does not require the step of {\boldsymbol{g}}_{k} in (1.2). To understand its actual performance, Fig. 2(a) in Sect. 5 illustrates its computational time in comparison to the LMO over the unit simplex as well as a projection algorithm.

It can be clearly observed that the time taken by the \ell_{1}-LMO is roughly the same as that of the projection method, but is significantly slower (e.g., 100\times slower when n gets large) than \mbox{LMO}({\boldsymbol{c}},S_{n}). There is a deeper reason behind this performance, and it is best appreciated geometrically by considering the case n=3. Fig. 1(a) illustrates the intersection of the \ell_{1}-ball with the unit simplex. Note that for any point {\boldsymbol{x}}\in S_{3}, the intersection of the \ell_{1}-ball with the hyperplane containing S_{3} forms a regular hexagon. As the center {\boldsymbol{x}} and radius d vary, the shape of the intersection of this regular hexagon with the unit simplex becomes more complex, as shown by the blue region in Fig. 1(a). This increased complexity of the constraint set makes solving the \ell_{1}-LMO more challenging. From a computational point of view, the \ell_{1}-LMO requires a sorting procedure [14] to handle this complexity and hence takes considerably more time.

We also observe that the LLOO/sCndG framework was largely omitted from the recent surveys [3, 4, 27], probably for the following two reasons. One is the concern about the per-iteration computational cost discussed above. The other is the lack of flexibility in incorporating existing acceleration strategies such as away-steps. In this paper, we propose a new framework for constructing the subset \mathcal{P}_{k} that is not based on any norm. In the meantime, the per-iteration computational cost is reduced to arguably a minimum, and there is flexibility to include various acceleration strategies. We explain our framework below.

Figure 1: Schematic diagram of the feasible sets for solving the \ell_{1}-LMO (panel (a)) and SLMO (panel (b)) when n=3.

1.2 Simplex LMO and Simplex FW: a New Proposal

Ideally, we would like our subset \mathcal{P}_{k} to be like the simplex S_{n} so that linear optimization over it can be executed fast. We here introduce the simplex ball S({\boldsymbol{x}},d) with centroid {\boldsymbol{x}} and radius d>0 (detailed definition later). It coincides with the atom norm of the unit simplex as introduced by [5]. Moreover, the unit simplex S_{n} is itself a simplex ball. A very useful property is that the intersection of two simplex balls is again a simplex ball:

S(𝒙1,d1)S(𝒙2,d2)=S(𝒙3,d3),S({\boldsymbol{x}}_{1},d_{1})\cap S({\boldsymbol{x}}_{2},d_{2})=S({\boldsymbol{x}}_{3},d_{3}),

where 𝒙3{\boldsymbol{x}}_{3} and d3d_{3} can be cheaply computed from (𝒙i,di)({\boldsymbol{x}}_{i},d_{i}), i=1,2i=1,2. This property is illustrated in Fig. 1(b). Consequently, given 𝒙Sn{\boldsymbol{x}}\in S_{n}, a radius d>0d>0 and 𝒄n{\boldsymbol{c}}\in\mathbb{R}^{n}, we define the Simplex Linear Minimization Oracle SLMO(𝒙,d,𝒄)\mbox{SLMO}({\boldsymbol{x}},d,{\boldsymbol{c}}) by

SLMO(𝒙,d,𝒄)=argmin𝒚{𝒄,𝒚|𝒚SnS(𝒙,d)}.\mbox{SLMO}({\boldsymbol{x}},d,{\boldsymbol{c}})=\operatorname*{\arg\min}_{\boldsymbol{y}}\Big\{\langle{\boldsymbol{c}},{\boldsymbol{y}}\rangle\ |\ {\boldsymbol{y}}\in S_{n}\cap S({\boldsymbol{x}},d)\Big\}.

Since the constraint set is again a simplex ball, SLMO has a closed-form formula (see Alg. 1) and its complexity is roughly the same as that of \mbox{LMO}({\boldsymbol{c}},S_{n}). Furthermore, we prove that SLMO is an LLOO in Lemma 3.2. The consequence is that a linearly convergent algorithm can be developed by following the template in [14]. This part forms the first contribution of the paper.

Casting SLMO as an instance of LLOO does not benefit too much in terms of computational efficiency because, as is common practice in FW methods, the direction-correction step in (1.2) is essential for improving numerical performance. To accommodate this need, we make two additions. The first one is a flexible rule for updating the radius of the simplex ball. Any lower bound on the optimal objective value is permitted, and the lower bound produced by SLMO is one choice. We opt for the use of the best lower bound available at the current iterate. This leads to the Simplex Frank-Wolfe (SFW) method in Alg. 2, which is proved to be linearly convergent in Thm. 3.4.

The second addition makes use of an important observation that SLMO can be split into two parts. The first part is the construction of the simplex ball and the second part is linear optimization over the simplex ball. Linear optimization is much cheaper than construction of the simplex ball. It is therefore more economical to perform linear optimization a few more times for each constructed simplex ball:

𝒑kargmin𝒚{f(𝒚)|𝒚SnS(𝒙k1,dk1)}.{\boldsymbol{p}}_{k}\approx\operatorname*{\arg\min}_{{\boldsymbol{y}}}\Big\{f({\boldsymbol{y}})\ |\ {\boldsymbol{y}}\in S_{n}\cap S({\boldsymbol{x}}_{k-1},d_{k-1})\Big\}.

This {\boldsymbol{p}}_{k} functions like the direction-correction used in the general framework (1.2). This allows us to take advantage of existing FW algorithms. For example, AFW and PFW can be used for this part. This leads to our refined SFW method (rSFW) (see Alg. 3 and Alg. 6). We emphasize that the oracle used in our rSFW method requires only one additional vector addition compared to the standard LMO, so its computational complexity is arguably the cheapest among all existing FW-type algorithms. Our numerical experiments show that rSFW with away-steps and pairwise steps improves performance significantly. The resulting algorithmic scheme is hence different from the LLOO scheme, and we provide a complete convergence analysis. This part may be treated as our second contribution.

Our third contribution is the extension from the simplex case to the general polytope case. We will make use of some fundamental connections between them established in [14]. Since the simplex ball is not defined from any norm, some parts of the extension are highly non-trivial. In particular, the iteration complexity of the extended SFW depends only on the problem dimension n instead of the number of extreme points N of \mathcal{P}, see Thm. 4.4. Computationally, this can all be achieved for three popular polytopes: the hypercube, the flow polytope and the \ell_{1}-ball.

Our final part addresses implementation issues, including adaptive backtracking techniques for choosing the problem parameters L and \mu, and incorporating away-step and pairwise-step FW strategies into the SFW methods. Numerical experiments demonstrate that our SFW methods are highly competitive.

1.3 Organization

In the preceding discussion, we focused only on the framework (1.2) that may lead to linearly convergent FW methods. We avoided specifying the actual conditions on f because various conditions can ensure such a linear rate. In Section 2, we describe such conditions as well as some background on polytopes. We will explain the key concept of LLOO proposed in [14]. Section 3 contains the detailed development of SLMO and the resulting Simplex FW methods (SFW and rSFW) for the case \mathcal{P}=S_{n}. The extension to the general polytope case is conducted in Section 4. Lengthy proofs are moved to the Supplement for the benefit of readability. Section 5 reports some illustrative examples to demonstrate the advantage of the SFW methods over some existing algorithms. We conclude the paper in Section 6.

2 Notation and Background

2.1 Notation

We employ lower-case letters, bold lower-case letters, and capital letters to denote scalars, vectors, and matrices, respectively (e.g., x,𝒙x,{\boldsymbol{x}} and XX). For two column vectors 𝒙,𝒚n{\boldsymbol{x}},{\boldsymbol{y}}\in\mathbb{R}^{n}, max{𝒙,𝒚}\max\{{\boldsymbol{x}},{\boldsymbol{y}}\} is a new column vector that takes the component-wise maximum of 𝒙{\boldsymbol{x}} and 𝒚{\boldsymbol{y}}. The vector min{𝒙,𝒚}\min\{{\boldsymbol{x}},{\boldsymbol{y}}\} is similarly defined. For vectors, we denote the standard Euclidean norm by \|\cdot\| and the standard inner-product by ,\langle\cdot,\cdot\rangle. For a vector 𝒙n{\boldsymbol{x}}\in\mathbb{R}^{n}, a subset CnC\subseteq\mathbb{R}^{n}, and τ>0\tau>0, we define

𝒙+C:={𝒙+𝒚|𝒚C}andτC:={τ𝒚|𝒚C},{\boldsymbol{x}}+C:=\left\{{\boldsymbol{x}}+{\boldsymbol{y}}\ |\ {\boldsymbol{y}}\in C\right\}\qquad\mbox{and}\qquad\tau C:=\left\{\tau{\boldsymbol{y}}\ |\ {\boldsymbol{y}}\in C\right\},

where “:=:=” means “define”.

We let \mathbb{B}({\boldsymbol{x}},r) denote the Euclidean ball of radius r centered at {\boldsymbol{x}}. For matrices, we let \|\cdot\| denote the spectral norm. For a vector {\boldsymbol{x}}\in\mathbb{R}^{n}, we use x_{i} or x(i) to denote the i-th component. For a matrix A, we use A(i) to denote the i-th row of A. The vector {\boldsymbol{1}}_{n} represents the vector with all entries equal to 1, and {\boldsymbol{e}}_{i} is the i-th standard unit vector in \mathbb{R}^{n}, which takes value 1 at its i-th position and 0 elsewhere. Given a set \mathcal{V}, we denote its convex hull by \mbox{Conv}\{\mathcal{V}\}. For any positive integer n, we use the notation [n] to represent the set \{1,\dots,n\}. We use S_{n}:=\{{\boldsymbol{x}}\in\mathbb{R}^{n}\ |\ \sum_{i=1}^{n}x_{i}=1,\ {\boldsymbol{x}}\geq 0\} to denote the unit simplex.

2.2 Smoothness, Strong Convexity and Stepsizes

Throughout the paper, we will assume LL-smoothness and μ\mu-strong convexity of ff.

Definition 1 (Smooth function).

We say that a function f(𝐱):nf({\boldsymbol{x}}):\mathbb{R}^{n}\to\mathbb{R} is LL-smooth over a convex set 𝒫n\mathcal{P}\subset\mathbb{R}^{n}, if for every 𝐱,𝐲𝒫{\boldsymbol{x}},{\boldsymbol{y}}\in\mathcal{P} there holds

f(𝒚)f(𝒙)+𝒚𝒙,f(𝒙)+L2𝒙𝒚2.f({\boldsymbol{y}})\leq f({\boldsymbol{x}})+\langle{\boldsymbol{y}}-{\boldsymbol{x}},\nabla f({\boldsymbol{x}})\rangle+\frac{L}{2}\|{\boldsymbol{x}}-{\boldsymbol{y}}\|^{2}.
Definition 2 (Strongly convex function).

We say that a function f(𝐱):nf({\boldsymbol{x}}):\mathbb{R}^{n}\to\mathbb{R} is μ\mu-strongly convex over a convex set 𝒫n\mathcal{P}\subset\mathbb{R}^{n}, if for every 𝐱,𝐲𝒫{\boldsymbol{x}},{\boldsymbol{y}}\in\mathcal{P} there holds

f(𝒚)f(𝒙)+𝒚𝒙,f(𝒙)+μ2𝒙𝒚2.f({\boldsymbol{y}})\geq f({\boldsymbol{x}})+\langle{\boldsymbol{y}}-{\boldsymbol{x}},\nabla f({\boldsymbol{x}})\rangle+\frac{\mu}{2}\|{\boldsymbol{x}}-{\boldsymbol{y}}\|^{2}.

The above definition, combined with the first-order optimality conditions, implies that for a \mu-strongly convex function f, if {\boldsymbol{x}}^{*}=\operatorname*{\arg\min}_{{\boldsymbol{x}}\in\mathcal{P}}f({\boldsymbol{x}}), then for any {\boldsymbol{x}}\in\mathcal{P} we have

f(𝒙)fμ2𝒙𝒙2.f({\boldsymbol{x}})-f^{*}\geq\frac{\mu}{2}\|{\boldsymbol{x}}-{\boldsymbol{x}}^{*}\|^{2}. (2.1)

This property, while weaker than strong convexity, is what our linear convergence analysis actually relies on, rather than strong convexity itself. A natural generalization of this property is known as the quadratic growth property.

There are three popular step size strategies:

  1. Simple step size:

     \delta_{k}=2/(k+1),\quad k=1,\dots. (2.2)

  2. Line-search step size:

     \delta_{k}=\operatorname*{arg\,min}_{\delta\in[0,1]}f((1-\delta){\boldsymbol{x}}_{k-1}+\delta{\boldsymbol{y}}_{k}),\quad k=1,\dots. (2.3)

  3. Short step size:

     \delta_{k}=\min\left\{1,\frac{\langle\nabla f({\boldsymbol{x}}_{k-1}),{\boldsymbol{x}}_{k-1}-{\boldsymbol{y}}_{k}\rangle}{L\|{\boldsymbol{x}}_{k-1}-{\boldsymbol{y}}_{k}\|^{2}}\right\},\quad k=1,\dots. (2.4)

For the three step size strategies described above, it can be shown that the standard Frank-Wolfe method exhibits the following convergence rates. For a detailed proof, refer to the modern surveys by [19], [23] or [11].

Theorem 2.1.

Let \{{\boldsymbol{x}}_{k}\} be the sequence generated by the standard FW method with step size policy for \{\delta_{k}\} in (2.2), (2.3), or (2.4). Then, for k\geq 1, we have

f(𝒙k)ff(𝒙k),𝒙k𝒚k+12LD2/(k+1),f({\boldsymbol{x}}_{k})-f^{*}\leq\langle\nabla f({\boldsymbol{x}}_{k}),{\boldsymbol{x}}_{k}-{\boldsymbol{y}}_{k+1}\rangle\leq 2LD^{2}/(k+1), (2.5)

where 𝐲k+1=LMO(f(𝐱k),𝒫){\boldsymbol{y}}_{k+1}=\text{LMO}(\nabla f({\boldsymbol{x}}_{k}),\mathcal{P}) and DD is the diameter of 𝒫\mathcal{P}.
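As a concrete illustration of the standard FW method and the step-size rules above, the following minimal numpy sketch (our own toy example, not taken from the paper) minimizes a strongly convex quadratic over S_{n} with the short step size (2.4) and tracks the duality gap \langle\nabla f({\boldsymbol{x}}_{k}),{\boldsymbol{x}}_{k}-{\boldsymbol{y}}_{k+1}\rangle appearing in (2.5).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
Q = A.T @ A / n + np.eye(n)               # smooth and strongly convex quadratic
b = rng.standard_normal(n)
L = np.linalg.eigvalsh(Q).max()           # smoothness constant of f(x) = 0.5 x'Qx + b'x
grad = lambda x: Q @ x + b

def lmo_simplex(c):                       # LMO over S_n: the vertex e_{i*}, i* = argmin_i c_i
    y = np.zeros_like(c)
    y[np.argmin(c)] = 1.0
    return y

x = np.ones(n) / n                        # start at the barycenter of S_n
for k in range(1, 2001):
    g = grad(x)
    y = lmo_simplex(g)
    gap = g @ (x - y)                     # FW duality gap, cf. (2.5)
    if gap <= 1e-8:
        break
    delta = min(1.0, gap / (L * np.dot(x - y, x - y)))   # short step size (2.4)
    x = (1.0 - delta) * x + delta * y
print(k, gap)                             # the gap decays roughly like O(1/k)
```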

2.3 Quantities of Polytope

The quantities reviewed in this part are well defined and investigated in [14] and they are mainly used in the extension of SFW to polytopes.

Let 𝒫\mathcal{P} be a polytope described by linear equations and inequalities, specifically 𝒫={𝒙n|A1𝒙=𝒃1,A2𝒙𝒃2}\mathcal{P}=\{{\boldsymbol{x}}\in\mathbb{R}^{n}|A_{1}{\boldsymbol{x}}={\boldsymbol{b}}_{1},A_{2}{\boldsymbol{x}}\leq{\boldsymbol{b}}_{2}\}, where A2m×nA_{2}\in\mathbb{R}^{m\times n}. Without loss of generality, we assume that all rows of A2A_{2} have been scaled to possess a unit l2l_{2} norm. We denote the set of vertices of 𝒫\mathcal{P} as 𝒱(𝒫)\mathcal{V}(\mathcal{P}) and let N=|𝒱(𝒫)|N=|\mathcal{V}(\mathcal{P})| represent the number of vertices.

Next, we introduce several geometric parameters related to 𝒫\mathcal{P} that will naturally arise in our analysis. The Euclidean diameter of 𝒫\mathcal{P} is defined as D(𝒫)=max𝒙,𝒚𝒫𝒙𝒚D(\mathcal{P})=\max_{{\boldsymbol{x}},{\boldsymbol{y}}\in\mathcal{P}}\lVert{\boldsymbol{x}}-{\boldsymbol{y}}\rVert. We define

ξ(𝒫)=min𝒗𝒱(𝒫)(min{b2(j)A2(j)𝒗j[m],A2(j)𝒗<b2(j)}).\xi(\mathcal{P})=\min_{{\boldsymbol{v}}\in\mathcal{V}(\mathcal{P})}\left(\min\left\{{b}_{2}(j)-A_{2}(j){\boldsymbol{v}}\mid j\in[m],A_{2}(j){\boldsymbol{v}}<b_{2}(j)\right\}\right).

This means that for any inequality constraint defining the polytope and for a given vertex, that vertex either satisfies the constraint with equality or is at least ξ(𝒫)\xi(\mathcal{P}) units away from satisfying it with equality. Let r(A2)r(A_{2}) denote the row rank of A2A_{2}, and let 𝔸(𝒫)\mathbb{A}(\mathcal{P}) represent the set of all r(A2)×nr(A_{2})\times n matrices with linearly independent rows selected from the rows of A2A_{2}. We then define ψ(𝒫)=maxM𝔸(𝒫)M\psi(\mathcal{P})=\max_{M\in\mathbb{A}(\mathcal{P})}\|M\|. Finally, we introduce condition number of 𝒫\mathcal{P} as

η(𝒫)=ψ(𝒫)D(𝒫)/ξ(𝒫).\eta(\mathcal{P})=\psi(\mathcal{P})D(\mathcal{P})/\xi(\mathcal{P}). (2.6)

It is important to note that \eta(\mathcal{P}) is invariant under translation, rotation and scaling of the polytope \mathcal{P}. For convenience we use \mathcal{V},D,\xi,\psi,\eta without explicitly mentioning the polytope when \mathcal{P} is clear from the context. It is worth noting that in many relevant scenarios, particularly in cases where efficient algorithms exist for linear optimization over the given polytope, estimating the parameters D,\xi,\psi is often straightforward. This is particularly true for convex domains encountered in combinatorial optimization, such as flow polytopes, matching polytopes, and matroid polytopes, among others. Furthermore, our algorithm relies primarily on the parameters \eta and D.
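For concreteness, the quantities D, \xi, \psi and hence \eta can be evaluated directly from their definitions for a small polytope. The brute-force sketch below is our own illustration (it enumerates row subsets and vertex pairs, so it is only meant for tiny instances); the example at the end uses the unit square written with unit-norm inequality rows.

```python
import numpy as np
from itertools import combinations

def polytope_constants(A2, b2, vertices):
    """Brute-force evaluation of D, xi, psi and eta = psi * D / xi (tiny instances only)."""
    V = np.asarray(vertices, dtype=float)           # one vertex per row
    # Euclidean diameter D = max_{x,y in V} ||x - y||
    D = max(np.linalg.norm(u - v) for u in V for v in V)
    # xi: over all vertices, the smallest slack of a strictly inactive constraint
    slacks = b2[None, :] - V @ A2.T                 # shape (num_vertices, m)
    xi = min(s[s > 1e-12].min() for s in slacks)
    # psi: max spectral norm over r(A2)-row submatrices with linearly independent rows
    r = np.linalg.matrix_rank(A2)
    psi = 0.0
    for rows in combinations(range(A2.shape[0]), r):
        M = A2[list(rows), :]
        if np.linalg.matrix_rank(M) == r:
            psi = max(psi, np.linalg.norm(M, 2))
    return D, xi, psi, psi * D / xi

# Example: the unit square [0,1]^2 as A2 x <= b2 with unit-norm rows.
A2 = np.vstack([np.eye(2), -np.eye(2)])
b2 = np.array([1.0, 1.0, 0.0, 0.0])
verts = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(polytope_constants(A2, b2, verts))
```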

2.4 LLOO

A major concept proposed by Garber and Hazan [14, Def. 2.5] is LLOO. Consider the problem (1.1). We say a procedure 𝒜(𝒙,d,𝒄)\mathcal{A}({\boldsymbol{x}},d,{\boldsymbol{c}}), where 𝒙𝒫{\boldsymbol{x}}\in\mathcal{P}, d>0d>0, 𝒄n{\boldsymbol{c}}\in\mathbb{R}^{n}, is an LLOO with parameter ρ1\rho\geq 1 for polytope 𝒫\mathcal{P} if 𝒜(𝒙,d,𝒄)\mathcal{A}({\boldsymbol{x}},d,{\boldsymbol{c}}) returns a feasible point 𝒑𝒫{\boldsymbol{p}}\in\mathcal{P} such that

  • (i) \langle{\boldsymbol{y}},{\boldsymbol{c}}\rangle\geq\langle{\boldsymbol{p}},{\boldsymbol{c}}\rangle for all {\boldsymbol{y}}\in\mathbb{B}({\boldsymbol{x}},d)\cap\mathcal{P}, and

  • (ii) \|{\boldsymbol{x}}-{\boldsymbol{p}}\|\leq\rho d.

Suppose the optimal solution {\boldsymbol{x}}^{*} of (1.1) is contained in \mathbb{B}({\boldsymbol{x}},d) and the LLOO \mathcal{A}({\boldsymbol{x}},d,\nabla f({\boldsymbol{x}})) returns a feasible point {\boldsymbol{p}}\in\mathcal{P}. The convexity of f implies the following.

f(𝒙)\displaystyle f({\boldsymbol{x}}^{*}) f(𝒙)+f(𝒙),𝒙𝒙\displaystyle\geq f({\boldsymbol{x}})+\langle\nabla f({\boldsymbol{x}}),{\boldsymbol{x}}^{*}-{\boldsymbol{x}}\rangle
f(𝒙)+f(𝒙),𝒑𝒙(by 𝒙𝔹(𝒙,d) and Property LLOO(i))\displaystyle\geq f({\boldsymbol{x}})+\langle\nabla f({\boldsymbol{x}}),{\boldsymbol{p}}-{\boldsymbol{x}}\rangle\quad\mbox{(by ${\boldsymbol{x}}^{*}\in\mathbb{B}({\boldsymbol{x}},d)$ and Property LLOO(i))}

That is, the LLOO naturally provides a lower bound on the optimal objective value. Such lower bounds will be used in our updating scheme for the radius d. We also note that the LLOO \mathcal{A}({\boldsymbol{x}},d,\nabla f({\boldsymbol{x}})) often returns an optimal solution over a subset \mathcal{P}_{k}, which should be constructed in (1.2) bearing in mind its solution efficiency.

Given an available LLOO procedure, a general FW framework can be developed that enjoys a linear convergence rate over a general polytope \mathcal{P}, provided f is L-smooth and \mu-strongly convex, see [14, Thm. 4.1]. It is proved there that the \ell_{1}-LMO is an LLOO over the simplex polytope S_{n}. The framework is then extended to general polytopes. As we discussed in the Introduction, the \ell_{1}-LMO is much less efficient than the original LMO over the simplex polytope S_{n}. This is one of the motivations for us to develop the simplex LMO below.

3 Simplex FW Method

This section is solely devoted to the case of simplex polytope: Problem (1.1) with 𝒫=Sn\mathcal{P}=S_{n}. We will then extend the obtained results to general polytopes in the next section. We start with the introduction of simplex ball.

3.1 Simplex Ball and Simplex-based Linear Minimization Oracle

In this subsection, we will formally define the concept of the simplex ball and present some of its useful properties. Following this, we will introduce the Simplex-based Linear Minimization Oracle (SLMO) and provide an efficient algorithm for solving it.

Definition 3 (Simplex ball).

Let S0:=Sn1n𝟏n.S_{0}:=S_{n}-\frac{1}{n}{\boldsymbol{1}}_{n}. For any 𝐱n{\boldsymbol{x}}\in\mathbb{R}^{n} and d>0d>0, we define S(𝐱,d)S({\boldsymbol{x}},d) as the simplex ball of radius dd centered at 𝐱{\boldsymbol{x}} by

S(𝒙,d):=𝒙+(nd)S0={(𝒙d𝟏n)+nd𝝀𝝀Sn}.S({\boldsymbol{x}},d):={\boldsymbol{x}}+(nd)S_{0}=\Big\{({\boldsymbol{x}}-d{\boldsymbol{1}}_{n})+nd\boldsymbol{\lambda}\mid\boldsymbol{\lambda}\in S_{n}\Big\}. (3.1)

The following properties of the simplex ball are crucial to our development. The proof is moved to the Supplement A.

Lemma 3.1.

Given 𝐱Sn{\boldsymbol{x}}\in S_{n} and d>0d>0, we have

  1. (1)

    The unit simplex is a simplex ball, i.e., Sn=S(1n𝟏n,1n)S_{n}=S(\frac{1}{n}{\boldsymbol{1}}_{n},\frac{1}{n}). Moreover, we have

    S(𝒙,d)={𝒙+d𝒓|𝒓Conv{n𝐞i𝟏n:i[n]}}.S({\boldsymbol{x}},d)=\Big\{{\boldsymbol{x}}+d{\boldsymbol{r}}|{\boldsymbol{r}}\in\rm{Conv}\{n{\boldsymbol{e}}_{i}-{\boldsymbol{1}}_{n}:i\in[n]\}\Big\}. (3.2)
  2. (2)

    The intersection of two simplex balls, if nonempty, is again a simplex ball. In particular,

    SnS(𝒙,d)=S(𝒙^,d^)where{d^=i=1nmin{d,xi}nx^i=max{xi,d}+d^d,i[n].S_{n}\cap S({\boldsymbol{x}},d)=S(\widehat{{\boldsymbol{x}}},\widehat{d})\ \ \mbox{where}\ \ \left\{\begin{array}[]{l}\widehat{d}=\frac{\sum_{i=1}^{n}\min\{d,\;x_{i}\}}{n}\\ \widehat{x}_{i}=\max\{x_{i},d\}+\widehat{d}-d,\quad i\in[n].\end{array}\right. (3.3)

    Moreover, for 𝒙1,𝒙2Sn{\boldsymbol{x}}_{1},{\boldsymbol{x}}_{2}\in S_{n} and radius d1,d2>0d_{1},d_{2}>0 such that S(𝒙1,d1)S(𝒙2,d2)S({\boldsymbol{x}}_{1},d_{1})\cap S({\boldsymbol{x}}_{2},d_{2})\neq\emptyset, it holds

    S(𝒙1,d1)S(𝒙2,d2)=S(𝒙3,d3),S({\boldsymbol{x}}_{1},d_{1})\cap S({\boldsymbol{x}}_{2},d_{2})=S({\boldsymbol{x}}_{3},d_{3}),

    where

    d3\displaystyle d_{3} =1+i=1nmin{d1x1(i),d2x2(i)}n,\displaystyle=\frac{1+\sum_{i=1}^{n}\min\{d_{1}-x_{1}(i),d_{2}-x_{2}(i)\}}{n}, (3.4)
    x3(i)\displaystyle x_{3}(i) =max{x1(i)d1,x2(i)d2}+d3,i[n]\displaystyle=\max\{x_{1}(i)-d_{1},x_{2}(i)-d_{2}\}+d_{3},\ \ i\in[n]

    Consequently, we have d3min{d1,d2}d_{3}\leq\min\{d_{1},d_{2}\}.

  3. (3)

    The linear optimization over a simplex ball has the following closed-form solution:

    𝒚:=𝒙+(nd)(𝒆i𝟏nn)argmin𝒚S(𝒙,d)𝒄,𝒚withi=argmini[n]ci.\displaystyle{\boldsymbol{y}}^{*}:={\boldsymbol{x}}+(nd)\Big({\boldsymbol{e}}_{i^{*}}-\frac{{\boldsymbol{1}}_{n}}{n}\Big)\in\operatorname*{\arg\min}_{{\boldsymbol{y}}\in S({\boldsymbol{x}},d)}\ \langle{\boldsymbol{c}},{\boldsymbol{y}}\rangle\ \ \mbox{with}\ \ i^{*}=\operatorname*{\arg\min}_{i\in[n]}c_{i}.
  4. (4)

    The diameter of the simplex ball S(𝒙,d)S({\boldsymbol{x}},d) is 2nd\sqrt{2}nd, i.e., max𝒚1,𝒚2S(𝒙,d)𝒚1𝒚2=2nd.\max_{{\boldsymbol{y}}_{1},{\boldsymbol{y}}_{2}\in S({\boldsymbol{x}},d)}\lVert{\boldsymbol{y}}_{1}-{\boldsymbol{y}}_{2}\rVert=\sqrt{2}nd.

  5. (5)

    For any point 𝒚Sn{\boldsymbol{y}}\in S_{n}, if 𝒙𝒚d\lVert{\boldsymbol{x}}-{\boldsymbol{y}}\rVert\leq d, then 𝒚S(𝒙,d){\boldsymbol{y}}\in S({\boldsymbol{x}},d). Moreover, for any point 𝒚S(𝒙,d){\boldsymbol{y}}\in S({\boldsymbol{x}},d), we have 𝒚𝒙nd\lVert{\boldsymbol{y}}-{\boldsymbol{x}}\rVert\leq nd.

We now give a formal definition of our LMO based on simplex ball.

Definition 4 (SLMO).

Given a linear objective 𝐜n{\boldsymbol{c}}\in\mathbb{R}^{n}, radius d>0d>0 and a point 𝐱Sn{\boldsymbol{x}}\in S_{n}, a solution 𝐲SLMO(𝐱,d,𝐜){\boldsymbol{y}}^{*}\in\rm{SLMO}({\boldsymbol{x}},d,{\boldsymbol{c}}) is called simplex-based linear minimization oracle if it solves the following optimization problem

min𝒚,𝒄\displaystyle\min\ \langle{\boldsymbol{y}},{\boldsymbol{c}}\rangle\quad s.t.𝐲S(𝐱,d)Sn.\displaystyle\rm{s.t.}\quad{\boldsymbol{y}}\in S({\boldsymbol{x}},d)\cap S_{n}. (3.5)

We have proved in Lemma 3.1(5) that 𝔹(𝒙,d)S(𝒙,d)𝔹(𝒙,nd)\mathbb{B}({\boldsymbol{x}},d)\subseteq S({\boldsymbol{x}},d)\subseteq\mathbb{B}({\boldsymbol{x}},nd). Therefore, for any 𝒚𝔹(𝒙,d)Sn{\boldsymbol{y}}\in\mathbb{B}({\boldsymbol{x}},d)\cap S_{n}, we must have 𝒚,𝒄𝒚,𝒄.\langle{\boldsymbol{y}},{\boldsymbol{c}}\rangle\geq\langle{\boldsymbol{y}}^{*},{\boldsymbol{c}}\rangle. This is the first property of LLOO. Moreover, since both 𝒙,𝒚S(𝒙,d){\boldsymbol{x}},{\boldsymbol{y}}^{*}\in S({\boldsymbol{x}},d), we must have 𝒙𝒚ρd\|{\boldsymbol{x}}-{\boldsymbol{y}}^{*}\|\leq\rho d with ρ=n\rho=n. This leads to the following key result.

Lemma 3.2.

Given 𝐱𝒫{\boldsymbol{x}}\in\mathcal{P}, d>0d>0 and 𝐜n{\boldsymbol{c}}\in\mathbb{R}^{n} such that S(𝐱,d)SnS({\boldsymbol{x}},d)\cap{S}_{n}\not=\emptyset, then SLMO(𝐱,d,𝐜)\rm{SLMO}({\boldsymbol{x}},d,{\boldsymbol{c}}) is an LLOO 𝒜(𝒙,d,𝒄)\mathcal{A}({\boldsymbol{x}},d,{\boldsymbol{c}}) with ρ=n\rho=n.

The implication of this result is far-reaching because the framework developed in [14] can be followed to get a linearly convergent algorithm with SLMO. An even more important result is that SLMO problem (3.5) can be solved by the following simple algorithm.

Algorithm 1 SLMO(𝒙,d,𝒄)\mbox{SLMO}({\boldsymbol{x}},d,{\boldsymbol{c}})
0:  point 𝒙Sn{\boldsymbol{x}}\in S_{n}, linear objective 𝒄n{\boldsymbol{c}}\in\mathbb{R}^{n}, radius d>0d>0.
1:  d^i=1nmin{d,xi}n\widehat{d}\leftarrow\frac{\sum_{i=1}^{n}\min\{d,x_{i}\}}{n}
2:  𝒙^𝒙min{𝒙,d𝟏n}+d^𝟏n\widehat{{\boldsymbol{x}}}\leftarrow{\boldsymbol{x}}-\min\{{\boldsymbol{x}},d{\boldsymbol{1}}_{n}\}+\widehat{d}{\boldsymbol{1}}_{n}
3:  𝒚+𝒙^d^𝟏n{\boldsymbol{y}}_{+}\leftarrow\widehat{{\boldsymbol{x}}}-\widehat{d}{\boldsymbol{1}}_{n}
4:  iargmini[n]cii^{*}\leftarrow\operatorname*{\arg\min}_{i\in[n]}c_{i}
5:  𝒚𝒚++nd^𝒆i{\boldsymbol{y}}^{*}\leftarrow{\boldsymbol{y}}_{+}+n\widehat{d}\;{\boldsymbol{e}}_{i^{*}}
5:  𝒚{\boldsymbol{y}}^{*}.

The algorithm follows two basic steps. First, it represents the constraint set as a single simplex ball S(\widehat{{\boldsymbol{x}}},\widehat{d}) via (3.3). Second, it uses the closed-form solution of linear optimization over a simplex ball (Lemma 3.1(3)) to find the optimal solution.
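For illustration, a direct numpy transcription of Alg. 1 (a sketch; variable names are ours) might look as follows; the quick check at the end confirms that the output stays in S_n.

```python
import numpy as np

def slmo(x, d, c):
    """SLMO(x, d, c): linear minimization over S(x, d) ∩ S_n, following Alg. 1."""
    n = len(x)
    d_hat = np.minimum(x, d).sum() / n        # Line 1, cf. (3.3)
    x_hat = x - np.minimum(x, d) + d_hat      # Line 2
    y = x_hat - d_hat                         # Line 3: y_+ = x_hat - d_hat * 1_n
    y[np.argmin(c)] += n * d_hat              # Lines 4-5: add n * d_hat at the best coordinate
    return y

# quick sanity check: the output sums to one and is nonnegative
rng = np.random.default_rng(0)
x = rng.dirichlet(np.ones(8)); c = rng.standard_normal(8)
y = slmo(x, 0.2, c)
assert abs(y.sum() - 1.0) < 1e-12 and (y >= -1e-12).all()
```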

Lemma 3.3.

Alg. 1 finds an optimal solution to Problem (3.5).

Proof.

First, by Lemma 3.1(2), we have S(𝒙,d)Sn=S(𝒙^,d^)S({\boldsymbol{x}},d)\cap S_{n}=S(\widehat{{\boldsymbol{x}}},\widehat{d}), where the definitions of 𝒙^\widehat{{\boldsymbol{x}}} and d^\widehat{d} are given in (3.3). Consequently, Problem (3.5) is equivalent to min𝒚S(𝒙^,d^)𝒚,𝒄\min_{{\boldsymbol{y}}\in S(\widehat{{\boldsymbol{x}}},\widehat{d})}\langle{\boldsymbol{y}},{\boldsymbol{c}}\rangle. Note that this is the same form as the problem in Lemma 3.1(3). Thus, we have

𝒚=𝒙^+d^(n𝒆i𝟏n)=max{𝒙,d𝟏n}d𝟏n+nd^𝒆i.{\boldsymbol{y}}^{*}=\widehat{{\boldsymbol{x}}}+\widehat{d}(n{\boldsymbol{e}}_{i^{*}}-{\boldsymbol{1}}_{n})=\max\{{\boldsymbol{x}},d{\boldsymbol{1}}_{n}\}-d{\boldsymbol{1}}_{n}+n\widehat{d}\;{\boldsymbol{e}}_{i^{*}}.

Consequently, Alg. 1 solves Problem (3.5). ∎

Remark 1.

(Comparison with \mbox{LMO}({\boldsymbol{c}},S_{n}) and \ell_{1}-\mbox{LMO}({\boldsymbol{x}},d,{\boldsymbol{c}})) If we treat the element-wise minimum between two vectors as a basic operation, then SLMO requires only one extra basic operation, one more vector summation, and one more vector addition compared to the original LMO over the simplex S_{n}. Therefore, its total exact complexity is 4n flops, making it nearly as efficient as \mbox{LMO}({\boldsymbol{c}},S_{n}). However, \ell_{1}-\mbox{LMO}({\boldsymbol{x}},d,{\boldsymbol{c}}) involves a sorting operation, whose overall complexity is usually O(n\log(n)). It also involves a few more vector additions. As will be illustrated in Fig. 2(c), it is far less efficient than the original \mbox{LMO}({\boldsymbol{c}},S_{n}) and SLMO. Therefore, we expect that a linearly convergent algorithm with SLMO should be efficient as well. We develop it below.

3.2 SFW: Simplex Frank-Wolfe Method

In this subsection, we present a new variant of Frank-Wolfe method called Simplex Frank-Wolfe (abbreviated as SFW), obtained by replacing the LMO with SLMO. The algorithm is formally described as follows.

Algorithm 2 Simplex Frank-Wolfe Method: SFW
0:  𝒙0Sn{\boldsymbol{x}}_{0}\in S_{n}, initial lower bound B0B_{0}.
1:  Set: d02(f(𝒙0)B0)μ.d_{0}\leftarrow\sqrt{\frac{2(f({\boldsymbol{x}}_{0})-B_{0})}{\mu}}.
2:  for k=1,k=1,\dots do
3:   Compute 𝒚kSLMO(𝒙k1,dk1,f(𝒙k1)){\boldsymbol{y}}_{k}\in\mbox{SLMO}({{\boldsymbol{x}}}_{k-1},{d}_{k-1},\nabla f({\boldsymbol{x}}_{k-1})).
4:   Compute the working lower bound: Bkwf(𝒙k1)+f(𝒙k1),𝒚k𝒙k1B_{k}^{w}\leftarrow f({\boldsymbol{x}}_{k-1})+\langle\nabla f({\boldsymbol{x}}_{k-1}),{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rangle.
5:   Update best bound Bkmax{Bk1,Bkw}B_{k}\leftarrow\max\{B_{k-1},B_{k}^{w}\}.
6:   Set 𝒙k(1δk)𝒙k1+δk𝒚k{\boldsymbol{x}}_{k}\leftarrow(1-\delta_{k}){\boldsymbol{x}}_{k-1}+\delta_{k}{\boldsymbol{y}}_{k} for some δk[0,1]\delta_{k}\in[0,1].
7:   Set: dk2(f(𝒙k)Bk)μd_{k}\leftarrow\sqrt{\frac{2(f({\boldsymbol{x}}_{k})-B_{k})}{\mu}}.
8:  end for

Before stating its convergence rate result, we make the following remarks regarding Alg. 2.

Remark 2.

(Choice of d_{k} update strategy) We could follow the linear shrinking rule for d_{k} in [14, 24]: d_{k}=\gamma d_{k-1} with properly chosen \gamma<1. The linear convergence rate would then be guaranteed by invoking the LLOO property of SLMO (Lemma 3.2). We do not take this route for the convergence analysis for the following two reasons. One is that the key property \mathbb{B}({\boldsymbol{x}},d)\subseteq S({\boldsymbol{x}},d)\subseteq\mathbb{B}({\boldsymbol{x}},nd) ensuring LLOO will have to be modified when it comes to the general polytope, as our simplex ball is not based on any norm. The corresponding relationship becomes \mathbb{B}({\boldsymbol{x}},dD/\eta)\subseteq S_{\mathcal{P}}({\boldsymbol{x}},d)\subseteq\mathbb{B}({\boldsymbol{x}},(n+1)dD), see Lemma 4.3, where S_{\mathcal{P}} is defined. At least at a technical level, the original LLOO would have to be generalized to suit this extension, and the corresponding proofs would also have to be reproduced. The proof we provide below is more direct. The other reason is that we use the best lower bound B_{k} provided by the algorithm to define d_{k}. This choice is important because we are going to incorporate other acceleration strategies into SFW, resulting in rSFW. The stopping criterion used there will also be based on the best lower bounds obtained. Therefore, the convergence analysis for SFW naturally adapts to rSFW.

At the k-th iteration, Alg. 2 first invokes SLMO to find the minimizer {\boldsymbol{y}}_{k} of the first-order approximation of the objective function within the region S({\boldsymbol{x}}_{k-1},d_{k-1})\cap S_{n}. Subsequently, using a suitable step size \delta_{k}, a convex combination of {\boldsymbol{y}}_{k} and {\boldsymbol{x}}_{k-1} is computed to update the iterate to {\boldsymbol{x}}_{k}. Finally, the algorithm updates the radius d, ensuring that the optimal solution {\boldsymbol{x}}^{*} progressively falls within a smaller neighborhood S({\boldsymbol{x}}_{k},d_{k}). For Alg. 2, we propose the following simple step size, as an alternative to the simple step size selection in the original Frank-Wolfe algorithm:

δk=μ2Ln2.\delta_{k}=\frac{\mu}{2Ln^{2}}. (3.6)
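A minimal numpy sketch of Alg. 2 with the simple step size (3.6) is given below. It is our own illustration under a toy setup: it assumes the constants \mu and L are known, and it repeats the slmo routine from the sketch after Alg. 1 in compact form so that the snippet is self-contained.

```python
import numpy as np

def slmo(x, d, c):                          # Alg. 1, in compact form
    n = len(x)
    d_hat = np.minimum(x, d).sum() / n
    y = x - np.minimum(x, d)                # = x_hat - d_hat * 1_n
    y[np.argmin(c)] += n * d_hat
    return y

def sfw(f, grad, x0, B0, mu, L, iters=500):
    """Simplex Frank-Wolfe (Alg. 2) with the simple step size (3.6)."""
    n = len(x0)
    x, B = x0.copy(), B0
    d = np.sqrt(max(2.0 * (f(x) - B) / mu, 0.0))      # Line 1
    delta = mu / (2.0 * L * n ** 2)                    # step size (3.6)
    for _ in range(iters):
        g = grad(x)
        y = slmo(x, d, g)                              # Line 3
        B = max(B, f(x) + g @ (y - x))                 # Lines 4-5: best lower bound
        x = (1.0 - delta) * x + delta * y              # Line 6
        d = np.sqrt(max(2.0 * (f(x) - B) / mu, 0.0))   # Line 7: radius update
    return x, B
```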

We have the following linear convergence rate result. The induction technique in the proof below was taken from [14, Lemma 4.3].

Theorem 3.4.

Let \{{\boldsymbol{x}}_{k}\} be the sequence generated by Alg. 2 with step size policy for \{\delta_{k}\} in (2.3), (2.4), or (3.6). Then, for k\geq 0, we have

f(𝒙k)ff(𝒙k)Bkμd022eμ4Ln2k.f({\boldsymbol{x}}_{k})-f^{*}\leq f({\boldsymbol{x}}_{k})-B_{k}\leq\frac{\mu d_{0}^{2}}{2}e^{-\frac{\mu}{4Ln^{2}}k}. (3.7)
Proof.

We first claim that 𝒙S(𝒙k,dk){\boldsymbol{x}}^{*}\in S({\boldsymbol{x}}_{k},d_{k}) and that f(𝒙k)Bkμdk22f({\boldsymbol{x}}_{k})-B_{k}\leq\frac{\mu d_{k}^{2}}{2}. We prove this by induction. First, we have

μd022=f(𝒙0)B0f(𝒙0)f(a)μ2𝒙0𝒙2,\frac{\mu d_{0}^{2}}{2}=f({\boldsymbol{x}}_{0})-B_{0}\geq f({\boldsymbol{x}}_{0})-f^{*}\stackrel{{\scriptstyle(a)}}{{\geq}}\frac{\mu}{2}\lVert{\boldsymbol{x}}_{0}-{\boldsymbol{x}}^{*}\rVert^{2},

where (a)(a) comes from (2.1). This implies that 𝒙0𝒙d0\lVert{\boldsymbol{x}}_{0}-{\boldsymbol{x}}^{*}\rVert\leq d_{0}, and by Lemma 3.1(5), we have 𝒙S(𝒙0,d0){\boldsymbol{x}}^{*}\in S({\boldsymbol{x}}_{0},d_{0}). Therefore, the claim holds for k=0k=0.

Now suppose that {\boldsymbol{x}}^{*}\in S({\boldsymbol{x}}_{t},d_{t}) and f({\boldsymbol{x}}_{t})-B_{t}\leq\frac{\mu d_{t}^{2}}{2} for all t\leq k-1. Let \gamma:=\frac{\mu}{2Ln^{2}}\leq 1. For the step size policy \delta_{k} in (2.3) (exact line search) or (3.6), we have

f(𝒙k)\displaystyle f({\boldsymbol{x}}_{k}) =\displaystyle= f(𝒙k1+δk(𝒚k𝒙k1))f(𝒙k1+γ(𝒚k𝒙k1))\displaystyle f({\boldsymbol{x}}_{k-1}+\delta_{k}({\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}))\leq f({\boldsymbol{x}}_{k-1}+\gamma({\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1})) (3.8)
\displaystyle\leq f(𝒙k1)+γf(𝒙k1),𝒚k𝒙k1+Lγ22𝒚k𝒙k12.\displaystyle f({\boldsymbol{x}}_{k-1})+\gamma\langle\nabla f({\boldsymbol{x}}_{k-1}),{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rangle+\frac{L\gamma^{2}}{2}\lVert{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rVert^{2}.

Similarly, for the step size policy (2.4) (short stepsize), we have

f(𝒙k)\displaystyle f({\boldsymbol{x}}_{k}) \displaystyle\leq f(𝒙k1)+δkf(𝒙k1),𝒚k𝒙k1+Lδk22𝒚k𝒙k12\displaystyle f({\boldsymbol{x}}_{k-1})+\delta_{k}\langle\nabla f({\boldsymbol{x}}_{k-1}),{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rangle+\frac{L\delta_{k}^{2}}{2}\lVert{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rVert^{2} (3.9)
\displaystyle\leq f(𝒙k1)+γf(𝒙k1),𝒚k𝒙k1+Lγ22𝒚k𝒙k12.\displaystyle f({\boldsymbol{x}}_{k-1})+\gamma\langle\nabla f({\boldsymbol{x}}_{k-1}),{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rangle+\frac{L\gamma^{2}}{2}\lVert{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rVert^{2}.

Combining (3.8) and (3.9), we have

f(𝒙k)\displaystyle f({\boldsymbol{x}}_{k})\leq f(𝒙k1)+γf(𝒙k1),𝒚k𝒙k1+Lγ22𝒚k𝒙k12\displaystyle f({\boldsymbol{x}}_{k-1})+\gamma\langle\nabla f({\boldsymbol{x}}_{k-1}),{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rangle+\frac{L\gamma^{2}}{2}\lVert{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rVert^{2}
(b)\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}} (1γ)f(𝒙k1)+γBkw+Lγ22𝒚k𝒙k12\displaystyle(1-\gamma)f({\boldsymbol{x}}_{k-1})+\gamma B_{k}^{w}+\frac{L\gamma^{2}}{2}\lVert{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rVert^{2}
(c)\displaystyle\stackrel{{\scriptstyle(c)}}{{\leq}} (1γ)(f(𝒙k1)Bk1)+Bk+Lγ22𝒚k𝒙k12\displaystyle(1-\gamma)(f({\boldsymbol{x}}_{k-1})-B_{k-1})+B_{k}+\frac{L\gamma^{2}}{2}\lVert{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rVert^{2}

holds for step size policy (2.3), (2.4), or (3.6), where (b)(b) comes from the definition of BkwB_{k}^{w}, and (c)(c) is due to Bkmax{Bk1,Bkw}B_{k}\geq\max\{B_{k-1},B_{k}^{w}\}. Subtracting BkB_{k} from the both sides of the above inequality, we obtain

f(𝒙k)Bk\displaystyle f({\boldsymbol{x}}_{k})-B_{k}\leq (1γ)(f(𝒙k1)Bk1)+Lγ22𝒚k𝒙k12\displaystyle(1-\gamma)(f({\boldsymbol{x}}_{k-1})-B_{k-1})+\frac{L\gamma^{2}}{2}\lVert{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rVert^{2}
(d)\displaystyle\stackrel{{\scriptstyle(d)}}{{\leq}} (1γ)μ2dk12+Lγ22n2dk12=[(1γ)μ2+Lγ2n22]dk12,\displaystyle(1-\gamma)\frac{\mu}{2}d_{k-1}^{2}+\frac{L\gamma^{2}}{2}n^{2}d_{k-1}^{2}=\left[(1-\gamma)\frac{\mu}{2}+\frac{L\gamma^{2}n^{2}}{2}\right]d_{k-1}^{2},

where (d)(d) is due to our inductive hypothesis and Lemma 3.1(5). By plugging in the value of γ\gamma, and using 1xex1-x\leq e^{-x}, we have that

f(𝒙k)Bkμ2(1μ4Ln2)dk12μ2eμ4Ln2dk12.f({\boldsymbol{x}}_{k})-B_{k}\leq\frac{\mu}{2}\left(1-\frac{\mu}{4Ln^{2}}\right)d_{k-1}^{2}\leq\frac{\mu}{2}e^{-\frac{\mu}{4Ln^{2}}}d_{k-1}^{2}. (3.10)

With the definition of dkd_{k}, we have f(𝒙k)Bk=μ2dk2f({\boldsymbol{x}}_{k})-B_{k}=\frac{\mu}{2}d_{k}^{2}. The bound in (3.10) implies

dkeμ8Ln2dk1.d_{k}\leq e^{-\frac{\mu}{8Ln^{2}}}{d}_{k-1}. (3.11)

By the inductive hypothesis, we know that 𝒙S(𝒙t,dt){\boldsymbol{x}}^{*}\in S({\boldsymbol{x}}_{t},d_{t}) holds for all tk1t\leq k-1. Thus Bt+1wB_{t+1}^{w} is a valid lower bound of ff^{*}, and consequently, BkB_{k} is also a lower bound of ff^{*}. Now by (2.1), we have

𝒙k𝒙22μ(f(𝒙k)f)2μ(f(𝒙k)Bk)=dk2.\lVert{\boldsymbol{x}}_{k}-{\boldsymbol{x}}^{*}\rVert^{2}\leq\frac{2}{\mu}(f({\boldsymbol{x}}_{k})-f^{*})\leq\frac{2}{\mu}(f({\boldsymbol{x}}_{k})-B_{k})=d_{k}^{2}.

This implies that 𝒙S(𝒙k,dk){\boldsymbol{x}}^{*}\in S({\boldsymbol{x}}_{k},d_{k}) by Lemma 3.1(5). Therefore, we have completed the proof of the claim.

We now start to prove the conclusion in Theorem 3.4. From the earlier proof, we know that BkfB_{k}\leq f^{*}, thus confirming the first part of the inequality. By the definition of dkd_{k} and the established reduction inequality (3.11), we have

f(𝒙k)Bk=μdk22μd022eμ4Ln2k.f({\boldsymbol{x}}_{k})-B_{k}=\frac{\mu d_{k}^{2}}{2}\leq\frac{\mu d_{0}^{2}}{2}e^{-\frac{\mu}{4Ln^{2}}k}.

The proof is thus completed. ∎

Remark 3.

(Iteration complexity of SFW) If we skip Lines 4–5 and replace Line 7 in Alg. 2 with d_{k}\leftarrow\sqrt{\frac{2\langle\nabla f({\boldsymbol{x}}_{k-1}),{\boldsymbol{x}}_{k-1}-{\boldsymbol{y}}_{k}\rangle}{\mu}}, the algorithm remains correct and preserves its convergence guarantee. This modification avoids computing the objective value f({\boldsymbol{x}}_{k}), making the algorithm more practical and simpler when the objective is expensive or difficult to evaluate. In this case, assuming that there is an oracle to obtain the gradient \nabla f({\boldsymbol{x}}) at each iteration, we are able to give the exact number of flops required to compute the next iterate. The SLMO part to get {\boldsymbol{y}}_{k} is 4n flops, and computing the radius d_{k} requires an additional 3n flops. For the short step size strategy, evaluating the step size \delta_{k} incurs 2n flops, and updating the iterate {\boldsymbol{x}}_{k} takes another 3n flops. Thus, SFW needs 10n flops with the simple step size, or 12n flops with the short step size, yet still achieves linear convergence on the simplex. To our knowledge, this is the lowest per-iteration cost among FW variants with linear convergence.

3.3 Refining SFW

In this subsection, we aim to further reduce the computational overhead of the proposed oracle as much as possible, while retaining the linear convergence rate. The motivation stems from an important observation: \mbox{SLMO}({\boldsymbol{x}},d,{\boldsymbol{c}}) can be split into two parts. The first part is the construction of a new simplex ball S(\widehat{{\boldsymbol{x}}},\widehat{d})=S({\boldsymbol{x}},d)\cap S_{n}. For easy reference, we call it SLMO-1; it corresponds to Lines 1-3 in Alg. 1. The second part, which finds an optimal solution over S(\widehat{{\boldsymbol{x}}},\widehat{d}), is referred to as SLMO-2 and corresponds to Lines 4-5 in Alg. 1. It is easy to see that the computation of SLMO-2 requires only one more vector addition compared to the standard LMO over S_{n}, and hence its cost is already kept to a minimum. The extra computation of SLMO comes from SLMO-1. Since the new simplex ball has already been constructed, we would like to carry out the SLMO-2 part a few more times. This is roughly to find an approximate solution {\boldsymbol{p}} to the problem

minf(𝒚)s.t.𝒚S(𝒙^,d^)\min\;f({\boldsymbol{y}})\quad\mbox{s.t.}\quad{\boldsymbol{y}}\in S(\widehat{{\boldsymbol{x}}},\widehat{d})

with starting point {\boldsymbol{x}}. We control this refinement step so that the overall computation remains O(n) and the overall convergence rate remains linear. The overall algorithm is given in Alg. 3 and is called the Refined SFW (rSFW).

Algorithm 3 rSFW: Refined Simplex Frank-Wolfe Method
0:  Radius contraction ratio ρ>1\rho>1, initial lower bound B0B_{0}.
1:  Set: d01n,𝒙0𝟏nn,J8ρ2n2Lμ,𝒙¯0𝒙0,d¯0d0d_{0}\leftarrow\frac{1}{n},{\boldsymbol{x}}_{0}\leftarrow\frac{{\boldsymbol{1}}_{n}}{n},J\leftarrow\frac{8\rho^{2}n^{2}L}{\mu},\bar{{\boldsymbol{x}}}_{0}\leftarrow{\boldsymbol{x}}_{0},\bar{d}_{0}\leftarrow d_{0}.
2:  for k=1,k=1,\dots do
3:   Set: 𝒑0𝒙k1,C0Bk1{\boldsymbol{p}}_{0}\leftarrow{\boldsymbol{x}}_{k-1},C_{0}\leftarrow B_{k-1}.
4:   (SLMO-1) construct the new Simplex ball: S(𝒙^k1,d^k1)=S(𝒙¯k1,d¯k1)SnS(\widehat{{\boldsymbol{x}}}_{k-1},\widehat{d}_{k-1})=S({\bar{{\boldsymbol{x}}}}_{k-1},\bar{d}_{k-1})\cap S_{n}.
5:   for j=1,,Jj=1,\dots,J do
6:    (SLMO-2): Compute 𝒚j=argmin𝒚S(𝒙^k1,d^k1)f(𝒑j1),𝒚{\boldsymbol{y}}_{j}={\operatorname*{\arg\min}}_{{\boldsymbol{y}}\in S(\widehat{{\boldsymbol{x}}}_{k-1},\widehat{d}_{k-1})}\langle\nabla f({\boldsymbol{p}}_{j-1}),{\boldsymbol{y}}\rangle.
7:    Compute the current lower bound: Cjwf(𝒑j1)+f(𝒑j1),𝒚j𝒑j1C_{j}^{w}\leftarrow f({\boldsymbol{p}}_{j-1})+\langle\nabla f({\boldsymbol{p}}_{j-1}),{\boldsymbol{y}}_{j}-{\boldsymbol{p}}_{j-1}\rangle.
8:    Update the best lower bound Cjmax{Cj1,Cjw}C_{j}\leftarrow\max\{C_{j-1},C_{j}^{w}\}.
9:    if f(𝒑j)Cjμ2ρ2d^k12f({\boldsymbol{p}}_{j})-C_{j}\leq\frac{\mu}{2\rho^{2}}\widehat{d}_{k-1}^{2} then
10:     Break the inner loop.
11:    end if
12:    Set 𝒑j(1δj)𝒑j1+δj𝒚j{\boldsymbol{p}}_{j}\leftarrow(1-\delta_{j}){\boldsymbol{p}}_{j-1}+\delta_{j}{\boldsymbol{y}}_{j} for some δj[0,1]\delta_{j}\in[0,1].
13:   end for
14:   Set: 𝒙k𝒑j,dkd^k1ρ,BkCj{\boldsymbol{x}}_{k}\leftarrow{\boldsymbol{p}}_{j},d_{k}\leftarrow\frac{\widehat{d}_{k-1}}{\rho},B_{k}\leftarrow C_{j}.
15:   Find (𝒙¯k,d¯k)(\bar{{\boldsymbol{x}}}_{k},\bar{d}_{k}) such that S(𝒙¯k,d¯k)=S(𝒙k,dk)S(𝒙^k1,d^k1)S(\bar{{\boldsymbol{x}}}_{k},\bar{d}_{k})=S({\boldsymbol{x}}_{k},d_{k})\cap S(\widehat{{\boldsymbol{x}}}_{k-1},\widehat{d}_{k-1}) by using (3.4).
16:  end for

Alg. 3 keeps three sequences {(𝒙k,dk)}\{({\boldsymbol{x}}_{k},d_{k})\}, {(𝒙¯k,d¯k)}\{(\bar{{\boldsymbol{x}}}_{k},\bar{d}_{k})\}, and {(𝒙^k,d^k)}\{(\widehat{{\boldsymbol{x}}}_{k},\widehat{d}_{k})\}, each associated with a simplex ball, namely S(𝒙k,dk)S({\boldsymbol{x}}_{k},d_{k}), S(𝒙¯k,d¯k)S(\bar{{\boldsymbol{x}}}_{k},\bar{d}_{k}), and S(𝒙^k,d^k)S(\widehat{{\boldsymbol{x}}}_{k},\widehat{d}_{k}). Starting with (𝒙0,d0)=(𝒙¯0,d¯0)({\boldsymbol{x}}_{0},d_{0})=(\bar{{\boldsymbol{x}}}_{0},\bar{d}_{0}), we define (𝒙^0,d^0)(\widehat{{\boldsymbol{x}}}_{0},\widehat{d}_{0}) such that S(𝒙^0,d^0)=S(𝒙¯0,d¯0)SnS(\widehat{{\boldsymbol{x}}}_{0},\widehat{d}_{0})=S(\bar{{\boldsymbol{x}}}_{0},\bar{d}_{0})\cap S_{n}. At the kkth iteration, we first compute

𝒙kargmin{f(𝒚)|𝒚S(𝒙^k1,d^k1)}anddk=d^k1/ρ.{\boldsymbol{x}}_{k}\approx\operatorname*{\arg\min}\left\{f({\boldsymbol{y}})\ |\ {\boldsymbol{y}}\in S(\widehat{{\boldsymbol{x}}}_{k-1},\widehat{d}_{k-1})\right\}\quad\mbox{and}\quad d_{k}=\widehat{d}_{k-1}/\rho.

We then define (𝒙¯k,d¯k)(\bar{{\boldsymbol{x}}}_{k},\bar{d}_{k}) by its simplex ball, which satisfies S(𝒙¯k,d¯k)=S(𝒙k,dk)S(𝒙^k1,d^k1)S(\bar{{\boldsymbol{x}}}_{k},\bar{d}_{k})=S({\boldsymbol{x}}_{k},d_{k})\cap S(\widehat{{\boldsymbol{x}}}_{k-1},\widehat{d}_{k-1}). We further define the iterate (𝒙^k,d^k)(\widehat{{\boldsymbol{x}}}_{k},\widehat{d}_{k}) by its simplex ball satisfying S(𝒙^k,d^k)=S(𝒙¯k,d¯k)SnS(\widehat{{\boldsymbol{x}}}_{k},\widehat{d}_{k})=S(\bar{{\boldsymbol{x}}}_{k},\bar{d}_{k})\cap S_{n}. They can all be efficiently computed via the formula (3.4).
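Line 15 of Alg. 3 only needs the formula (3.4); a small numpy helper for it (our own sketch; it assumes both centers lie in S_n and the intersection is nonempty) could read as follows.

```python
import numpy as np

def intersect_simplex_balls(x1, d1, x2, d2):
    """Return (x3, d3) with S(x3, d3) = S(x1, d1) ∩ S(x2, d2), cf. (3.4).
    Assumes x1, x2 ∈ S_n and a nonempty intersection."""
    n = len(x1)
    d3 = (1.0 + np.minimum(d1 - x1, d2 - x2).sum()) / n
    x3 = np.maximum(x1 - d1, x2 - d2) + d3
    return x3, d3
```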

A great advantage of Alg. 3 is its computation of {\boldsymbol{x}}_{k}. The oracle we call is SLMO-2, which ensures that the per-iteration complexity remains the same as that of the original FW algorithm. It is also important to highlight that the inner loop of our algorithm (Lines 5-13) follows the standard FW algorithm. Its primary goal is to find a solution {\boldsymbol{p}}_{j} that satisfies f({\boldsymbol{p}}_{j})-C_{j}\leq\frac{\mu}{2\rho^{2}}\widehat{d}_{k-1}^{2}. Therefore, various speedup techniques for the classical FW algorithm can be directly applied to this inner loop without interfering with the outer loop of the algorithm. These include the ‘away-step’ and ‘pairwise’ variants of FW proposed by [22], the ‘fully-corrective’ variant of FW proposed by [19], as well as the warm-start technique suggested by [11].
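To fix ideas, a compact transcription of Alg. 3 is sketched below. This is our own illustration under toy conventions: the inner loop is plain FW with short steps (2.4), the trivial lower bound B_0 = -∞ is used, and the constants \mu and L are assumed known; the intersection steps inline (3.3) and (3.4) directly.

```python
import numpy as np

def rsfw(f, grad, mu, L, n, rho=2.0, outer=30):
    """Refined Simplex FW (Alg. 3); the inner loop is plain FW with short steps."""
    x = np.ones(n) / n                     # x_0 = 1_n / n
    d = 1.0 / n                            # d_0 = 1 / n
    xb, db = x.copy(), d                   # (x_bar_0, d_bar_0) = (x_0, d_0)
    B = -np.inf                            # a (trivial) initial lower bound B_0
    J = int(np.ceil(8 * rho**2 * n**2 * L / mu))
    for _ in range(outer):
        # SLMO-1: S(x_hat, d_hat) = S(x_bar, d_bar) ∩ S_n, cf. (3.3)
        d_hat = np.minimum(xb, db).sum() / n
        x_hat = xb - np.minimum(xb, db) + d_hat
        p, C = x.copy(), B
        for _ in range(J):
            g = grad(p)
            # SLMO-2: closed-form LMO over S(x_hat, d_hat), cf. Lemma 3.1(3)
            y = x_hat - d_hat
            y[np.argmin(g)] += n * d_hat
            C = max(C, f(p) + g @ (y - p))                  # Lines 7-8: lower bound
            if f(p) - C <= 0.5 * mu / rho**2 * d_hat**2:    # Lines 9-11: early exit
                break
            delta = min(1.0, g @ (p - y) / (L * np.dot(p - y, p - y)))
            p = (1.0 - delta) * p + delta * y               # Line 12
        x, d, B = p, d_hat / rho, C                         # Line 14
        # Line 15: S(x_bar, d_bar) = S(x, d) ∩ S(x_hat, d_hat), cf. (3.4)
        db = (1.0 + np.minimum(d - x, d_hat - x_hat).sum()) / n
        xb = np.maximum(x - d, x_hat - d_hat) + db
    return x, B
```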

The following theorem summarizes the convergence result for this algorithm.

Theorem 3.5.

Let \{{\boldsymbol{x}}_{k}\} be the sequence generated by Alg. 3 with step size policy for \{\delta_{j}\} in (2.2)-(2.4). Then, for k\geq 1, we have

f(𝒙k)ff(𝒙k)Bkμ2n2ρ2k.f({\boldsymbol{x}}_{k})-f^{*}\leq f({\boldsymbol{x}}_{k})-B_{k}\leq\frac{\mu}{2n^{2}\rho^{2k}}.
Proof.

We first claim that 𝒙S(𝒙^k,d^k){\boldsymbol{x}}^{*}\in S(\widehat{{\boldsymbol{x}}}_{k},\widehat{d}_{k}) for any k0k\geq 0 and we prove this by induction. For k=0k=0, we have (𝒙¯0,d¯0)=(𝒙0,d0)=(𝟏n/n,1/n)(\bar{{\boldsymbol{x}}}_{0},\bar{d}_{0})=({\boldsymbol{x}}_{0},d_{0})=({\boldsymbol{1}}_{n}/n,1/n). By the definition of S(𝒙^0,d^0)S(\widehat{{\boldsymbol{x}}}_{0},\widehat{d}_{0}), we have

S(𝒙^0,d^0)=S(𝒙¯0,d¯0)Sn=S(𝟏n/n,1/n)Sn=Sn.S(\widehat{{\boldsymbol{x}}}_{0},\widehat{d}_{0})=S(\bar{{\boldsymbol{x}}}_{0},\bar{d}_{0})\cap S_{n}=S({\boldsymbol{1}}_{n}/n,1/n)\cap S_{n}=S_{n}.

Hence, 𝒙S(𝒙^0,d^0){\boldsymbol{x}}^{*}\in S(\widehat{{\boldsymbol{x}}}_{0},\widehat{d}_{0}). Now suppose that 𝒙S(𝒙^k1,d^k1){\boldsymbol{x}}^{*}\in S(\widehat{{\boldsymbol{x}}}_{k-1},\widehat{d}_{k-1}) for some k1k\geq 1. Note that the inner loop of Alg. 3 corresponds to the standard Frank-Wolfe algorithm. By Theorem 2.1 and Lemma 3.1(4), which implies that the diameter of S(𝒙^k1,d^k1)S(\widehat{{\boldsymbol{x}}}_{k-1},\widehat{d}_{k-1}) is 2nd^k1\sqrt{2}n\widehat{d}_{k-1}, we have

f(𝒑j)f2Lj+1(2nd^k1)2=4Ln2d^k12j+1f({\boldsymbol{p}}_{j})-f^{*}\leq\frac{2L}{j+1}\left(\sqrt{2}n\widehat{d}_{k-1}\right)^{2}=\frac{4Ln^{2}\widehat{d}_{k-1}^{2}}{j+1}

hold for all j[J]j\in[J]. In the case where the inner loop terminates at j=Jj=J, we obtain

f(𝒙k)f=f(𝒑J)ff(𝒑J)CJ2L8ρ2n2Lμ2n2d^k12=μ2ρ2d^k12=μ2dk2.f({\boldsymbol{x}}_{k})-f^{*}=f({\boldsymbol{p}}_{J})-f^{*}\leq f({\boldsymbol{p}}_{J})-C_{J}\leq\frac{2L}{\frac{8\rho^{2}n^{2}L}{\mu}}2n^{2}\widehat{d}_{k-1}^{2}=\frac{\mu}{2\rho^{2}}\widehat{d}_{k-1}^{2}=\frac{\mu}{2}d_{k}^{2}.

Similarly, if the inner loop is interrupted due to lines 9-11 of the algorithm, we still have

f(𝒙k)ff(𝒑j)Cjμ2ρ2d^k12=μ2dk2.f({\boldsymbol{x}}_{k})-f^{*}\leq f({\boldsymbol{p}}_{j})-C_{j}\leq\frac{\mu}{2\rho^{2}}\widehat{d}_{k-1}^{2}=\frac{\mu}{2}d_{k}^{2}.

Using the fact that f(𝒙k)fμ2𝒙k𝒙2f({\boldsymbol{x}}_{k})-f^{*}\geq\frac{\mu}{2}\lVert{\boldsymbol{x}}_{k}-{\boldsymbol{x}}^{*}\rVert^{2}, we have 𝒙k𝒙2dk2\lVert{\boldsymbol{x}}_{k}-{\boldsymbol{x}}^{*}\rVert^{2}\leq d_{k}^{2}, which implies via Lemma 3.1(5) that 𝒙S(𝒙k,dk){\boldsymbol{x}}^{*}\in S({{\boldsymbol{x}}}_{k},{d}_{k}). This implies

𝒙\displaystyle{\boldsymbol{x}}^{*} \displaystyle\in S(𝒙k,dk)S(𝒙^k1,d^k1)(as𝒙S(𝒙^k1,d^k1)by induction)\displaystyle S({{\boldsymbol{x}}}_{k},{d}_{k})\cap S(\widehat{{\boldsymbol{x}}}_{k-1},\widehat{d}_{k-1})\qquad(\mbox{as}\ {\boldsymbol{x}}^{*}\in S(\widehat{{\boldsymbol{x}}}_{k-1},\widehat{d}_{k-1})\ \mbox{by induction})
=\displaystyle= S(𝒙k,dk)S(𝒙^k1,d^k1)Sn(as𝒙Sn)\displaystyle S({{\boldsymbol{x}}}_{k},{d}_{k})\cap S(\widehat{{\boldsymbol{x}}}_{k-1},\widehat{d}_{k-1})\cap S_{n}\qquad(\mbox{as}\ {\boldsymbol{x}}^{*}\in S_{n})
=\displaystyle= S(𝒙¯k,d¯k)Sn=S(𝒙^k,d^k).\displaystyle S(\bar{{\boldsymbol{x}}}_{k},\bar{d}_{k})\cap S_{n}=S(\widehat{{\boldsymbol{x}}}_{k},\widehat{d}_{k}).

We now start to prove the conclusion in Theorem 3.5. Since d0=1nd_{0}=\frac{1}{n} and d^kdk=d^k1ρdk1ρ\widehat{d}_{k}\leq d_{k}=\frac{\widehat{d}_{k-1}}{\rho}\leq\frac{d_{k-1}}{\rho}, we have d^k1nρk\widehat{d}_{k}\leq\frac{1}{n\rho^{k}} and thus

f(𝒙k)ff(𝒙k)Bkμ2dk2μ2n2ρ2k.f({\boldsymbol{x}}_{k})-f^{*}\leq f({\boldsymbol{x}}_{k})-B_{k}\leq\frac{\mu}{2}{d}_{k}^{2}\leq\frac{\mu}{2n^{2}\rho^{2k}}.

We complete the proof. ∎

Remark 4.

(Warm-start Strategy) For rSFW and rSFWP in the upcoming Section 4, when using the simple step size, the initial steps of each inner loop may perform poorly, causing the iterate {\boldsymbol{p}}_{j} to drift far away from the optimal solution. We found that the following heuristic warm-start strategy is effective in practice. Let J_{k}\ll J denote the actual number of inner loop iterations during the k-th outer loop iteration. When initiating the (k+1)-th outer loop, instead of starting the inner loop from j=1, begin from either j=\frac{J_{k}}{\rho^{\prime}} or j=\sqrt{\frac{\bar{d}_{k}}{\bar{d}_{k-1}}}\,J_{k}. Here \rho^{\prime}>1 is a hyperparameter, with \rho^{\prime}=2 serving as a reasonable default value.

Remark 5.

(Extension to Quadratic Growth Condition) Although the two main Thms. 3.4 and 3.5 are established under the strong convexity of f(\cdot), we would like to point out that the assumption can be weakened to the quadratic growth condition:

(f(𝒙)f(𝒙))1/2cdist(𝒙,X),𝒙𝒫,(f({\boldsymbol{x}})-f({\boldsymbol{x}}^{*}))^{1/2}\geq c~\mbox{dist}({\boldsymbol{x}},X^{*}),\qquad\forall\ {\boldsymbol{x}}\in\mathcal{P},

where c>0 and X^{*} is the solution set of Problem (1.1). This includes the well-known case f({\boldsymbol{x}})=g(A{\boldsymbol{x}})+\langle{\boldsymbol{b}},{\boldsymbol{x}}\rangle with g strongly convex, A\in\mathbb{R}^{m\times n} and {\boldsymbol{b}}\in\mathbb{R}^{n}; for a detailed investigation of this class of functions with FW methods, see [1]. In this paper, we did not pursue such an extension, as our main purpose is to introduce SLMO under the strong convexity setting for the sake of simplicity.

4 Generalization to Arbitrary Polytopes

This section extends the previous results for the unit simplex to arbitrary polytopes. This generalization allows our findings to be applied to a much wider range of optimization problems. Important properties relating the standard simplex S_{n} and general polytopes have been established in [14]. Our extension heavily relies on some of those results. This section is patterned after Section 3, with some details omitted to avoid repetition. We first define the simplex ball for a general polytope and the corresponding SLMO. We then describe the Simplex Frank-Wolfe method for a general polytope, followed by its refined version.

4.1 Simplex Ball and SLMO for Arbitrary Polytopes

Consider Problem (1.1) with 𝒫=Conv(𝒱)\mathcal{P}=\mbox{Conv}(\mathcal{V}) and 𝒱={𝒗1,,𝒗N}\mathcal{V}=\{{\boldsymbol{v}}_{1},\dots,{\boldsymbol{v}}_{N}\}. Therefore, any given 𝒙𝒫{\boldsymbol{x}}\in\mathcal{P} can be represented as a convex combination of those atoms 𝒗i{\boldsymbol{v}}_{i}. However, the convex combination may not be unique. We define a set-valued mapping \mathcal{M} from 𝒫\mathcal{P} to the following set:

(𝒙):={𝝀SN|𝒙=i=1Nλi𝒗i}.\mathcal{M}({\boldsymbol{x}}):=\left\{\boldsymbol{\lambda}\in S_{N}\ \left|\ {\boldsymbol{x}}=\sum_{i=1}^{N}\lambda_{i}{\boldsymbol{v}}_{i}\right.\right\}.

Recall that for 𝝀N\boldsymbol{\lambda}\in\mathbb{R}^{N} and d>0d>0, S(𝝀,d)S(\boldsymbol{\lambda},d) is the simplex ball defined in (3.2). The idea of defining a similar Simplex ball over the polytope can be summarized as follows:

𝒙𝒫𝝀x(𝒙)S(𝝀x,d)S𝒫(𝒙,d):={V𝝀|𝝀S(𝝀x,d)},{\boldsymbol{x}}\in\mathcal{P}\ \Longrightarrow\ \boldsymbol{\lambda}_{x}\in\mathcal{M}({\boldsymbol{x}})\ \Longrightarrow\ S(\boldsymbol{\lambda}_{x},d)\ \Longrightarrow\ S_{\mathcal{P}}({\boldsymbol{x}},d):=\left\{V\boldsymbol{\lambda}\ |\ \boldsymbol{\lambda}\in S(\boldsymbol{\lambda}_{x},d)\right\}, (4.1)

where VV consists of the columns 𝒗i{\boldsymbol{v}}_{i}, i[N]i\in[N]. At first glance, the Simplex ball S𝒫(𝒙,d)S_{\mathcal{P}}({\boldsymbol{x}},d) appears to depend on the particular choice of 𝝀x(𝒙)\boldsymbol{\lambda}_{x}\in\mathcal{M}({\boldsymbol{x}}). The following result dispels this dependence.

Lemma 4.1.

Given 𝐱𝒫{\boldsymbol{x}}\in\mathcal{P} and d>0d>0, let 𝛌x,𝛌x(𝐱)\boldsymbol{\lambda}_{x},\boldsymbol{\lambda}_{x}^{\prime}\in\mathcal{M}({\boldsymbol{x}}). Then for any 𝛌S(𝛌x,d)\boldsymbol{\lambda}\in S(\boldsymbol{\lambda}_{x},d), there exists 𝛌S(𝛌x,d)\boldsymbol{\lambda}^{\prime}\in S(\boldsymbol{\lambda}_{x}^{\prime},d) such that

i=1Nλ(i)𝒗i=i=1Nλ(i)𝒗i.\sum_{i=1}^{N}\lambda(i){\boldsymbol{v}}_{i}=\sum_{i=1}^{N}\lambda^{\prime}(i){\boldsymbol{v}}_{i}.
Proof.

By definition of (𝒙)\mathcal{M}({\boldsymbol{x}}), we know

i=1Nλx(i)𝒗i=i=1Nλx(i)𝒗i.\sum_{i=1}^{N}\lambda_{x}(i){\boldsymbol{v}}_{i}=\sum_{i=1}^{N}\lambda_{x}^{\prime}(i){\boldsymbol{v}}_{i}. (4.2)

It follows from 𝝀x=𝝀x𝝀x+𝝀x\boldsymbol{\lambda}_{x}^{\prime}=\boldsymbol{\lambda}_{x}-\boldsymbol{\lambda}_{x}+\boldsymbol{\lambda}_{x}^{\prime} and the definition of Simplex ball (3.1) that

S(𝝀x,d)=S(𝝀x,d)𝝀x+𝝀x.S(\boldsymbol{\lambda}_{x}^{\prime},d)=S(\boldsymbol{\lambda}_{x},d)-\boldsymbol{\lambda}_{x}+\boldsymbol{\lambda}_{x}^{\prime}. (4.3)

Let 𝝀:=𝝀𝝀x+𝝀x\boldsymbol{\lambda}^{\prime}:=\boldsymbol{\lambda}-\boldsymbol{\lambda}_{x}+\boldsymbol{\lambda}_{x}^{\prime}. Since 𝝀S(𝝀x,d)\boldsymbol{\lambda}\in S(\boldsymbol{\lambda}_{x},d), the identity (4.3) implies 𝝀S(𝝀x,d)\boldsymbol{\lambda}^{\prime}\in S(\boldsymbol{\lambda}_{x}^{\prime},d). Moreover, we have

i=1Nλ(i)𝒗i=i=1N(λ(i)λx(i)+λx(i))𝒗i=i=1Nλ(i)𝒗i,\sum_{i=1}^{N}\lambda^{\prime}(i){\boldsymbol{v}}_{i}=\sum_{i=1}^{N}(\lambda(i)-\lambda_{x}(i)+\lambda_{x}^{\prime}(i)){\boldsymbol{v}}_{i}=\sum_{i=1}^{N}\lambda(i){\boldsymbol{v}}_{i},

where the last equality uses (4.2). This completes the proof. ∎

Lemma 4.1 ensures that the Simplex ball S𝒫(𝒙,d)S_{\mathcal{P}}({\boldsymbol{x}},d) in (4.1) is independent of the choice of 𝝀x(𝒙)\boldsymbol{\lambda}_{x}\in\mathcal{M}({\boldsymbol{x}}); hence it is well defined. Given a linear objective 𝒄n{\boldsymbol{c}}\in\mathbb{R}^{n}, we extend it to 𝒄extN{\boldsymbol{c}}_{ext}\in\mathbb{R}^{N} such that cext(i)=𝒗i,𝒄c_{ext}(i)=\langle{\boldsymbol{v}}_{i},{\boldsymbol{c}}\rangle for all i[N]i\in[N]. Consequently, the following equivalence holds:

min𝒚𝒫𝒚,𝒄=min𝝀SN𝝀,𝒄ext.\min_{{\boldsymbol{y}}\in\mathcal{P}}\langle{\boldsymbol{y}},{\boldsymbol{c}}\rangle=\min_{\boldsymbol{\lambda}\in S_{N}}\langle\boldsymbol{\lambda},{\boldsymbol{c}}_{ext}\rangle.

Leveraging this equivalence, we define the generalized SLMO for 𝒫\mathcal{P} as follows.

Definition 5 (SLMO𝒫\mbox{SLMO}_{\mathcal{P}}: SLMO over 𝒫\mathcal{P}).

Given a linear objective 𝐜n{\boldsymbol{c}}\in\mathbb{R}^{n}, radius d>0d>0, a point 𝐱𝒫{\boldsymbol{x}}\in\mathcal{P}, and its corresponding 𝛌x(𝐱)\boldsymbol{\lambda}_{x}\in\mathcal{M}({\boldsymbol{x}}), a solution 𝐲SLMO𝒫(𝐱,d,𝐜,𝛌x){\boldsymbol{y}}^{*}\in\mbox{SLMO}_{\mathcal{P}}({\boldsymbol{x}},d,{\boldsymbol{c}},\boldsymbol{\lambda}_{x}) is referred to as a generalized simplex-based linear minimization oracle if

𝒚=i=1Nλi𝒗i,{\boldsymbol{y}}^{*}=\sum_{i=1}^{N}\lambda_{i}^{*}{\boldsymbol{v}}_{i},

where 𝛌\boldsymbol{\lambda}^{*} is an optimal solution to the following optimization problem

min𝝀,𝒄ext\displaystyle\min\quad\langle\boldsymbol{\lambda},{\boldsymbol{c}}_{ext}\rangle (4.4)
s.t. 𝝀S(𝝀x,d)SN.\displaystyle\mbox{s.t. }\quad\boldsymbol{\lambda}\in S(\boldsymbol{\lambda}_{x},d)\cap S_{N}.

We note that (4.4) can be efficiently solved by Alg. 1. Consequently, SLMO𝒫\mbox{SLMO}_{\mathcal{P}} can also be efficiently solved provided an element in (𝒙)\mathcal{M}({\boldsymbol{x}}) can be cheaply obtained. The detailed steps are outlined in Alg. 4.

Algorithm 4 SLMO𝒫(𝒙,d,𝒄,𝝀x)\mbox{SLMO}_{\mathcal{P}}({\boldsymbol{x}},d,{\boldsymbol{c}},\boldsymbol{\lambda}_{x})
0:  point 𝒙𝒫{\boldsymbol{x}}\in\mathcal{P} with 𝝀x(𝒙)\boldsymbol{\lambda}_{x}\in\mathcal{M}({\boldsymbol{x}}), linear objective 𝒄n{\boldsymbol{c}}\in\mathbb{R}^{n}, radius d>0d>0.
1:  d^i=1Nmin{λx(i),d}n+1\widehat{d}\leftarrow\frac{\sum_{i=1}^{N}\min\{\lambda_{x}(i),d\}}{n+1}
2:  𝝀^𝝀xmin{𝝀x,d𝟏N}+d^𝟏N\widehat{\boldsymbol{\lambda}}\leftarrow\boldsymbol{\lambda}_{x}-\min\{\boldsymbol{\lambda}_{x},d{\boldsymbol{1}}_{N}\}+\widehat{d}{\boldsymbol{1}}_{N}
3:  𝒚+i=1N(λ^id^)𝒗i{\boldsymbol{y}}_{+}\leftarrow\sum_{i=1}^{N}(\widehat{\lambda}_{i}-\widehat{d}){\boldsymbol{v}}_{i}
4:  𝒗iargmin𝒗𝒫𝒗,𝒄{\boldsymbol{v}}_{i^{*}}\leftarrow\operatorname*{\arg\min}_{{\boldsymbol{v}}\in\mathcal{P}}\langle{\boldsymbol{v}},{\boldsymbol{c}}\rangle
5:  𝒚𝒚++(n+1)d^𝒗i{\boldsymbol{y}}^{*}\leftarrow{\boldsymbol{y}}_{+}+(n+1)\widehat{d}{\boldsymbol{v}}_{i^{*}}
Output: 𝒚{\boldsymbol{y}}^{*}.

One can observe that SLMO𝒫\mbox{SLMO}_{\mathcal{P}}-2—corresponding to Lines 4-5 in Alg. 4 and serving as the oracle in our subsequent Alg. 6—requires only one more vector addition and one extra scalar-vector multiplication compared to the standard LMO.
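For concreteness, a minimal NumPy sketch of Alg. 4 is given below; the names VV (an n×Nn\times N matrix whose columns are the vertices) and lmo_index (a routine returning i=argmini𝒗i,𝒄i^{*}=\operatorname*{\arg\min}_{i}\langle{\boldsymbol{v}}_{i},{\boldsymbol{c}}\rangle) are ours. In practice one would exploit the fact that, under the representation of Remark 6, 𝝀x\boldsymbol{\lambda}_{x} has at most n+1n+1 nonzero entries, so the dense matrix VV never needs to be formed.

```python
import numpy as np

def slmo_P(V, lam_x, c, d, lmo_index):
    """Sketch of SLMO_P (Alg. 4).  V is n-by-N with the vertices v_i as columns,
    lam_x is some representation of x in M(x), c is the linear objective,
    d > 0 is the radius, and lmo_index(c) returns argmin_i <v_i, c>."""
    n, N = V.shape
    clipped = np.minimum(lam_x, d)                    # min{lambda_x, d 1_N}
    d_hat = clipped.sum() / (n + 1)                   # Line 1 of Alg. 4
    lam_hat = lam_x - clipped + d_hat                 # Line 2
    y_plus = V @ (lam_hat - d_hat)                    # Line 3
    # Lines 4-5 form the SLMO_P-2 phase, the only part repeated in Alg. 6:
    i_star = lmo_index(c)                             # Line 4: standard LMO over P
    return y_plus + (n + 1) * d_hat * V[:, i_star]    # Line 5
```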

We summarize the optimality of Alg. 4 in the following result, which is a direct consequence of Lemma 3.3.

Lemma 4.2.

Algorithm SLMO𝒫(𝐱,d,𝐜,𝛌x)\mbox{SLMO}_{\mathcal{P}}({\boldsymbol{x}},d,{\boldsymbol{c}},\boldsymbol{\lambda}_{x}) returns an optimal solution 𝐲{\boldsymbol{y}}^{*} for the problem:

𝒚argmin{𝒄,𝒚|𝒚S𝒫(𝒙,d)𝒫}.{\boldsymbol{y}}^{*}\in\operatorname*{\arg\min}\left\{\langle{\boldsymbol{c}},{\boldsymbol{y}}\rangle\ |\ {\boldsymbol{y}}\in S_{\mathcal{P}}({\boldsymbol{x}},d)\cap\mathcal{P}\right\}.
Remark 6.

(Carathéodory’s Representation Assumption) By Carathéodory’s
Representation Theorem [28, Thm. 17.1], for any point 𝒙𝒫{\boldsymbol{x}}\in\mathcal{P}, there exists 𝝀x(𝒙)\boldsymbol{\lambda}_{x}\in\mathcal{M}({\boldsymbol{x}}) such that |+(𝝀x)|n+1|\mathcal{I}_{+}(\boldsymbol{\lambda}_{x})|\leq n+1 where +(𝝀x):={i[N]λx(i)>0}\mathcal{I}_{+}(\boldsymbol{\lambda}_{x}):=\{i\in[N]\mid\lambda_{x}(i)>0\}. As demonstrated in the illustrative examples in Supplement C.1, this representation can be easily implemented for common types of 𝒫\mathcal{P}. In this representation, the running time of SLMO𝒫\mbox{SLMO}_{\mathcal{P}} does not explicitly depend on the number of vertices NN, but rather on the natural dimension of 𝒫\mathcal{P}, that is, nn. For the analysis in the subsequent sections, we assume without loss of generality that the selected 𝝀x\boldsymbol{\lambda}_{x} always satisfies |+(𝝀x)|n+1|\mathcal{I}_{+}(\boldsymbol{\lambda}_{x})|\leq n+1.
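As an illustration, the representation is immediate for the unit simplex (take 𝝀x=𝒙\boldsymbol{\lambda}_{x}={\boldsymbol{x}}), and one possible construction for the unit 1\ell_{1}-ball, whose 2n2n vertices are ±𝒆i\pm{\boldsymbol{e}}_{i}, is sketched below. This construction is ours and need not coincide with the one in Supplement C.1.

```python
import numpy as np

def caratheodory_l1_ball(x):
    """One possible sparse convex-combination representation of x in the unit
    l1-ball over its 2n vertices {+e_i, -e_i}.  Coordinate i of lam weights +e_i
    and coordinate n+i weights -e_i; the support has at most n+1 nonzeros."""
    n = x.size
    assert np.abs(x).sum() <= 1 + 1e-12
    lam = np.zeros(2 * n)
    lam[:n] = np.maximum(x, 0.0)        # weight on +e_i
    lam[n:] = np.maximum(-x, 0.0)       # weight on -e_i
    slack = 1.0 - np.abs(x).sum()       # remaining mass if x is in the interior
    lam[0] += slack / 2.0               # split the slack between +e_1 and -e_1
    lam[n] += slack / 2.0               # (these contributions cancel in V @ lam)
    return lam                          # lam >= 0, lam.sum() == 1, V @ lam == x
```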

The following lemma demonstrates some useful properties of S𝒫S_{\mathcal{P}} and SLMO𝒫\mbox{SLMO}_{\mathcal{P}}, which can be regarded as a generalization of Lemma 3.1(5) and is crucial for proving the convergence of our algorithm in the next subsection. The detailed proof can be found in Supplement B.

Lemma 4.3.

Given 𝐱𝒫{\boldsymbol{x}}\in\mathcal{P}, d>0d>0 and 𝐲SLMO𝒫(𝐱,d,𝐜,𝛌x){\boldsymbol{y}}^{*}\in\mbox{SLMO}_{\mathcal{P}}({\boldsymbol{x}},d,{\boldsymbol{c}},\boldsymbol{\lambda}_{x}), for any point 𝐲𝒫{\boldsymbol{y}}\in\mathcal{P} satisfying 𝐱𝐲dDη\|{\boldsymbol{x}}-{\boldsymbol{y}}\|\leq\frac{dD}{\eta}, it follows that 𝐲S𝒫(𝐱,d){\boldsymbol{y}}\in S_{\mathcal{P}}({\boldsymbol{x}},d) and 𝐜,𝐲𝐜,𝐲\langle{\boldsymbol{c}},{\boldsymbol{y}}^{*}\rangle\leq\langle{\boldsymbol{c}},{\boldsymbol{y}}\rangle. Furthermore, we have 𝐱𝐲(n+1)dD\|{\boldsymbol{x}}-{\boldsymbol{y}}^{*}\|\leq(n+1)dD.

We would also like to note that, although similar to LLOO, SLMO𝒫(𝒙,d,𝒄,𝝀x)\mbox{SLMO}_{\mathcal{P}}({\boldsymbol{x}},d,{\boldsymbol{c}},\boldsymbol{\lambda}_{x}) is not exactly an LLOO, because Lemma 4.3 only proves that 𝔹(𝒙,(D/η)d)S𝒫(𝒙,d)\mathbb{B}({\boldsymbol{x}},(D/\eta)d)\subseteq S_{\mathcal{P}}({\boldsymbol{x}},d), rather than 𝔹(𝒙,d)S𝒫(𝒙,d)\mathbb{B}({\boldsymbol{x}},d)\subseteq S_{\mathcal{P}}({\boldsymbol{x}},d), which would be sufficient for the first property of LLOO. We note that D/η=ξ/ψD/\eta=\xi/\psi; therefore, the condition ξ/ψ1\xi/\psi\geq 1 would be enough for SLMO𝒫(𝒙,d,𝒄,𝝀x)\mbox{SLMO}_{\mathcal{P}}({\boldsymbol{x}},d,{\boldsymbol{c}},\boldsymbol{\lambda}_{x}) to be an LLOO.

4.2 SFW𝒫\mbox{SFW}_{\mathcal{P}}: Simplex Frank-Wolfe for Arbitrary Polytopes

In this subsection, we extend the SFW to the polytope case. The generalized SFW algorithm is presented as follows.

Algorithm 5 SFW𝒫\mbox{SFW}_{\mathcal{P}}: Simplex Frank-Wolfe Method for Polytope 𝒫\mathcal{P}
0:  𝒙0𝒫{\boldsymbol{x}}_{0}\in\mathcal{P}, initial lower bound B0B_{0}, condition number η\eta and diameter DD of 𝒫\mathcal{P}.
1:  Set: d02(f(𝒙0)B0)μ,𝝀0(𝒙0).d_{0}\leftarrow\sqrt{\frac{2(f({\boldsymbol{x}}_{0})-B_{0})}{\mu}},\boldsymbol{\lambda}_{0}\in\mathcal{M}({\boldsymbol{x}}_{0}).
2:  for k=1,k=1,\dots do
3:   Compute 𝒚kSLMO𝒫(𝒙k1,ηDdk1,f(𝒙k1),𝝀k1){\boldsymbol{y}}_{k}\in\mbox{SLMO}_{\mathcal{P}}({{\boldsymbol{x}}}_{k-1},\frac{\eta}{D}{d}_{k-1},\nabla f({\boldsymbol{x}}_{k-1}),\boldsymbol{\lambda}_{k-1}).
4:   Compute the working lower bound: Bkwf(𝒙k1)+f(𝒙k1),𝒚k𝒙k1B_{k}^{w}\leftarrow f({\boldsymbol{x}}_{k-1})+\langle\nabla f({\boldsymbol{x}}_{k-1}),{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rangle.
5:   Update best bound Bkmax{Bk1,Bkw}B_{k}\leftarrow\max\{B_{k-1},B_{k}^{w}\}.
6:   Set 𝒙k(1δk)𝒙k1+δk𝒚k{\boldsymbol{x}}_{k}\leftarrow(1-\delta_{k}){\boldsymbol{x}}_{k-1}+\delta_{k}{\boldsymbol{y}}_{k} for some δk[0,1]\delta_{k}\in[0,1].
7:   Set: dk2(f(𝒙k)Bk)μ,𝝀k(𝒙k)d_{k}\leftarrow\sqrt{\frac{2(f({\boldsymbol{x}}_{k})-B_{k})}{\mu}},\boldsymbol{\lambda}_{k}\in\mathcal{M}({\boldsymbol{x}}_{k}).
8:  end for
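Before discussing its convergence, we give a compact Python sketch of this outer loop. Here slmo is an abstract callable following the argument list of SLMO𝒫(𝒙,d,𝒄,𝝀x)\mbox{SLMO}_{\mathcal{P}}({\boldsymbol{x}},d,{\boldsymbol{c}},\boldsymbol{\lambda}_{x}), member_rep returns some element of (𝒙)\mathcal{M}({\boldsymbol{x}}), and the exact line search is replaced by a coarse grid search purely for illustration; all names are ours.

```python
import numpy as np

def sfw_P(f, grad_f, slmo, member_rep, x0, B0, mu, eta, D, iters=100):
    """Sketch of SFW_P (Alg. 5); slmo(x, d, c, lam) mirrors SLMO_P(x, d, c, lambda_x)."""
    x, B = x0.copy(), B0
    d = np.sqrt(max(2.0 * (f(x) - B) / mu, 0.0))     # Line 1
    lam = member_rep(x)                              # some element of M(x)
    for _ in range(iters):
        g = grad_f(x)
        y = slmo(x, (eta / D) * d, g, lam)           # Line 3: SLMO over the simplex ball
        B = max(B, f(x) + g @ (y - x))               # Lines 4-5: update the lower bound
        # Line 6: step size; a coarse grid stands in for exact line search here.
        grid = np.linspace(0.0, 1.0, 101)
        delta = min(grid, key=lambda t: f((1 - t) * x + t * y))
        x = (1 - delta) * x + delta * y
        d = np.sqrt(max(2.0 * (f(x) - B) / mu, 0.0)) # Line 7
        lam = member_rep(x)
    return x, B
```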

Notice that when 𝒫\mathcal{P} degenerates to SnS_{n}, Alg. 5 differs only slightly from its simplex counterpart at line 7, since η=D=2\eta=D=\sqrt{2} for SnS_{n}. The convergence of Alg. 5 stated below is proved in Supplement B.

Theorem 4.4.

Let {𝐱k}\{{\boldsymbol{x}}_{k}\} be the sequence generated by Alg. 5 with the step size policy for {δk}\{\delta_{k}\} given in (2.3), (2.4), or the simple step size

δk=μ2L(n+1)2η2.\delta_{k}=\frac{\mu}{2L(n+1)^{2}\eta^{2}}. (4.5)

Then, for k0k\geq 0, we have

f(𝒙k)ff(𝒙k)Bkμd022eμ4Lη2(n+1)2k.f({\boldsymbol{x}}_{k})-f^{*}\leq f({\boldsymbol{x}}_{k})-B_{k}\leq\frac{\mu d_{0}^{2}}{2}e^{-\frac{\mu}{4L\eta^{2}(n+1)^{2}}k}. (4.6)

4.3 rSFW𝒫\mbox{rSFW}_{\mathcal{P}}: Refining SFW𝒫\mbox{SFW}_{\mathcal{P}}

As with the motivation for rSFW for the Simplex case, once we constructed the Simplex ball for 𝒫\mathcal{P}, we may compute an approximate solution:

{\boldsymbol{p}}_{k}\approx\operatorname*{\arg\min}\;f({\boldsymbol{p}})\quad\mbox{s.t.}\quad{\boldsymbol{p}}\in S_{\mathcal{P}}({\boldsymbol{x}}_{k-1},d_{k-1})\cap\mathcal{P},

with the initial point 𝒑0=𝒙k1{\boldsymbol{p}}_{0}={\boldsymbol{x}}_{k-1}. The motivation is based on a similar observation: SFW𝒫\mbox{SFW}_{\mathcal{P}} can be split into two independent parts, with the first part, constructing the Simplex ball, accounting for the major computation. Hence, once such a ball is constructed, we run a few more cheap SLMO steps over it. Once again, other methods such as Away-step FW and Pairwise FW can be used for computing 𝒑k{\boldsymbol{p}}_{k}. The generalized rSFW algorithm is presented as follows.

Algorithm 6 rSFW𝒫\mbox{rSFW}_{\mathcal{P}}: Refined Simplex Frank-Wolfe Method for Polytope 𝒫\mathcal{P}
0:  𝒙0𝒫{\boldsymbol{x}}_{0}\in\mathcal{P}, 𝝀0(𝒙0)\boldsymbol{\lambda}_{0}\in\mathcal{M}({\boldsymbol{x}}_{0}), radius contraction ratio ρ>1\rho>1, initial lower bound B0B_{0}, condition number η\eta and diameter DD of 𝒫\mathcal{P}.
1:  Set: d0ηD2(f(𝒙0)B0)μ,J4ρ2(n+1)2η2Lμd_{0}\leftarrow\frac{\eta}{D}\sqrt{\frac{2(f({\boldsymbol{x}}_{0})-B_{0})}{\mu}},J\leftarrow\frac{4\rho^{2}(n+1)^{2}\eta^{2}L}{\mu}.
2:  for k=1,k=1,\dots do
3:   Set: 𝒑0𝒙k1,C0Bk1{\boldsymbol{p}}_{0}\leftarrow{\boldsymbol{x}}_{k-1},C_{0}\leftarrow B_{k-1}.
4:   (SLMO𝒫\mbox{SLMO}_{\mathcal{P}}-1) Compute 𝝀^k1\widehat{\boldsymbol{\lambda}}_{k-1} and d^k1\widehat{d}_{k-1} such that S(𝝀^k1,d^k1)=S(𝝀k1,dk1)SNS(\widehat{\boldsymbol{\lambda}}_{k-1},\widehat{d}_{k-1})=S({{\boldsymbol{\lambda}}}_{k-1},{d}_{k-1})\cap S_{N}.
5:   for j=1,,Jj=1,\dots,J do
6:    (SLMO𝒫\mbox{SLMO}_{\mathcal{P}}-2) Compute 𝒚jSLMO𝒫(𝒙k1,dk1,f(𝒑j1),𝝀k1){\boldsymbol{y}}_{j}\in\mbox{SLMO}_{\mathcal{P}}({\boldsymbol{x}}_{k-1},d_{k-1},\nabla f({\boldsymbol{p}}_{j-1}),\boldsymbol{\lambda}_{k-1}).
7:    Set: Cjwf(𝒑j1)+f(𝒑j1),𝒚j𝒑j1C_{j}^{w}\leftarrow f({\boldsymbol{p}}_{j-1})+\langle\nabla f({\boldsymbol{p}}_{j-1}),{\boldsymbol{y}}_{j}-{\boldsymbol{p}}_{j-1}\rangle.
8:    Update best bound Cjmax{Cj1,Cjw}C_{j}\leftarrow\max\{C_{j-1},C_{j}^{w}\}.
9:    if f(𝒑j)Cjμ2ρ2η2dk12D2f({\boldsymbol{p}}_{j})-C_{j}\leq\frac{\mu}{2\rho^{2}\eta^{2}}{d}_{k-1}^{2}D^{2} then
10:     Break out of the inner loop.
11:    end if
12:    Set 𝒑j(1δj)𝒑j1+δj𝒚j{\boldsymbol{p}}_{j}\leftarrow(1-\delta_{j}){\boldsymbol{p}}_{j-1}+\delta_{j}{\boldsymbol{y}}_{j} for some δj[0,1]\delta_{j}\in[0,1].
13:   end for
14:   Set: 𝒙k𝒑j{\boldsymbol{x}}_{k}\leftarrow{\boldsymbol{p}}_{j}, dkdk1ρd_{k}\leftarrow\frac{{d}_{k-1}}{\rho}, BkCjB_{k}\leftarrow C_{j} and 𝝀k(𝒙k)\boldsymbol{\lambda}_{k}\in\mathcal{M}({\boldsymbol{x}}_{k}).
15:  end for

Similar to Theorem 3.5, we can provide the following convergence analysis for Alg. 6, which is proven in Supplement B.

Theorem 4.5.

Let {𝐱k}\{{\boldsymbol{x}}_{k}\} be the sequence generated by Alg. 6 with the step size policy for {δj}\{\delta_{j}\} in (2.2)-(2.4). Then, for k1k\geq 1, we have

f(𝒙k)ff(𝒙k)Bk(f(𝒙0)B0)ρ2k.f({\boldsymbol{x}}_{k})-f^{*}\leq f({\boldsymbol{x}}_{k})-B_{k}\leq(f({\boldsymbol{x}}_{0})-B_{0})\rho^{-2k}.
Remark 7.

(Adaptive Lower Bound Update) We estimate the lower bound of ff^{*} by f(𝒙k1)+f(𝒙k1),𝒚k𝒙k1f({\boldsymbol{x}}_{k-1})+\langle\nabla f({\boldsymbol{x}}_{k-1}),{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rangle for the Simplex Frank-Wolfe method and its refined version. In fact, when the objective function exhibits specific structural properties, we can derive an additional lower bound BkoB_{k}^{o} and update the best bound BkB_{k} as Bkmax{Bk1,Bkw,Bko}B_{k}\leftarrow\max\{B_{k-1},B_{k}^{w},B_{k}^{o}\}. For instance, when the objective function has a min-max structure, we can construct a min-max lower bound BkoB_{k}^{o} for ff^{*}; see [11] for a detailed analysis. Moreover, in certain application scenarios, there may be exact information about the optimal value ff^{*}, such as in linear regression or machine learning tasks where it is known a priori that the optimal value of the loss function is 0. In such cases, it is straightforward to set BkofB_{k}^{o}\leftarrow f^{*}.

Remark 8.

(Robustness to Parameter Estimation) Our algorithms rely on parameters L,μ,η,DL,\mu,\eta,D. In practice, using overestimates L,η,DL^{\prime},\eta^{\prime},D^{\prime} and an underestimate μ\mu^{\prime} such that LηDμLηDμ=O(1)\frac{L^{\prime}\eta^{\prime}D^{\prime}\mu}{L\eta D\mu^{\prime}}=O(1) only increases the bounds by a constant factor. Moreover, as shown in Supplement C.2, both η\eta and DD can be efficiently estimated for common 𝒫\mathcal{P}. For LL and μ\mu, one can use the backtracking strategy from [26] to estimate their local values and compute adaptive short step sizes; see Subsection 5.4 for details.

5 Numerical Experiments

In this section, we present numerical experiments to evaluate the efficiency, convergence, and adaptability of the proposed methods. All tests were performed using MATLAB R2022b on a Windows laptop equipped with a 14-core Intel(R) Core(TM) 2.30GHz CPU and 16GB of RAM.

We address four tasks. (T1) We first assess the computational efficiency of SLMO and SLMO-2 across four representative polytopes, consolidating their role as the workhorse of our SFW methods. (T2) We illustrate the linear convergence behavior of SFW and rSFW using two numerical experiments. (T3) We show that our methods can be enhanced with a backtracking strategy to eliminate the need for predefined values of the parameters LL and μ\mu. (T4) We demonstrate how integrating the away-step variants of the Frank-Wolfe method (AFW and PFW) into the rSFW framework significantly enhances its performance, outperforming the original AFW and PFW methods. These four tasks are addressed in the following four subsections.

5.1 Efficiency of SLMO and SLMO-2

In this subsection, we evaluate the performance of the proposed SLMO and SLMO-2 through comparative experiments on four common polytopes 𝒫\mathcal{P}: (a) Unit simplex; (b) Hypercube; (c) 1\ell_{1}-ball; and (d) Flow polytope, derived from the video co-localization problem in [21].

Table 1: Description of projection and five LMO variants used in the numerical comparison. These six methods share the same randomly generated parameters: 𝒄𝒩(𝟎,In),𝒙𝒫{\boldsymbol{c}}\sim\mathcal{N}({\boldsymbol{0}},I_{n}),{\boldsymbol{x}}\in\mathcal{P} and d𝒰[0,1]d\in\mathcal{U}_{[0,1]}.
Algorithm Formulation Description
Projection argmin𝒚𝒫𝒚𝒛2\operatorname*{\arg\min}_{{\boldsymbol{y}}\in\mathcal{P}}\|{\boldsymbol{y}}-{\boldsymbol{z}}\|^{2} The projection onto the polytope 𝒫\mathcal{P}, and 𝒛𝒩(𝟎,In){\boldsymbol{z}}\sim\mathcal{N}({\boldsymbol{0}},I_{n}) is a randomly generated point. We implement projections onto the Simplex and 1\ell_{1}-ball using the method from [8, Fig. 2], while the projection onto the hypercube is straightforward. Although a closed-form solution exists for projection onto the flow polytope [29, Thm. 20], its computational complexity of O(m3n+n2)O(m^{3}n+n^{2}) makes it significantly more expensive than other LMO variants.
LMO argmin𝒚𝒫𝒚,𝒄\operatorname*{\arg\min}_{{\boldsymbol{y}}\in\mathcal{P}}\langle{\boldsymbol{y}},{\boldsymbol{c}}\rangle The standard linear minimization oracle.
1\ell_{1}-LMO \operatorname*{\arg\min}_{{\boldsymbol{y}}\in\{V\boldsymbol{\lambda}\,\mid\,\boldsymbol{\lambda}\in B_{1}(\boldsymbol{\lambda}_{x},d)\cap S_{N}\}}\langle{\boldsymbol{y}},{\boldsymbol{c}}\rangle The 1\ell_{1}-norm constrained LMO, Alg. 3 and Alg. 4 in [14]. Here, VV consists of columns of 𝒗𝒱(𝒫){\boldsymbol{v}}\in\mathcal{V}(\mathcal{P}), and the computation of 𝝀x(𝒙)\boldsymbol{\lambda}_{x}\in\mathcal{M}({\boldsymbol{x}}) is included in the timing.
NEP \operatorname*{\arg\min}_{{\boldsymbol{y}}\in\mathcal{V}(\mathcal{P})}\langle{\boldsymbol{y}},{\boldsymbol{c}}\rangle+\lambda\lVert{\boldsymbol{y}}-{\boldsymbol{x}}\rVert^{2} Nearest extreme point oracle in [15]. Here, λ𝒰[0,10000]\lambda\sim\mathcal{U}_{[0,10000]} is a randomly generated positive number.
SLMOP \operatorname*{\arg\min}_{{\boldsymbol{y}}\in\{V\boldsymbol{\lambda}\,\mid\,\boldsymbol{\lambda}\in S(\boldsymbol{\lambda}_{x},d)\cap S_{N}\}}\langle{\boldsymbol{y}},{\boldsymbol{c}}\rangle Our proposed Simplex Linear Minimization Oracle (Alg. 1 and Alg. 4). Here, the computation of 𝝀x(𝒙)\boldsymbol{\lambda}_{x}\in\mathcal{M}({\boldsymbol{x}}) is included in the timing.
SLMOP-2 \operatorname*{\arg\min}_{{\boldsymbol{y}}\in\{V\boldsymbol{\lambda}\,\mid\,\boldsymbol{\lambda}\in S(\widehat{\boldsymbol{\lambda}}_{x},\widehat{d})\}}\langle{\boldsymbol{y}},{\boldsymbol{c}}\rangle The latter phase of SLMO, consisting of Lines 4-5 of Alg. 1 and Alg. 4. Here, S(𝝀^x,d^)=S(𝝀x,d)SNS(\widehat{\boldsymbol{\lambda}}_{x},\widehat{d})=S(\boldsymbol{\lambda}_{x},d)\cap S_{N} is precomputed and not included in the timing.
[Figure 2: four panels (a)-(d)]
Figure 2: Comparison of solving Projection, LMO, 1\ell_{1}-LMO, NEP, SLMO, and SLMO-2 over the following polytopes: (a) Unit Simplex; (b) Unit Hypercube; (c) Unit 1\ell_{1}-ball; and (d) Flow polytope. All the results are averaged over 20 i.i.d runs. We omit the projection onto the flow polytope due to its prohibitively high computational cost.

We consider six different methods, including projection (a key subproblem in projection/proximal based methods) and five variants of LMO, as detailed in Table 1. These six methods share the same randomly generated 𝒄𝒩(𝟎,In),𝒙𝒫{\boldsymbol{c}}\sim\mathcal{N}({\boldsymbol{0}},I_{n}),{\boldsymbol{x}}\in\mathcal{P} and d𝒰[0,1]d\in\mathcal{U}_{[0,1]}. Figure 2 illustrates the relationship between running time and dimensionality for the six methods. Additionally, Table 2 reports the proportion of time spent on LMO calls within the SLMO-2 algorithm. We draw the following observations from these results:

  • 1\ell_{1}-LMO: Across all four polytopes, the 1\ell_{1}-LMO method incurs the highest computational overhead, surpassing even that of the projection-based methods.

  • SLMOP-2 vs. SLMOP: The overhead of SLMO-2 is significantly lower than that of SLMO. In most cases, as shown in Table 2, its overhead closely matches that of the LMO itself. This indicates that our proposed rSFW and rSFWP achieve a per-iteration cost comparable to that of the standard Frank-Wolfe algorithm.

  • NEP: While NEP demonstrates very low runtime overhead, it is important to note that the corresponding Frank-Wolfe variant, NEP-FW, converges only sublinearly as shown in Subsection 5.2.

  • 1\ell_{1}-LMO and SLMOP on the Flow Polytope: Both methods exhibit rising overhead with increasing dimension, mainly due to the cost of computing the Carathéodory representation 𝝀x(𝒙)\boldsymbol{\lambda}_{x}\in\mathcal{M}({\boldsymbol{x}}), which dominates the runtime.

Table 2: Time overhead of LMO calls as a percentage of total computation time when using the SLMO-2 algorithm across four different polytopes.
Simplex Hypercube 1\ell_{1}-ball Flow polytope
(n=107n=10^{7}) (n=107n=10^{7}) (n=107n=10^{7}) (n=3×104n=3\times 10^{4})
Time(LMO)Time(SLMO-2)\frac{\mbox{Time(LMO)}}{\mbox{Time(SLMO-2)}} 98.8%98.8\% 46.0%46.0\% 52.6%52.6\% 99.7%99.7\%

5.2 Linear Convergence of SFW and rSFW

Table 3: Description of Frank-Wolfe variants used in the numerical comparison.
Algorithm Description
FW (simple/Ada) Frank-Wolfe with simple step size δk=2/(k+1)\delta_{k}=2/(k+1). The ‘Ada’ variant employs a backtracking step (5.1) prior to updating the iterate, in order to estimate the local parameters LL and μ\mu, thereby enabling an adaptive short step size. We set τ1=2\tau_{1}=2 and τ2=0.9\tau_{2}=0.9 in Alg. 7, and apply the same configuration to the subsequent ‘Ada’ variants.
SFW/SFWP (line-search) Simplex Frank-Wolfe with exact line-search (Alg. 2 and Alg. 5). For the 1\ell_{1}-constrained least squares problem, we set μ=2λmin(AA)\mu=2\lambda_{min}(A^{\prime}A), D=2D=2 and η=n\eta=\sqrt{n}. Although Supplement C.2 estimates ηn\eta\leq n for the 1\ell_{1}-ball, this setting does not hinder the algorithm's linear convergence and demonstrates strong practical performance. For the video co-localization task, we set μ=λmin(A)\mu=\lambda_{min}(A) and η=D=66\eta=D=\sqrt{66}; see Supplement C.2 for details. For the Simplex-constrained least squares problem, we set μ=2λmin(AA)\mu=2\lambda_{min}(A^{\prime}A).
NEP-FW (simple) Frank-Wolfe with Nearest Extreme Point Oracle, with theoretical step size 2/(k+1)2/(k+1) [15, Alg. 1]. We omitted Line 5, as it showed no noticeable effect on performance.
rSFW/rSFWP (simple) Refined Simplex Frank-Wolfe with simple step size δj=2/(j+1)\delta_{j}=2/(j+1) (Alg. 3 and Alg. 6). We utilize the warm-start strategy with ρ=2\rho^{\prime}=2 as mentioned in Remark 4. The parameters μ,D\mu,D and η\eta are set the same as in SFW/SFWP. Additionally, we set ρ=1.01,L=2λmax(AA)\rho=1.01,L=2\lambda_{max}(A^{\prime}A) for the 1\ell_{1}/Simplex-constrained least squares problems, and ρ=1.01,L=λmax(A)\rho=1.01,L=\lambda_{max}(A) for the video co-localization problem.
PFW (line-search) Pairwise Frank-Wolfe with exact line-search [22, Alg. 2].
AFW (line-search) Away-steps Frank-Wolfe with exact line-search [22, Alg. 1].
rSFW-P (line-search) The Refined Simplex Frank-Wolfe framework enhanced with Pairwise technique. Specifically, we incorporate (5.3) after Line 6 in Alg. 3 and replace Line 12 with (5.4).
rSFW-A (line-search) The Refined Simplex Frank-Wolfe framework enhanced with Away-steps technique. Specifically, we incorporate (5.2) after Line 6 in Alg. 3 and replace Line 12 with (5.4).

We demonstrate the linear convergence of our proposed methods, SFW and rSFW, through two numerical experiments. These methods are compared against the standard Frank-Wolfe (FW) algorithm, its variant NEP-FW (as the code for NEP-FW is not publicly available, we implemented it ourselves), and two well-known variants, Away-step FW (AFW) and Pairwise FW (PFW), whose implementations are available at https://github.com/Simon-Lacoste-Julien/linearFW; all methods are summarized in Table 3.

To evaluate algorithmic performance, we adopt the Frank-Wolfe gap defined by f(𝒙k),𝒙k𝒚k+1,\langle\nabla f({\boldsymbol{x}}_{k}),{\boldsymbol{x}}_{k}-{\boldsymbol{y}}_{k+1}\rangle, where 𝒙k{\boldsymbol{x}}_{k} is the kk-th iterate and 𝒚k+1{\boldsymbol{y}}_{k+1} denotes the solution returned by the respective LMO variant at that iteration. This FW gap provides a valid upper bound on the primal gap, i.e., f(𝒙k)ff(𝒙k),𝒙k𝒚k+1f({\boldsymbol{x}}_{k})-f^{*}\leq\langle\nabla f({\boldsymbol{x}}_{k}),{\boldsymbol{x}}_{k}-{\boldsymbol{y}}_{k+1}\rangle, and can thus be used as a practical stopping criterion. (For NEP-FW, however, this inequality does not hold in general for 𝒚k+1=NEP(𝒙k)=LMO(f(𝒙k)λk𝒙k,𝒫){\boldsymbol{y}}_{k+1}=\text{NEP}({\boldsymbol{x}}_{k})=\text{LMO}(\nabla f({\boldsymbol{x}}_{k})-\lambda_{k}{\boldsymbol{x}}_{k},\mathcal{P}). Therefore, to ensure a consistent and fair comparison, we use the standard FW gap to evaluate NEP-FW as well.)
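In code, this stopping test amounts to a single inner product per iteration; the snippet below (names ours) illustrates it. The gap can then be compared against a prescribed tolerance to decide when to stop.

```python
def fw_gap(grad_x, x, y_next):
    """Frank-Wolfe gap <grad f(x_k), x_k - y_{k+1}>, an upper bound on f(x_k) - f*."""
    return grad_x @ (x - y_next)
```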

[Figure 3: four panels (a)-(d)]
Figure 3: FW gap vs time/iterations.

The first experiment involves an 1\ell_{1}-constrained least squares regression, that is, min𝒙11A𝒙𝒃22\min_{\|{\boldsymbol{x}}\|_{1}\leq 1}\lVert A{\boldsymbol{x}}-{\boldsymbol{b}}\rVert_{2}^{2}, where Am×nA\in\mathbb{R}^{m\times n} with m=400,n=100m=400,n=100, and the entries of AA are drawn from a standard Gaussian distribution. We set 𝒃=A𝒙{\boldsymbol{b}}=A{\boldsymbol{x}}^{*}, where 𝒙{\boldsymbol{x}}^{*} is constructed by first generating a random vector with sparsity parameter s=0.7s=0.7, followed by normalization to lie on the boundary of the 1\ell_{1}-ball. Thus, the optimal value of this problem is 0. We use the same initial point 𝒙0=𝟎n{\boldsymbol{x}}_{0}={\boldsymbol{0}}_{n} for all methods.
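A sketch of this data generation in Python with NumPy is given below; interpreting the sparsity parameter s=0.7s=0.7 as the expected fraction of nonzero entries of 𝒙{\boldsymbol{x}}^{*} is our own reading of the description above, and the routine name is ours.

```python
import numpy as np

def make_l1_instance(m=400, n=100, s=0.7, seed=0):
    """Generate A, b for min_{||x||_1 <= 1} ||Ax - b||_2^2 with optimal value 0."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n))
    x_star = rng.standard_normal(n)
    x_star *= (rng.random(n) < s)     # keep roughly a fraction s of the entries
    x_star /= np.abs(x_star).sum()    # normalize onto the boundary of the l1-ball
    b = A @ x_star
    return A, b, x_star
```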

The second experiment involves a convex quadratic problem over the flow polytope, derived from the video co-localization task introduced by [21]. The problem is formulated as \min_{{\boldsymbol{x}}\in\mathcal{F}_{s,t}}\frac{1}{2}{\boldsymbol{x}}^{\prime}A{\boldsymbol{x}}+{\boldsymbol{b}}^{\prime}{\boldsymbol{x}}, where An×nA\in\mathbb{R}^{n\times n} is a positive definite matrix, 𝒃n{\boldsymbol{b}}\in\mathbb{R}^{n}, and s,t\mathcal{F}_{s,t} represents the s-t flow polytope. We used the same dataset and initial point as in [22, 15]. The problem has a dimension of n=660n=660.

The results are presented in Figure 3. We make some comments below.

  • Linear convergence of SFWP and rSFWP: Both \mbox{SFW}_{\mathcal{P}} and \mbox{rSFW}_{\mathcal{P}} show linear convergence, confirming our theoretical guarantees.

  • Superior efficiency of rSFWP: In both experiments, our proposed \mbox{rSFW}_{\mathcal{P}} method significantly outperforms all other algorithms, including the well-established AFW and PFW, in terms of running time.

  • Limitations of NEP-FW: Although NEP performs well in iteration complexity, its NEP-FW variant converges sublinearly and is slightly slower than standard FW.

  • Time inefficiency in video co-localization: In the video co-localization task, while \mbox{SFW}_{\mathcal{P}}, AFW, and PFW converge quickly by iteration count, their runtime is slower due to overhead: the Carathéodory computation for \mbox{SFW}_{\mathcal{P}}, and the growing active sets for AFW and PFW.

5.3 SFW/rSFW with Backtracking

We further show that our methods can be enhanced with the backtracking technique proposed by [26], thereby eliminating the need to manually specify the parameters LL and μ\mu. While [26, Alg. 2] does not specify how to estimate the strong convexity constant μ\mu, we outline our approach to estimating both LL and μ\mu as well as determining an adaptive step size; see Supplement D for the detailed algorithm. As an example, consider the kk-th iteration of SFWP. Before updating 𝒙k{\boldsymbol{x}}_{k}, we perform the following backtracking step:

δk,Lk,μkBacktracking-Routine(𝒙k1,𝒚k𝒙k1,Lk1,μk1,1).\delta_{k},\ L_{k},\ \mu_{k}\leftarrow\text{Backtracking-Routine}({\boldsymbol{x}}_{k-1},{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1},L_{k-1},\mu_{k-1},1). (5.1)
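The detailed routine is deferred to Supplement D. As a rough illustration only, the sketch below shows, in the spirit of the backtracking test of [26], how a local estimate of LL and a short step could be obtained along the direction 𝒚k𝒙k1{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}; the routine name is ours, and the update of μ\mu is left as a placeholder since its estimation is specified only in Supplement D.

```python
def backtracking_step(f, grad_x, x, direction, L_prev, mu_prev,
                      delta_max=1.0, tau1=2.0, tau2=0.9):
    """Sketch of a backtracking step: estimate a local Lipschitz constant L and a
    short step size delta along `direction`, in the spirit of [26]."""
    d_norm2 = direction @ direction
    if d_norm2 == 0.0:
        return 0.0, L_prev, mu_prev
    L = tau2 * L_prev                              # first try an optimistic decrease of L
    g_dot_d = grad_x @ direction
    fx = f(x)
    delta = min(delta_max, max(0.0, -g_dot_d / (L * d_norm2)))
    # increase L until the quadratic upper bound holds at the trial point
    while f(x + delta * direction) > fx + delta * g_dot_d + 0.5 * L * delta**2 * d_norm2:
        L *= tau1
        delta = min(delta_max, max(0.0, -g_dot_d / (L * d_norm2)))
    mu = mu_prev                                   # placeholder: mu estimation is in Supplement D
    return delta, L, mu
```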

We focus on the 1\ell_{1}-constrained logistic regression problem with the form:

min𝒙1β1mi=1mln(1+exp(bi𝐚i,𝒙))+λ2𝒙2,\min_{\lVert{\boldsymbol{x}}\rVert_{1}\leq\beta}\frac{1}{m}\sum_{i=1}^{m}\text{ln}(1+\text{exp}(-b_{i}\langle\mathbf{a}_{i},{\boldsymbol{x}}\rangle))+\frac{\lambda}{2}\lVert{\boldsymbol{x}}\rVert^{2},

where A=[𝐚1,,𝐚m]m×nA=[\mathbf{a}_{1},\dots,\mathbf{a}_{m}]\in\mathbb{R}^{m\times n} and 𝒃m{\boldsymbol{b}}\in\mathbb{R}^{m}. We use the dataset Madelon [17], which has m=4400,n=500m=4400,n=500, and full density (i.e., density=1\text{density}=1). We set β=1\beta=1 and λ=1/n\lambda=1/n. We compare our methods against AFW, PFW, and the standard FW, all of which are equipped with the backtracking technique. The results are presented in Figure 4. It can be observed that our two methods achieve the best performance in terms of running time. Although AFW and PFW perform well in terms of iteration count, their overall efficiency is hindered by the increasing cost of maintaining a growing active set and computing the away direction.

[Figure 4: two panels (a)-(b)]
Figure 4: FW gap vs time/iterations on the 1\ell_{1}-constrained logistic regression problem.

5.4 rSFW Framework Combined with the Away Step Technique

In this subsection, we demonstrate that the well-known linearly convergent variants of the standard Frank-Wolfe method, AFW and PFW [22], can be seamlessly integrated into the inner loop of the rSFW framework. This straightforward combination leads to a significant performance improvement over the original AFW and PFW methods.

We focus on the simplex-constrained least squares problem min𝒙SnA𝒙𝒃22\min_{{\boldsymbol{x}}\in S_{n}}\lVert A{\boldsymbol{x}}-{\boldsymbol{b}}\rVert_{2}^{2}, where Am×n,m=800,n=200A\in\mathbb{R}^{m\times n},m=800,n=200, with standard Gaussian entries. We set 𝒃=A𝒙{\boldsymbol{b}}=A{\boldsymbol{x}}^{*}, where 𝒙{\boldsymbol{x}}^{*} is constructed by first generating a random nonnegative vector with sparsity parameter d=0.6d=0.6 and then normalizing it so that its components sum to 1. Thus, 0 is the optimal value of this problem. We use the same initial point 𝒙0=𝟏n/n{\boldsymbol{x}}_{0}={\boldsymbol{1}}_{n}/n for all methods.

In comparison to the experiments in the previous subsection, we introduce four additional algorithms: AFW and PFW, along with their respective versions integrated into the rSFW framework, denoted as rSFW-A and rSFW-P. Details of these methods are provided in Table 3.

We briefly explain here how the direction-correction 𝒈{\boldsymbol{g}} is computed within the general framework (1.2) for both the rSFW-A and rSFW-P algorithms. Based on Alg. 3, during the kk-th outer loop and the jj-th inner loop, let S(k,j)𝒱(S(𝒙^k1,d^k1))S^{(k,j)}\subset\mathcal{V}(S(\widehat{{\boldsymbol{x}}}_{k-1},\widehat{d}_{k-1})) denote the active set corresponding to the point 𝒑j1{\boldsymbol{p}}_{j-1}. Thus, 𝒑j1{\boldsymbol{p}}_{j-1} can be represented as 𝒑j1=𝒗S(k,j)α𝒗𝒗{\boldsymbol{p}}_{j-1}=\sum_{{\boldsymbol{v}}\in S^{(k,j)}}\alpha_{{\boldsymbol{v}}}{\boldsymbol{v}} where α𝒗>0\alpha_{{\boldsymbol{v}}}>0. Let 𝒗j=argmax𝒗S(k,j)f(𝒑j1),𝒗{\boldsymbol{v}}_{j}={\arg\max}_{{\boldsymbol{v}}\in S^{(k,j)}}\langle\nabla f({\boldsymbol{p}}_{j-1}),{\boldsymbol{v}}\rangle. For the rSFW-A method, the direction-correction is computed as:

{\boldsymbol{g}}_{j}=\begin{cases}\frac{1}{1-\alpha_{{\boldsymbol{v}}_{j}}}{\boldsymbol{p}}_{j-1}-{\boldsymbol{y}}_{j}-\frac{\alpha_{{\boldsymbol{v}}_{j}}}{1-\alpha_{{\boldsymbol{v}}_{j}}}{\boldsymbol{v}}_{j}&\text{if }\Delta_{j}<0,\\ {\boldsymbol{0}}&\text{if }\Delta_{j}\geq 0,\end{cases} (5.2)

where \Delta_{j}:=\langle-\nabla f({\boldsymbol{p}}_{j-1}),{\boldsymbol{y}}_{j}-{\boldsymbol{p}}_{j-1}\rangle-\langle-\nabla f({\boldsymbol{p}}_{j-1}),{\boldsymbol{p}}_{j-1}-{\boldsymbol{v}}_{j}\rangle.

For the rSFW-P method, the direction-correction is computed as:

𝒈j=𝒑j1(1α𝒗j)𝒚jα𝒗j𝒗j.{\boldsymbol{g}}_{j}={\boldsymbol{p}}_{j-1}-(1-\alpha_{{\boldsymbol{v}}_{j}}){\boldsymbol{y}}_{j}-\alpha_{{\boldsymbol{v}}_{j}}{\boldsymbol{v}}_{j}. (5.3)

Finally, we update the point 𝒑j{\boldsymbol{p}}_{j} using the iteration:

𝒑j(1δj)𝒑j1+δj(𝒚j+𝒈j),{\boldsymbol{p}}_{j}\leftarrow(1-\delta_{j}){\boldsymbol{p}}_{j-1}+\delta_{j}({\boldsymbol{y}}_{j}+{\boldsymbol{g}}_{j}), (5.4)

which replaces the original iteration.
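The pieces (5.2)-(5.4) translate directly into a few lines of Python; the sketch below (function names are ours, and the gradient is evaluated at 𝒑j1{\boldsymbol{p}}_{j-1}) is meant only to make the bookkeeping explicit.

```python
import numpy as np

def pairwise_correction(p_prev, y_j, v_j, alpha_vj):
    """Direction-correction (5.3) for rSFW-P."""
    return p_prev - (1.0 - alpha_vj) * y_j - alpha_vj * v_j

def away_correction(p_prev, y_j, v_j, alpha_vj, grad_p):
    """Direction-correction (5.2) for rSFW-A; grad_p is the gradient at p_{j-1}."""
    fw_progress = -grad_p @ (y_j - p_prev)      # progress along the FW direction
    away_progress = -grad_p @ (p_prev - v_j)    # progress along the away direction
    if fw_progress - away_progress < 0.0 and alpha_vj < 1.0:   # Delta_j < 0
        return p_prev / (1.0 - alpha_vj) - y_j - alpha_vj / (1.0 - alpha_vj) * v_j
    return np.zeros_like(p_prev)

def corrected_update(p_prev, y_j, g_j, delta_j):
    """Update (5.4): p_j = (1 - delta_j) * p_{j-1} + delta_j * (y_j + g_j)."""
    return (1.0 - delta_j) * p_prev + delta_j * (y_j + g_j)
```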

[Figure 5: two panels (a)-(b)]
Figure 5: FW gap vs time/iterations on the Simplex-constrained least squares problem with (m,n)=(800,200)(m,n)=(800,200).

The results are given in Figure 5. rSFW-P demonstrates superior performance compared to all other algorithms, excelling in both the number of iterations and running time, achieving nearly twice the efficiency of PFW. Additionally, both framework-based acceleration algorithms exhibit substantial performance improvements over the standalone rSFW framework.

6 Conclusion

In this paper, we introduced a novel oracle, SLMO, which leverages the advantageous geometric properties of the unit simplex. This design enables SLMO to be implemented with the same computational complexity as the standard linear optimization oracle, preserving the efficiency of the Frank-Wolfe framework. Building on this oracle, we proposed two new variants of the classical Frank-Wolfe algorithm: the Simplex Frank-Wolfe (SFW) and refined Simplex Frank-Wolfe (rSFW) algorithms. Both methods achieve linear convergence for smooth and strongly convex optimization problems over polytopes. The linear convergence rates of these methods depend only on the condition number of the objective function, the polytope's geometric quantities, and the problem's dimension, demonstrating their scalability and robustness in various settings.

The purpose of this paper is to develop the basic framework for the new SFW methods and to demonstrate that they are highly competitive. We do so in the simplest setting, namely ff being strongly convex and smooth. We made no attempt to weaken this assumption beyond pointing out that the obtained results should also hold under the quadratic growth condition. An immediate question would be to extend the methods to the general convex or even nonconvex setting. Furthermore, the LLOO proposed in [14] is an elegant framework, yet it has been largely omitted from recent surveys on FW methods. To the best of our knowledge, SFW is the first LLOO instance that has been extensively tested and compared with other popular FW methods. An intriguing question is whether there exist alternative LLOO approaches that simultaneously satisfy the following criteria: (i) adhering to the LLOO framework, (ii) enabling fast and accurate computation, and (iii) incurring significantly lower overhead compared to projection-based methods. We leave these topics to our future research.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (Grant No. 12171271) and by Hong Kong RGC General Research Fund PolyU/15303124.

Appendix

Appendix A Proof of Lemma 3.1

Proof.

(1) By the definition of the simplex ball S(𝒙,d)S({\boldsymbol{x}},d), we have

S(𝟏n/n,1/n)\displaystyle S({\boldsymbol{1}}_{n}/n,1/n) =1n𝟏n+n×1nS0=1n𝟏n+S0=Sn.\displaystyle=\frac{1}{n}{\boldsymbol{1}}_{n}+n\times\frac{1}{n}S_{0}=\frac{1}{n}{\boldsymbol{1}}_{n}+S_{0}=S_{n}.

Furthermore, we have

Conv{n𝒆i𝟏n:i[n]}=nConv{𝒆i:i[n]}𝟏n=nSn𝟏n=nS0.\mbox{Conv}\left\{n{\boldsymbol{e}}_{i}-{\boldsymbol{1}}_{n}:\ i\in[n]\right\}=n\mbox{Conv}\left\{{\boldsymbol{e}}_{i}:\ i\in[n]\right\}-{\boldsymbol{1}}_{n}=nS_{n}-{\boldsymbol{1}}_{n}=nS_{0}.

Consequently, the characterization (3.2) holds.

(2) On one hand, for every 𝒚S(𝒙,d)Sn{\boldsymbol{y}}\in S({\boldsymbol{x}},d)\cap S_{n}, there exists 𝝀Sn\boldsymbol{\lambda}\in S_{n} such that yi=xid+ndλi,i[n]y_{i}=x_{i}-d+nd\lambda_{i},\forall i\in[n]. Define for each i[n]i\in[n],

λ^i={ndλij=1nmin{d,xj}if xid,xid+ndλij=1nmin{d,xj}if xi<d.\widehat{\lambda}_{i}=\left\{\begin{array}[]{ll}\frac{nd\lambda_{i}}{\sum_{j=1}^{n}\min\{d,x_{j}\}}&\text{if }x_{i}\geq d,\\[8.61108pt] \frac{x_{i}-d+nd\lambda_{i}}{\sum_{j=1}^{n}\min\{d,x_{j}\}}&\text{if }x_{i}<d.\end{array}\right.

Since 𝒚Sn{\boldsymbol{y}}\in S_{n} and 𝝀Sn\boldsymbol{\lambda}\in S_{n}, we have ndλi0nd\lambda_{i}\geq 0 and xid+ndλi=yi0x_{i}-d+nd\lambda_{i}=y_{i}\geq 0, which shows that λ^i0\widehat{\lambda}_{i}\geq 0 for each i[n]i\in[n]. Moreover, let :={i[n]xi<d}{\mathcal{I}}_{-}:=\{i\in[n]\mid x_{i}<d\} be an index set. Then we have

i=1nλ^i=i(xid)+ndi=1nmin{d,xi}=ixi+(n||)dixi+(n||)d=1.\sum_{i=1}^{n}\widehat{\lambda}_{i}=\frac{\sum_{i\in{\mathcal{I}}_{-}}(x_{i}-d)+nd}{\sum_{i=1}^{n}\min\{d,x_{i}\}}=\frac{\sum_{i\in{\mathcal{I}}_{-}}x_{i}+(n-|{\mathcal{I}}_{-}|)d}{\sum_{i\in{\mathcal{I}}_{-}}x_{i}+(n-|{\mathcal{I}}_{-}|)d}=1.

As above, we have verified that 𝝀^Sn\widehat{\boldsymbol{\lambda}}\in S_{n}. We now show that 𝒚=(𝒙^d^𝟏n)+nd^𝝀^{\boldsymbol{y}}=(\widehat{{\boldsymbol{x}}}-\widehat{d}{\boldsymbol{1}}_{n})+n\widehat{d}\widehat{\boldsymbol{\lambda}}, where 𝒙^\widehat{{\boldsymbol{x}}} and d^\widehat{d} are defined in (3.3). This follows from the fact that, for each i[n]i\in[n],

x^id^+nd^λ^i\displaystyle\widehat{x}_{i}-\widehat{d}+n\widehat{d}\;\widehat{\lambda}_{i}
=\displaystyle= max{xi,d}+d^dd^+nj=1nmin{d,xj}n×min{xid,0}+ndλij=1nmin{d,xj}\displaystyle\max\{x_{i},d\}+\widehat{d}-d-\widehat{d}+n\frac{\sum_{j=1}^{n}\min\{d,x_{j}\}}{n}\times\frac{\min\{x_{i}-d,0\}+nd\lambda_{i}}{\sum_{j=1}^{n}\min\{d,x_{j}\}}
=\displaystyle= max{xi,d}d+min{xid,0}+ndλi=xid+ndλi=yi.\displaystyle\max\{x_{i},d\}-d+\min\{x_{i}-d,0\}+nd\lambda_{i}=\;x_{i}-d+nd\lambda_{i}=y_{i}.

Thus, we have 𝒚S(𝒙^,d^){\boldsymbol{y}}\in S(\widehat{{\boldsymbol{x}}},\widehat{d}), leading to SnS(𝒙,d)S(𝒙^,d^)S_{n}\cap S({\boldsymbol{x}},d)\subset S(\widehat{{\boldsymbol{x}}},\widehat{d}).

On the other hand, for every 𝒚S(𝒙^,d^){\boldsymbol{y}}\in S(\widehat{{\boldsymbol{x}}},\widehat{d}), there exists 𝝀^Sn\widehat{\boldsymbol{\lambda}}\in S_{n} such that yi=x^id^+nd^λ^iy_{i}=\widehat{x}_{i}-\widehat{d}+n\widehat{d}\;\widehat{\lambda}_{i}. Due to the fact that for i[n]i\in[n], yix^id^=max{xi,d}d0y_{i}\geq\widehat{x}_{i}-\widehat{d}=\max\{x_{i},d\}-d\geq 0 and

i=1nyi=i=1nmax{xi,d}+nd^nd=i=1nmax{xi,d}+i=1nmin{xi,d}nd=i=1nxi=1,\sum_{i=1}^{n}y_{i}=\sum_{i=1}^{n}\max\{x_{i},d\}+n\widehat{d}-nd=\sum_{i=1}^{n}\max\{x_{i},d\}+\sum_{i=1}^{n}\min\{x_{i},d\}-nd=\sum_{i=1}^{n}x_{i}=1,

we have 𝒚Sn{\boldsymbol{y}}\in S_{n}. We now turn to show that 𝒚S(𝒙,d){\boldsymbol{y}}\in S({\boldsymbol{x}},d).

Let λi:=max{xi,d}xi+j=1nmin{xj,d}λ^ind\lambda_{i}:=\frac{\max\{x_{i},d\}-x_{i}+\sum_{j=1}^{n}\min\{x_{j},d\}\widehat{\lambda}_{i}}{nd}. It is not difficult to verify that λi0\lambda_{i}\geq 0 and i=1nλi=1\sum_{i=1}^{n}\lambda_{i}=1. Moreover, we have

xid+ndλi=xid+max{xi,d}xi+j=1nmin{xj,d}λ^i\displaystyle x_{i}-d+nd\lambda_{i}=x_{i}-d+\max\{x_{i},d\}-x_{i}+\sum_{j=1}^{n}\min\{x_{j},d\}\widehat{\lambda}_{i}
=\displaystyle= (max{xi,d}+d^d)d^+nd^λ^i=x^id^+nd^λ^i=yi,\displaystyle(\max\{x_{i},d\}+\widehat{d}-d)-\widehat{d}+n\widehat{d}\;\widehat{\lambda}_{i}=\widehat{x}_{i}-\widehat{d}+n\widehat{d}\;\widehat{\lambda}_{i}=y_{i},

implying 𝒚S(𝒙,d){\boldsymbol{y}}\in S({\boldsymbol{x}},d). Thus S(𝒙^,d^)SnS(𝒙,d)S(\widehat{{\boldsymbol{x}}},\widehat{d})\subset S_{n}\cap S({\boldsymbol{x}},d). This finishes the proof for (3.3).

We now proceed to prove (3.4). Following the definition of S(𝒙,d)S({\boldsymbol{x}},d), we have

S(𝒙,d)=(nd)Sn+(𝒙d𝟏n).S({\boldsymbol{x}},d)=(nd)S_{n}+({\boldsymbol{x}}-d{\boldsymbol{1}}_{n}).

Translating to (𝒙1,d1)({\boldsymbol{x}}_{1},d_{1}) and (𝒙2,d2)({\boldsymbol{x}}_{2},d_{2}), we have

S(𝒙1,d1)\displaystyle S({\boldsymbol{x}}_{1},d_{1}) =(nd1)Sn+(𝒙1d1𝟏n),\displaystyle=(nd_{1})S_{n}+({\boldsymbol{x}}_{1}-d_{1}{\boldsymbol{1}}_{n}),
S(𝒙2,d2)\displaystyle S({\boldsymbol{x}}_{2},d_{2}) =(nd2)Sn+(𝒙2d2𝟏n)\displaystyle=(nd_{2})S_{n}+({\boldsymbol{x}}_{2}-d_{2}{\boldsymbol{1}}_{n})
=(nd1)[n×d2nd1Sn+(𝒙2(𝒙1d1𝟏n)nd1d2nd1𝟏n)]+(𝒙1d1𝟏n)\displaystyle=(nd_{1})\left[n\times\frac{d_{2}}{nd_{1}}S_{n}+\left(\frac{{\boldsymbol{x}}_{2}-({\boldsymbol{x}}_{1}-d_{1}{\boldsymbol{1}}_{n})}{nd_{1}}-\frac{d_{2}}{nd_{1}}{\boldsymbol{1}}_{n}\right)\right]+({\boldsymbol{x}}_{1}-d_{1}{\boldsymbol{1}}_{n})
=nd1S(𝒙2(𝒙1d1𝟏n)nd1,d2nd1)+(𝒙1d1𝟏n).\displaystyle=nd_{1}S\left(\frac{{\boldsymbol{x}}_{2}-({\boldsymbol{x}}_{1}-d_{1}{\boldsymbol{1}}_{n})}{nd_{1}},\frac{d_{2}}{nd_{1}}\right)+({\boldsymbol{x}}_{1}-d_{1}{\boldsymbol{1}}_{n}).

Therefore,

S({\boldsymbol{x}}_{1},d_{1})\cap S({\boldsymbol{x}}_{2},d_{2})=({\boldsymbol{x}}_{1}-d_{1}{\boldsymbol{1}}_{n})+(nd_{1})\left[S_{n}\cap S\left(\frac{{\boldsymbol{x}}_{2}-({\boldsymbol{x}}_{1}-d_{1}{\boldsymbol{1}}_{n})}{nd_{1}},\frac{d_{2}}{nd_{1}}\right)\right].

Using (3.3), we have

SnS(𝒙2(𝒙1d1𝟏n)nd1,d2nd1)=S(𝒙^,d^),S_{n}\cap S\left(\frac{{\boldsymbol{x}}_{2}-({\boldsymbol{x}}_{1}-d_{1}{\boldsymbol{1}}_{n})}{nd_{1}},\frac{d_{2}}{nd_{1}}\right)=S(\widehat{{\boldsymbol{x}}},\widehat{d}),

where

d^\displaystyle\widehat{d} =1ni=1nmin{d2nd1,x2(i)x1(i)+d1nd1}=i=1nmin{d2,x2(i)x1(i)+d1}n2d1\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\min\left\{\frac{d_{2}}{nd_{1}},\frac{x_{2}(i)-x_{1}(i)+d_{1}}{nd_{1}}\right\}=\frac{\sum_{i=1}^{n}\min\{d_{2},x_{2}(i)-x_{1}(i)+d_{1}\}}{n^{2}d_{1}}
=(a)1+i=1nmin{d1x1(i),d2x2(i)}n2d1\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\frac{1+\sum_{i=1}^{n}\min\{d_{1}-x_{1}(i),d_{2}-x_{2}(i)\}}{n^{2}d_{1}}
x^i\displaystyle\widehat{x}_{i} =max{d2nd1,x2(i)x1(i)+d1nd1}+(d^d2nd1),i[n].\displaystyle=\max\left\{\frac{d_{2}}{nd_{1}},\frac{x_{2}(i)-x_{1}(i)+d_{1}}{nd_{1}}\right\}+\left(\widehat{d}-\frac{d_{2}}{nd_{1}}\right),\quad\forall i\in[n].

Here, (a)(a) follows from the fact 𝒙2Sn{\boldsymbol{x}}_{2}\in S_{n}. We then have

S(𝒙1,d1)S(𝒙2,d2)\displaystyle S({\boldsymbol{x}}_{1},d_{1})\cap S({\boldsymbol{x}}_{2},d_{2}) =\displaystyle= (nd1)S(𝒙^,d^)+(𝒙1d1𝟏n)\displaystyle(nd_{1})S(\widehat{{\boldsymbol{x}}},\widehat{d})+({\boldsymbol{x}}_{1}-d_{1}{\boldsymbol{1}}_{n})
=\displaystyle= (nd1)[(nd^)Sn+(𝒙^d^𝟏n)]+(𝒙1d1𝟏n)\displaystyle(nd_{1})\left[(n\widehat{d})S_{n}+(\widehat{{\boldsymbol{x}}}-\widehat{d}{\boldsymbol{1}}_{n})\right]+({\boldsymbol{x}}_{1}-d_{1}{\boldsymbol{1}}_{n})
=\displaystyle= n(nd1d^)Sn+[(𝒙1+nd1𝒙^d1𝟏n)(nd1d^)𝟏n]\displaystyle n(nd_{1}\widehat{d})S_{n}+\left[({\boldsymbol{x}}_{1}+nd_{1}\widehat{{\boldsymbol{x}}}-d_{1}{\boldsymbol{1}}_{n})-(nd_{1}\widehat{d}){\boldsymbol{1}}_{n}\right]
=\displaystyle= S(𝒙1+nd1𝒙^d1𝟏n,nd1d^)=S(𝒙3,d3),\displaystyle S\Big({\boldsymbol{x}}_{1}+nd_{1}\widehat{{\boldsymbol{x}}}-d_{1}{\boldsymbol{1}}_{n},\ nd_{1}\widehat{d}\Big)=S({\boldsymbol{x}}_{3},d_{3}),

where

d3\displaystyle d_{3} =nd1d^=1+i=1nmin{d1x1(i),d2x2(i)}n,\displaystyle=nd_{1}\widehat{d}=\frac{1+\sum_{i=1}^{n}\min\{d_{1}-x_{1}(i),d_{2}-x_{2}(i)\}}{n},
x3(i)\displaystyle x_{3}(i) =nd1x^(i)+(x1(i)d1)\displaystyle=nd_{1}\widehat{x}(i)+(x_{1}{(i)}-d_{1})
=max{d2,x2(i)x1(i)+d1}+(nd1d^d2)+(x1(i)d1)\displaystyle=\max\{d_{2},x_{2}(i)-x_{1}(i)+d_{1}\}+(nd_{1}\widehat{d}-d_{2})+(x_{1}{(i)}-d_{1})
=max{d2,x2(i)x1(i)+d1}+(x1(i)d1d2)+d3\displaystyle=\max\{d_{2},x_{2}(i)-x_{1}(i)+d_{1}\}+(x_{1}{(i)}-d_{1}-d_{2})+d_{3}
=max{x1(i)d1,x2(i)d2}+d3.\displaystyle=\max\{x_{1}(i)-d_{1},x_{2}(i)-d_{2}\}+d_{3}.

Thus, we have proven (3.4).

(3) Since the optimal solution 𝒚{\boldsymbol{y}}^{*} lies at an extreme point of S(𝒙,d)S({\boldsymbol{x}},d), we have

𝒚\displaystyle{\boldsymbol{y}}^{*} =argmin𝒚=𝒙+d(n𝒆i𝟏n),i[n]𝒄,𝒙+d(n𝒆i𝟏n)\displaystyle={\operatorname*{\arg\min}}_{{\boldsymbol{y}}={\boldsymbol{x}}+d(n{\boldsymbol{e}}_{i}-{\boldsymbol{1}}_{n}),i\in[n]}\langle{\boldsymbol{c}},{\boldsymbol{x}}+d(n{\boldsymbol{e}}_{i}-{\boldsymbol{1}}_{n})\rangle
=argmin𝒚=𝒙+d(n𝒆i𝟏n),i[n]𝒄,ei=𝒙+d(n𝒆i𝟏n),\displaystyle={\operatorname*{\arg\min}}_{{\boldsymbol{y}}={\boldsymbol{x}}+d(n{\boldsymbol{e}}_{i}-{\boldsymbol{1}}_{n}),i\in[n]}\langle{\boldsymbol{c}},e_{i}\rangle={\boldsymbol{x}}+d(n{\boldsymbol{e}}_{i^{*}}-{\boldsymbol{1}}_{n}),

where i=argmini[n]cii^{*}=\operatorname*{\arg\min}_{i\in[n]}c_{i}.

(4) By (3.1), for any points 𝒚1,𝒚2S(𝒙,d){\boldsymbol{y}}_{1},{\boldsymbol{y}}_{2}\in S({\boldsymbol{x}},d), there exist 𝝀1,𝝀2Sn\boldsymbol{\lambda}_{1},\boldsymbol{\lambda}_{2}\in S_{n} such that 𝒚i=(𝒙d𝟏n)+nd𝝀i{\boldsymbol{y}}_{i}=({\boldsymbol{x}}-d{\boldsymbol{1}}_{n})+nd\boldsymbol{\lambda}_{i} for i[2]i\in[2]. Thus, we have

max𝒚1,𝒚2S(𝒙,d)𝒚1𝒚2=ndmax𝝀1,𝝀2Sn𝝀1𝝀2=2nd.\max_{{\boldsymbol{y}}_{1},{\boldsymbol{y}}_{2}\in S({\boldsymbol{x}},d)}\lVert{\boldsymbol{y}}_{1}-{\boldsymbol{y}}_{2}\rVert=nd\cdot\max_{\boldsymbol{\lambda}_{1},\boldsymbol{\lambda}_{2}\in S_{n}}\lVert\boldsymbol{\lambda}_{1}-\boldsymbol{\lambda}_{2}\rVert=\sqrt{2}nd.

(5) Let 𝝀=1n(𝟏n+𝒚𝒙d)\boldsymbol{\lambda}=\frac{1}{n}({\boldsymbol{1}}_{n}+\frac{{\boldsymbol{y}}-{\boldsymbol{x}}}{d}). Since 𝒚𝒙d\lVert{\boldsymbol{y}}-{\boldsymbol{x}}\rVert\leq d, we have λi0\lambda_{i}\geq 0. Moreover, since 𝒙,𝒚Sn{\boldsymbol{x}},{\boldsymbol{y}}\in S_{n}, we have i=1nλi=1\sum_{i=1}^{n}\lambda_{i}=1, which implies 𝝀Sn\boldsymbol{\lambda}\in S_{n}. Thus by (3.1), 𝒚=(𝒙d𝟏n)+nd𝝀S(𝒙,d){\boldsymbol{y}}=({\boldsymbol{x}}-d{\boldsymbol{1}}_{n})+nd\boldsymbol{\lambda}\in S({\boldsymbol{x}},d). We now turn to the last part of Lemma 3.1(5). This simply follows from

max𝒚S(𝒙,d)𝒚𝒙=ndmax𝝀Sn𝝀𝟏nn=n(n1)dnd.\max_{{\boldsymbol{y}}\in S({\boldsymbol{x}},d)}\lVert{\boldsymbol{y}}-{\boldsymbol{x}}\rVert=nd\cdot\max_{\boldsymbol{\lambda}\in S_{n}}\lVert\boldsymbol{\lambda}-\frac{{\boldsymbol{1}}_{n}}{n}\rVert=\sqrt{n(n-1)}d\leq nd.

The proof is completed. ∎

Appendix B Proofs for Section 4

B.1 A Useful Bound

The proof of Lemma 4.3 relies on the following lemma, whose proof used some key technical results established in [14]. In particular, for given 𝒙,𝒚𝒫{\boldsymbol{x}},{\boldsymbol{y}}\in\mathcal{P} and 𝝀(𝒙)\boldsymbol{\lambda}\in\mathcal{M}({{\boldsymbol{x}}}), there must exist 𝒛𝒫{\boldsymbol{z}}\in\mathcal{P} and γ[0,1]\gamma\in[0,1] such that

{\boldsymbol{y}}=\gamma{\boldsymbol{x}}+(1-\gamma){\boldsymbol{z}}=\gamma\sum_{j=1}^{N}\lambda_{j}{\boldsymbol{v}}_{j}+(1-\gamma){\boldsymbol{z}}=\sum_{j=1}^{N}\Big(\lambda_{j}-\lambda_{j}(1-\gamma)\Big){\boldsymbol{v}}_{j}+(1-\gamma){\boldsymbol{z}}.

Let Δj:=λj(1γ)[0,λj]\Delta_{j}:=\lambda_{j}(1-\gamma)\in[0,\lambda_{j}]. We then have 1γ=j=1NΔj=:Δ.1-\gamma=\sum_{j=1}^{N}\Delta_{j}=:\Delta. Put another way, the point 𝒚{\boldsymbol{y}} can always be represented by

𝒚=j=1N(λjΔj)𝒗j+Δ𝒛,{\boldsymbol{y}}=\sum_{j=1}^{N}(\lambda_{j}-\Delta_{j}){\boldsymbol{v}}_{j}+\Delta{\boldsymbol{z}}, (B.1)

for some 𝒛𝒫{\boldsymbol{z}}\in\mathcal{P}, Δj[0,λj]\Delta_{j}\in[0,\lambda_{j}]. Since 𝒫\mathcal{P} is compact, there must exist a representation of (B.1) with the smallest Δ\Delta among all such representations. An important fact established in [14, Lemma 5.3] is that this minimal value of Δ\Delta can be bounded. We refine this bound below for the largest Δj\Delta_{j} in Δ\Delta.

Lemma B.1.

Let 𝐱,𝐲𝒫{\boldsymbol{x}},{\boldsymbol{y}}\in\mathcal{P} with 𝛌(𝐱)\boldsymbol{\lambda}\in\mathcal{M}({\boldsymbol{x}}), and let 𝐲{\boldsymbol{y}} be represented as in (B.1) with Δ\Delta having been minimized. Then it holds that

maxi[N]{Δi}ψξ𝒙𝒚.\max_{i\in[N]}\{\Delta_{i}\}\leq\frac{\psi}{\xi}\lVert{\boldsymbol{x}}-{\boldsymbol{y}}\rVert.
Proof.

The claim is trivial for the case i=1NΔi=0\sum_{i=1}^{N}\Delta_{i}=0. Now we suppose that i=1NΔi>0\sum_{i=1}^{N}\Delta_{i}>0 (i.e., at least one Δi>0\Delta_{i}>0). The following index sets C(𝒛)C({\boldsymbol{z}}) and C0(𝒛)C_{0}({\boldsymbol{z}}) are defined in [14]; we briefly describe them and use some established results relating to them. Denote the index set C(𝒛):={j[m]A2(j)𝒛=b2(j)}C({\boldsymbol{z}}):=\{j\in[m]\mid A_{2}(j){\boldsymbol{z}}=b_{2}(j)\}. By [14, Lemma 5.3], we have C(𝒛)C({\boldsymbol{z}})\neq\emptyset since at least one Δi>0\Delta_{i}>0. Let C0(𝒛)C(𝒛)C_{0}({\boldsymbol{z}})\subseteq C({\boldsymbol{z}}) be such that the set {A2(j)}jC0(𝒛)\{A_{2}(j)\}_{j\in C_{0}({\boldsymbol{z}})} forms a basis for the set {A2(j)}jC(𝒛)\{A_{2}(j)\}_{j\in C({\boldsymbol{z}})}. Denote by A2,z|C0(𝒛)|×nA_{2,{z}}\in\mathbb{R}^{|C_{0}({\boldsymbol{z}})|\times n} the matrix whose rows are {A2(j)}jC0(𝒛)\{A_{2}(j)\}_{j\in C_{0}({\boldsymbol{z}})}. By definition we have A2,zψ\|A_{2,{z}}\|\leq\psi. Then we obtain

𝒙𝒚2\displaystyle\|{\boldsymbol{x}}-{\boldsymbol{y}}\|^{2} =i[N]:Δi>0Δi(𝒗i𝒛)2\displaystyle=\left\|\sum_{i\in[N]:\Delta_{i}>0}\Delta_{i}({\boldsymbol{v}}_{i}-{\boldsymbol{z}})\right\|^{2}
1ψ2jC0(𝒛)(i[N]:Δi>0Δi(b2(j)A2(j)𝒗i))2\displaystyle\geq\frac{1}{\psi^{2}}\sum_{j\in C_{0}({\boldsymbol{z}})}\left(\sum_{i\in[N]:\Delta_{i}>0}\Delta_{i}(b_{2}(j)-A_{2}(j){\boldsymbol{v}}_{i})\right)^{2}
(a)1ψ2jC0(𝒛)i[N]:Δi>0Δi2(b2(j)A2(j)𝒗i)2\displaystyle\stackrel{{\scriptstyle(a)}}{{\geq}}\frac{1}{\psi^{2}}\sum_{j\in C_{0}({\boldsymbol{z}})}\sum_{i\in[N]:\Delta_{i}>0}\Delta_{i}^{2}(b_{2}(j)-A_{2}(j){\boldsymbol{v}}_{i})^{2}
=1ψ2i[N]:Δi>0jC0(𝒛)Δi2(b2(j)A2(j)𝒗i)2,\displaystyle=\frac{1}{\psi^{2}}\sum_{i\in[N]:\Delta_{i}>0}\sum_{j\in C_{0}({\boldsymbol{z}})}\Delta_{i}^{2}(b_{2}(j)-A_{2}(j){\boldsymbol{v}}_{i})^{2},

where the first inequality is established in the proof of [14, Lemma 5.5], and (a)(a) follows from the fact that b2(j)A2(j)𝒗i0b_{2}(j)-A_{2}(j){\boldsymbol{v}}_{i}\geq 0 for any i[N]i\in[N] and any jC0(𝒛)j\in C_{0}({\boldsymbol{z}}). Combining [14, Lemma 5.3] and [14, Lemma 5.4], we obtain that for every i[N]i\in[N] with Δi>0\Delta_{i}>0 there exists jC0(𝒛)j\in C_{0}({\boldsymbol{z}}) such that b2(j)A2(j)𝒗iξb_{2}(j)-A_{2}(j){\boldsymbol{v}}_{i}\geq\xi. Hence,

𝒙𝒚2ξ2ψ2i[N]:Δi>0Δi2ξ2ψ2maxi[N]{Δi2}.\|{\boldsymbol{x}}-{\boldsymbol{y}}\|^{2}\geq\frac{\xi^{2}}{\psi^{2}}\sum_{i\in[N]:\Delta_{i}>0}\Delta_{i}^{2}\geq\frac{\xi^{2}}{\psi^{2}}\max_{i\in[N]}\{\Delta_{i}^{2}\}.

Thus we conclude that maxi[N]{Δi}ψξ𝒙𝒚\max_{i\in[N]}\{\Delta_{i}\}\leq\frac{\psi}{\xi}\lVert{\boldsymbol{x}}-{\boldsymbol{y}}\rVert. ∎

B.2 Proof of Lemma 4.3

Proof.

We begin by proving the first part. Write 𝒙=i=1Nλi𝒗i{\boldsymbol{x}}=\sum_{i=1}^{N}\lambda_{i}{\boldsymbol{v}}_{i} for 𝝀SN\boldsymbol{\lambda}\in S_{N} and express 𝒚=i=1N(λiΔi)𝒗i+(i=1NΔi)𝒛{\boldsymbol{y}}=\sum_{i=1}^{N}(\lambda_{i}-\Delta_{i}){\boldsymbol{v}}_{i}+(\sum_{i=1}^{N}\Delta_{i}){\boldsymbol{z}}, where Δi[0,λi],i[N]\Delta_{i}\in[0,\lambda_{i}],\forall i\in[N] and 𝒛𝒫{\boldsymbol{z}}\in\mathcal{P}. Here, the sum Δ=i=1NΔi\Delta=\sum_{i=1}^{N}\Delta_{i} is minimized (as in Lemma B.1). We then have

maxi[N]{Δi}ψξ𝒙𝒚ψξ×dDη=d,\max_{i\in[N]}\{\Delta_{i}\}\leq\frac{\psi}{\xi}\lVert{\boldsymbol{x}}-{\boldsymbol{y}}\rVert\leq\frac{\psi}{\xi}\times\frac{dD}{\eta}=d,

where the first inequality uses Lemma B.1, the second inequality uses the assumption 𝒙𝒚(dD)/η\|{\boldsymbol{x}}-{\boldsymbol{y}}\|\leq(dD)/\eta, and the last equality follows from the definition of η\eta in (2.6). Express 𝒛{\boldsymbol{z}} as 𝒛=i=1Nλi𝒗i{\boldsymbol{z}}=\sum_{i=1}^{N}\lambda_{i}^{\prime}{\boldsymbol{v}}_{i}, where 𝝀SN\boldsymbol{\lambda}^{\prime}\in S_{N}. We can then rewrite 𝒚{\boldsymbol{y}} as follows:

𝒚=i=1N(λiΔi+Δλi)𝒗i=i=1N((λid)+dΔi+Δλi)𝒗i.{\boldsymbol{y}}=\sum_{i=1}^{N}(\lambda_{i}-\Delta_{i}+\Delta\lambda_{i}^{\prime}){\boldsymbol{v}}_{i}=\sum_{i=1}^{N}\left((\lambda_{i}-d)+d-\Delta_{i}+\Delta\lambda_{i}^{\prime}\right){\boldsymbol{v}}_{i}.

Since maxi[N]{Δi}d\max_{i\in[N]}\{\Delta_{i}\}\leq d, we have dΔi+Δλi0d-\Delta_{i}+\Delta\lambda_{i}^{\prime}\geq 0 for all i[N]i\in[N]. Moreover, the sum \sum_{i=1}^{N}(d-\Delta_{i}+\Delta\lambda_{i}^{\prime})=Nd, which implies that \frac{(d-\Delta_{i}+\Delta\lambda_{i}^{\prime})_{i}}{Nd}\in S_{N}. By the definition in (3.1), we have 𝝀y:=(λiΔi+Δλi)S(𝝀,d)\boldsymbol{\lambda}_{y}:=(\lambda_{i}-\Delta_{i}+\Delta\lambda_{i}^{\prime})\in S(\boldsymbol{\lambda},d), thus 𝒚S𝒫(𝒙,d){\boldsymbol{y}}\in S_{\mathcal{P}}({\boldsymbol{x}},d). Furthermore, we have

𝒚,𝒄=𝝀y,𝒄ext𝝀,𝒄ext=𝒚,𝒄.\langle{\boldsymbol{y}},{\boldsymbol{c}}\rangle=\langle\boldsymbol{\lambda}_{y},{\boldsymbol{c}}_{ext}\rangle\geq\langle\boldsymbol{\lambda}^{*},{\boldsymbol{c}}_{ext}\rangle=\langle{\boldsymbol{y}}^{*},{\boldsymbol{c}}\rangle.

We now turn to prove the second part. Referring to Algorithm 4, we note that

𝝀+d𝟏N=max{𝝀x,d𝟏N}d𝟏N=𝝀xmin{𝝀x,d𝟏N}\boldsymbol{\lambda}_{+}-d{\boldsymbol{1}}_{N}=\max\{\boldsymbol{\lambda}_{x},d{\boldsymbol{1}}_{N}\}-d{\boldsymbol{1}}_{N}=\boldsymbol{\lambda}_{x}-\min\{\boldsymbol{\lambda}_{x},d{\boldsymbol{1}}_{N}\}

and

Nd^=i=1Nmin{λx(i),d}.N\widehat{d}=\sum_{i=1}^{N}\min\{\lambda_{x}(i),d\}.

Denote +(𝝀x):={i[N]λx(i)>0}{\mathcal{I}}_{+}(\boldsymbol{\lambda}_{x}):=\{i\in[N]\mid\lambda_{x}(i)>0\}, δi:=min{λx(i),d}\delta_{i}:=\min\{\lambda_{x}(i),d\}, and δ:=i+(𝝀x)δi\delta:=\sum_{i\in{\mathcal{I}}_{+}(\boldsymbol{\lambda}_{x})}\delta_{i}. Then the optimal solution 𝒚{\boldsymbol{y}}^{*} produced by Algorithm 4 satisfies

{\boldsymbol{y}}^{*}=V\big(\boldsymbol{\lambda}_{+}-d{\boldsymbol{1}}_{N}+N\widehat{d}\,{\boldsymbol{e}}_{i^{*}}\big)=\sum_{i\in{\mathcal{I}}_{+}(\boldsymbol{\lambda}_{x})}(\lambda_{x}(i)-\delta_{i}){\boldsymbol{v}}_{i}+\delta{\boldsymbol{v}}_{i^{*}}.

Thus we have that

𝒙𝒚\displaystyle\lVert{\boldsymbol{x}}-{\boldsymbol{y}}^{*}\rVert =i+(𝝀x)min{λx(i),d}(𝒗i𝒗i)\displaystyle=\lVert\sum_{i\in{\mathcal{I}}_{+}(\boldsymbol{\lambda}_{x})}\min\{\lambda_{x}(i),d\}({\boldsymbol{v}}_{i}-{\boldsymbol{v}}_{i^{*}})\rVert
i+(𝝀x)min{λx(i),d}𝒗i𝒗i|+(𝝀x)|dD(n+1)dD.\displaystyle\leq\sum_{i\in{\mathcal{I}}_{+}(\boldsymbol{\lambda}_{x})}\min\{\lambda_{x}(i),d\}\lVert{\boldsymbol{v}}_{i}-{\boldsymbol{v}}_{i^{*}}\rVert\leq|{\mathcal{I}}_{+}(\boldsymbol{\lambda}_{x})|dD\leq(n+1)dD.

B.3 Proof of Theorem 4.4

Proof.

The proof follows the framework of the proof of Theorem 3.4 and Lemma 4.3. We first claim that 𝒙S𝒫(𝒙k,ηDdk){\boldsymbol{x}}^{*}\in S_{\mathcal{P}}({\boldsymbol{x}}_{k},\frac{\eta}{D}d_{k}) and that f(𝒙k)Bkμdk22f({\boldsymbol{x}}_{k})-B_{k}\leq\frac{\mu d_{k}^{2}}{2}. We prove this by induction. First, we have

μd022=f(𝒙0)B0f(𝒙0)f(a)μ2𝒙0𝒙2,\frac{\mu d_{0}^{2}}{2}=f({\boldsymbol{x}}_{0})-B_{0}\geq f({\boldsymbol{x}}_{0})-f^{*}\stackrel{{\scriptstyle(a)}}{{\geq}}\frac{\mu}{2}\lVert{\boldsymbol{x}}_{0}-{\boldsymbol{x}}^{*}\rVert^{2},

where (a)(a) comes from (2.1). This implies that 𝒙0𝒙d0\lVert{\boldsymbol{x}}_{0}-{\boldsymbol{x}}^{*}\rVert\leq d_{0}, and by Lemma 4.3, we have 𝒙S𝒫(𝒙0,ηDd0){\boldsymbol{x}}^{*}\in S_{\mathcal{P}}({\boldsymbol{x}}_{0},\frac{\eta}{D}d_{0}). Therefore, the claim holds for k=0k=0.

Now suppose that 𝒙S𝒫(𝒙t,ηDdt){\boldsymbol{x}}^{*}\in S_{\mathcal{P}}({\boldsymbol{x}}_{t},\frac{\eta}{D}d_{t}) and f(𝒙t)Btμdt22f({\boldsymbol{x}}_{t})-B_{t}\leq\frac{\mu d_{t}^{2}}{2} for all tk1t\leq k-1. Let γ:=μ2L(n+1)2η2\gamma:=\frac{\mu}{2L(n+1)^{2}\eta^{2}}. In the same manner as the proof of Theorem 3.4, for the step size policies (2.3), (2.4), or (4.5), we have

f(𝒙k)Bk\displaystyle f({\boldsymbol{x}}_{k})-B_{k}\leq (1γ)(f(𝒙k1)Bk1)+Lγ22𝒚k𝒙k12\displaystyle(1-\gamma)(f({\boldsymbol{x}}_{k-1})-B_{k-1})+\frac{L\gamma^{2}}{2}\lVert{\boldsymbol{y}}_{k}-{\boldsymbol{x}}_{k-1}\rVert^{2}
(d)\displaystyle\stackrel{{\scriptstyle(d)}}{{\leq}} (1γ)μ2dk12+Lγ22(n+1)2η2dk12\displaystyle(1-\gamma)\frac{\mu}{2}d_{k-1}^{2}+\frac{L\gamma^{2}}{2}(n+1)^{2}\eta^{2}d_{k-1}^{2}
=\displaystyle= [(1γ)μ2+Lγ2(n+1)2η22]dk12,\displaystyle\left[(1-\gamma)\frac{\mu}{2}+\frac{L\gamma^{2}(n+1)^{2}\eta^{2}}{2}\right]d_{k-1}^{2},

where (d)(d) is due to our inductive hypothesis and Lemma 4.3. By plugging in the value of γ\gamma, and using 1xex1-x\leq e^{-x}, we have that

f(𝒙k)Bkμ2(1μ4L(n+1)2η2)dk12μ2eμ4L(n+1)2η2dk12.f({\boldsymbol{x}}_{k})-B_{k}\leq\frac{\mu}{2}(1-\frac{\mu}{4L(n+1)^{2}\eta^{2}})d_{k-1}^{2}\leq\frac{\mu}{2}e^{-\frac{\mu}{4L(n+1)^{2}\eta^{2}}}d_{k-1}^{2}.

Combining the above inequality with the fact that f(𝒙k)Bk=μ2(2(f(𝒙k)Bk)μ)2f({\boldsymbol{x}}_{k})-B_{k}=\frac{\mu}{2}(\sqrt{\frac{2(f({\boldsymbol{x}}_{k})-B_{k})}{\mu}})^{2}, and by the definition of dkd_{k}, we conclude that f(𝒙k)Bkμdk22f({\boldsymbol{x}}_{k})-B_{k}\leq\frac{\mu d_{k}^{2}}{2}. By the inductive hypothesis, we know that 𝒙S𝒫(𝒙t,ηDdt){\boldsymbol{x}}^{*}\in S_{\mathcal{P}}({\boldsymbol{x}}_{t},\frac{\eta}{D}d_{t}) holds for all tk1t\leq k-1. Thus Bt+1wB_{t+1}^{w} is a valid lower bound on ff^{*}, and consequently, BkB_{k} is also a lower bound on ff^{*}. Now by (2.1), we have

𝒙k𝒙2(2/μ)(f(𝒙k)f)(2/μ)(f(𝒙k)Bk)dk2.\lVert{\boldsymbol{x}}_{k}-{\boldsymbol{x}}^{*}\rVert^{2}\leq(2/\mu)(f({\boldsymbol{x}}_{k})-f^{*})\leq(2/\mu)(f({\boldsymbol{x}}_{k})-B_{k})\leq d_{k}^{2}. (B.2)

This implies that 𝒙S𝒫(𝒙k,ηDdk){\boldsymbol{x}}^{*}\in S_{\mathcal{P}}({\boldsymbol{x}}_{k},\frac{\eta}{D}d_{k}) by Lemma 4.3. Therefore, we have completed the proof of the claim.

We now prove the conclusion of Theorem 4.4. The claim established above gives $B_{k}\leq f^{*}$, which confirms the first part of the inequality. By the definition of $d_{k}$ and the claim, we have

f(𝒙k)Bk12μdk2μd022eμ4L(n+1)2η2k.f({\boldsymbol{x}}_{k})-B_{k}\leq\frac{1}{2}\mu d_{k}^{2}\leq\frac{\mu d_{0}^{2}}{2}e^{-\frac{\mu}{4L(n+1)^{2}\eta^{2}}k}.

The proof is thus completed. ∎

B.4 Proof of Theorem 4.5

The proof of Theorem 4.5 relies on the following lemma.

Lemma B.2.

For any 𝛌1,𝛌2S(𝛌x,d)\boldsymbol{\lambda}_{1},\boldsymbol{\lambda}_{2}\in S(\boldsymbol{\lambda}_{x},d), define the two corresponding points:

𝒚j=i=1Nλj(i)𝒗iS𝒫(𝒙,d),j=1,2.{\boldsymbol{y}}_{j}=\sum_{i=1}^{N}\lambda_{j}(i){\boldsymbol{v}}_{i}\in S_{\mathcal{P}}({\boldsymbol{x}},d),\quad j=1,2.

We must have 𝐲1𝐲2(n+1)dD.\|{\boldsymbol{y}}_{1}-{\boldsymbol{y}}_{2}\|\leq(n+1)dD.

Proof.

We note that S𝒫(𝒙,d)S_{\mathcal{P}}({\boldsymbol{x}},d) is compact. Let 𝒱(S)\mathcal{V}(S) denote the set of its vertices. Obviously, we have

𝒚1𝒚2max𝒖1,𝒖2𝒱(S)𝒖1𝒖2.\|{\boldsymbol{y}}_{1}-{\boldsymbol{y}}_{2}\|\leq\max_{{\boldsymbol{u}}_{1},{\boldsymbol{u}}_{2}\in\mathcal{V}(S)}\|{\boldsymbol{u}}_{1}-{\boldsymbol{u}}_{2}\|.

Let ${\boldsymbol{u}}_{1}\in\mathcal{V}(S)$ be a vertex of $S_{\mathcal{P}}({\boldsymbol{x}},d)$. By the separation theorem [28], there exists a vector ${\boldsymbol{c}}_{1}\in\mathbb{R}^{n}$ such that

𝒄1,𝒖1<𝒄1,𝒖for all𝒖𝒱(S){𝒖1}.\langle{\boldsymbol{c}}_{1},\;{\boldsymbol{u}}_{1}\rangle<\langle{\boldsymbol{c}}_{1},\;{\boldsymbol{u}}\rangle\quad\mbox{for all}\ {\boldsymbol{u}}\in\mathcal{V}(S)\setminus\{{\boldsymbol{u}}_{1}\}.

In other words, 𝒖1{\boldsymbol{u}}_{1} is the unique solution of the following problem:

min𝒄1,𝒖s.t.𝒖S𝒫(𝒙,d)𝒫.\min\;\langle{\boldsymbol{c}}_{1},\;{\boldsymbol{u}}\rangle\quad\mbox{s.t.}\quad{\boldsymbol{u}}\in S_{\mathcal{P}}({\boldsymbol{x}},d)\cap\mathcal{P}.

It follows from Lemma 4.2 that ${\boldsymbol{u}}_{1}$ can be represented as

𝒖1=j=1N(λx(j)δj)𝒗j+δ𝒛1for some𝒛1𝒫,{\boldsymbol{u}}_{1}=\sum_{j=1}^{N}(\lambda_{x}(j)-\delta_{j}){\boldsymbol{v}}_{j}+\delta{\boldsymbol{z}}_{1}\quad\mbox{for some}\ {\boldsymbol{z}}_{1}\in\mathcal{P},

where $\delta_{j}:=\min\{\lambda_{x}(j),d\}$ and $\delta:=\sum_{j=1}^{N}\delta_{j}$, both of which are independent of ${\boldsymbol{z}}_{1}$. Similarly, ${\boldsymbol{u}}_{2}$ admits the representation:

𝒖2=j=1N(λx(j)δj)𝒗j+δ𝒛2for some𝒛2𝒫.{\boldsymbol{u}}_{2}=\sum_{j=1}^{N}(\lambda_{x}(j)-\delta_{j}){\boldsymbol{v}}_{j}+\delta{\boldsymbol{z}}_{2}\quad\mbox{for some}\ {\boldsymbol{z}}_{2}\in\mathcal{P}.

Therefore, we have

𝒚1𝒚2\displaystyle\|{\boldsymbol{y}}_{1}-{\boldsymbol{y}}_{2}\| \displaystyle\leq max𝒖1,𝒖2𝒱(S)𝒖1𝒖2\displaystyle\max_{{\boldsymbol{u}}_{1},{\boldsymbol{u}}_{2}\in\mathcal{V}(S)}\|{\boldsymbol{u}}_{1}-{\boldsymbol{u}}_{2}\|
\displaystyle\leq δmax𝒛1,𝒛2𝒫𝒛1𝒛2δD\displaystyle\delta\max_{{\boldsymbol{z}}_{1},{\boldsymbol{z}}_{2}\in\mathcal{P}}\|{\boldsymbol{z}}_{1}-{\boldsymbol{z}}_{2}\|\leq\delta D
=\displaystyle= Di+(𝝀x)δiD|+(𝝀x)|d(n+1)dD,\displaystyle D\sum_{i\in\mathcal{I}_{+}(\boldsymbol{\lambda}_{x})}\delta_{i}\leq D|\mathcal{I}_{+}(\boldsymbol{\lambda}_{x})|d\leq(n+1)dD,

where $\mathcal{I}_{+}(\boldsymbol{\lambda}_{x}):=\left\{i\ |\ \lambda_{x}(i)>0\right\}$ and we have used $|\mathcal{I}_{+}(\boldsymbol{\lambda}_{x})|\leq(n+1)$. ∎

Proof of Theorem 4.5.

We first claim that 𝒙S𝒫(𝒙k,dk){\boldsymbol{x}}^{*}\in S_{\mathcal{P}}({{\boldsymbol{x}}}_{k},{d}_{k}) for any k0k\geq 0 and prove this by induction. From the proof of Theorem 4.4, this is true for k=0k=0. Now suppose that 𝒙S𝒫(𝒙k1,dk1){\boldsymbol{x}}^{*}\in S_{\mathcal{P}}({{\boldsymbol{x}}}_{k-1},{d}_{k-1}) for some k1k\geq 1. Note that the inner loop of Alg. 6 corresponds to the standard Frank-Wolfe algorithm. By Theorem 2.1 and Lemma B.2, we have

f({\boldsymbol{p}}_{j})-f^{*}\leq\frac{2L}{j+1}\left((n+1){d}_{k-1}D\right)^{2}=\frac{2L(n+1)^{2}{d}_{k-1}^{2}D^{2}}{j+1}

holds for all $j\in[J]$. In the case where the inner loop terminates at $j=J$, we obtain

f(𝒙k)f=f(𝒑J)ff(𝒑J)CJμdk2D22η2.f({\boldsymbol{x}}_{k})-f^{*}=f({\boldsymbol{p}}_{J})-f^{*}\leq f({\boldsymbol{p}}_{J})-C_{J}\leq\frac{\mu d_{k}^{2}D^{2}}{2\eta^{2}}.

Similarly, if the inner loop is interrupted due to lines 9-11 of the algorithm, we still have f(𝒙k)ff(𝒑j)Cjμ2ρ2η2dk12D2=μdk2D22η2f({\boldsymbol{x}}_{k})-f^{*}\leq f({\boldsymbol{p}}_{j})-C_{j}\leq\frac{\mu}{2\rho^{2}\eta^{2}}{d}_{k-1}^{2}D^{2}=\frac{\mu d_{k}^{2}D^{2}}{2\eta^{2}}. Using the fact that f(𝒙k)fμ2𝒙k𝒙2f({\boldsymbol{x}}_{k})-f^{*}\geq\frac{\mu}{2}\lVert{\boldsymbol{x}}_{k}-{\boldsymbol{x}}^{*}\rVert^{2}, we have 𝒙k𝒙2dk2D2η2\lVert{\boldsymbol{x}}_{k}-{\boldsymbol{x}}^{*}\rVert^{2}\leq\frac{d_{k}^{2}D^{2}}{\eta^{2}}, which implies via Lemma 4.3 that 𝒙S𝒫(𝒙k,dk){\boldsymbol{x}}^{*}\in S_{\mathcal{P}}({{\boldsymbol{x}}}_{k},{d}_{k}).

We now prove the conclusion of Theorem 4.5. Since $d_{0}=\frac{\eta}{D}\sqrt{\frac{2(f({\boldsymbol{x}}_{0})-B_{0})}{\mu}}$ and $d_{k}=\frac{d_{k-1}}{\rho}$, we have

f({\boldsymbol{x}}_{k})-f^{*}\leq f({\boldsymbol{x}}_{k})-B_{k}\leq\frac{\mu d_{k}^{2}D^{2}}{2\eta^{2}}\leq(f({\boldsymbol{x}}_{0})-B_{0})\rho^{-2k}.

This completes the proof. ∎

Appendix C Properties of Some Common Polytopes

C.1 Carathéodory Representation Examples

Hypercube: When 𝒫\mathcal{P} is a hypercube Bn:={𝒙nxi[0,1],i[n]}B_{n}:=\{{\boldsymbol{x}}\in\mathbb{R}^{n}\mid x_{i}\in[0,1],\forall i\in[n]\}, any point 𝒙Bn{\boldsymbol{x}}\in B_{n} can be naturally represented as

𝒙=i=1n1(xjixji+1)𝒗i+xjn𝟏n+(1xj1)𝟎n,{\boldsymbol{x}}=\sum_{i=1}^{n-1}(x_{j_{i}}-x_{j_{i+1}}){\boldsymbol{v}}_{i}+x_{j_{n}}{\boldsymbol{1}}_{n}+(1-x_{j_{1}}){\boldsymbol{0}}_{n},

where $j_{1},\dots,j_{n}$ is a permutation of $[n]$ such that $x_{j_{1}}\geq\dots\geq x_{j_{n}}$ and ${\boldsymbol{v}}_{i}$ is the vector whose components indexed by $j_{1},\dots,j_{i}$ equal $1$ and whose remaining components equal $0$.
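For concreteness, a minimal NumPy sketch of this construction is given below; it also checks that the returned weights form a convex combination reconstructing ${\boldsymbol{x}}$. The function name and interface are our own and are not part of the algorithms in this paper.

```python
import numpy as np

def hypercube_caratheodory(x):
    """Carathéodory representation of x in [0,1]^n as a convex combination
    of at most n+1 hypercube vertices.  Returns (weights, vertices) with
    x = weights @ vertices, weights >= 0 and weights summing to 1."""
    n = len(x)
    order = np.argsort(-x)          # j_1, ..., j_n with x[j_1] >= ... >= x[j_n]
    xs = x[order]
    weights, vertices = [], []
    for i in range(n - 1):          # vertex with 1's at positions j_1, ..., j_{i+1}
        v = np.zeros(n)
        v[order[:i + 1]] = 1.0
        weights.append(xs[i] - xs[i + 1])
        vertices.append(v)
    weights.append(xs[-1]);      vertices.append(np.ones(n))   # all-ones vertex
    weights.append(1.0 - xs[0]); vertices.append(np.zeros(n))  # origin
    return np.array(weights), np.array(vertices)

x = np.random.rand(6)
w, V = hypercube_caratheodory(x)
assert np.allclose(w @ V, x) and np.isclose(w.sum(), 1.0) and (w >= -1e-12).all()
```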

1\ell_{1}-ball: When 𝒫\mathcal{P} is a 1\ell_{1}-ball Ln:={𝒙ni=1n|xi|1}L_{n}:=\{{\boldsymbol{x}}\in\mathbb{R}^{n}\mid\sum_{i=1}^{n}|x_{i}|\leq 1\}, any point 𝒙Ln{\boldsymbol{x}}\in L_{n} can be naturally represented as

𝒙=i=1n1|xi|(sgn(xi)𝒆i)+(|xn|+sx)(sgn(xn)𝒆n)+sx(sgn(xn)𝒆n),{\boldsymbol{x}}=\sum_{i=1}^{n-1}|x_{i}|(\text{sgn}(x_{i}){\boldsymbol{e}}_{i})+\left(|x_{n}|+s_{x}\right)(\text{sgn}(x_{n}){\boldsymbol{e}}_{n})+s_{x}(-\text{sgn}(x_{n}){\boldsymbol{e}}_{n}),

where $s_{x}=(1-\sum_{i=1}^{n}|x_{i}|)/2$ and $\text{sgn}(x)=1$ if $x\geq 0$ and $-1$ otherwise.
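Analogously, the representation above can be sketched in a few lines of NumPy; again, the function name and interface are our own.

```python
import numpy as np

def l1_ball_caratheodory(x):
    """Carathéodory representation of x with ||x||_1 <= 1 as a convex
    combination of the 2n vertices ±e_i of the l1-ball."""
    n = len(x)
    s = (1.0 - np.abs(x).sum()) / 2.0        # the slack s_x
    sgn = np.where(x >= 0, 1.0, -1.0)        # sgn(0) := 1, as in the text
    weights, vertices = [], []
    for i in range(n - 1):
        weights.append(abs(x[i])); vertices.append(sgn[i] * np.eye(n)[i])
    weights.append(abs(x[-1]) + s); vertices.append(sgn[-1] * np.eye(n)[-1])
    weights.append(s);              vertices.append(-sgn[-1] * np.eye(n)[-1])
    return np.array(weights), np.array(vertices)

x = np.random.randn(6); x *= 0.8 / np.abs(x).sum()   # a point with ||x||_1 = 0.8 < 1
w, V = l1_ball_caratheodory(x)
assert np.allclose(w @ V, x) and np.isclose(w.sum(), 1.0) and (w >= -1e-12).all()
```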

Flow polytope: Let $G$ be a directed acyclic graph (DAG) with vertex set $V$ and edge set $E$ such that $|E|=n$. Let $s,t$ be two vertices in $V$, referred to as the source and target, respectively. The $s$-$t$ flow polytope, here denoted by $\mathcal{F}_{s,t}$, is the set of all unit $s$-$t$ flows in $G$. For any point ${\boldsymbol{x}}\in\mathcal{F}_{s,t}$ and $i\in[n]$, the entry $x_{i}$ represents the amount of flow through edge $i$; the flow vector ${\boldsymbol{x}}$ satisfies the flow conservation constraints at each vertex, ensuring that the flow entering any vertex (except $s$ and $t$) equals the flow leaving it. The extreme points of $\mathcal{F}_{s,t}$ are the extreme unit flows. To find the Carathéodory representation of a given flow ${\boldsymbol{x}}\in\mathcal{F}_{s,t}$, we proceed recursively as follows.

Starting with the flow 𝒙{\boldsymbol{x}}, we repeatedly perform the following steps until 𝒙=𝟎n{\boldsymbol{x}}={\boldsymbol{0}}_{n}:

1. Remove all edges with zero flow from the graph.

2. Identify the edge $i$ corresponding to the smallest non-zero flow in ${\boldsymbol{x}}$, i.e., $i\leftarrow\operatorname*{\arg\min}_{x_{i}>0}x_{i}$.

3. Find the extreme unit flow ${\boldsymbol{v}}$ in the reduced graph that includes edge $i$.

4. Subtract $x_{i}{\boldsymbol{v}}$ from the current flow, i.e., ${\boldsymbol{x}}\leftarrow{\boldsymbol{x}}-x_{i}{\boldsymbol{v}}$.

Since each operation eliminates at least one non-zero entry of the current flow, the loop terminates after at most $n$ steps. As a result, we obtain a Carathéodory representation of ${\boldsymbol{x}}$. This algorithm can be implemented in $O(n^{2})$ time when the graph is represented using sparsely structured adjacency matrices.
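A minimal Python sketch of this greedy decomposition is given below. It assumes the DAG is supplied as an edge list and that ${\boldsymbol{x}}$ is a feasible unit $s$-$t$ flow; the path-extraction logic used for steps 2-3 is one straightforward choice among several, and the names and interface are our own.

```python
import numpy as np

def flow_caratheodory(x, edges, s, t, tol=1e-12):
    """Greedy decomposition of a unit s-t flow x into extreme unit flows
    (indicator vectors of s-t paths), following the four steps above.
    `edges` is a list of (tail, head) pairs; x[i] is the flow on edges[i].
    Returns (weights, unit_flows) with x = weights @ unit_flows."""
    x = np.array(x, dtype=float)
    n = len(edges)
    weights, flows = [], []
    while x.max() > tol:
        active = [i for i in range(n) if x[i] > tol]     # step 1: drop zero-flow edges
        i_min = min(active, key=lambda i: x[i])          # step 2: smallest positive flow
        # step 3: grow an s-t path through edge i_min using only active edges;
        # flow conservation guarantees such a path exists, and every active
        # edge carries at least x[i_min] units of flow.
        path = [i_min]
        u = edges[i_min][0]
        while u != s:                                    # extend backwards to the source
            j = next(i for i in active if edges[i][1] == u)
            path.append(j); u = edges[j][0]
        u = edges[i_min][1]
        while u != t:                                    # extend forwards to the target
            j = next(i for i in active if edges[i][0] == u)
            path.append(j); u = edges[j][1]
        v = np.zeros(n); v[path] = 1.0                   # the extreme unit flow
        weights.append(x[i_min]); flows.append(v)
        x = x - x[i_min] * v                             # step 4: subtract and repeat
    return np.array(weights), np.array(flows)

# Example on a diamond graph with s = 0, t = 3:
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
w, F = flow_caratheodory([0.3, 0.7, 0.3, 0.7], edges, s=0, t=3)
assert np.allclose(w @ F, [0.3, 0.7, 0.3, 0.7]) and np.isclose(w.sum(), 1.0)
```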

C.2 Quantities of Some Common Polytopes

Hypercube: The diameter of BnB_{n} is given by

D(Bn)=max𝒙,𝒚Bn𝒙𝒚=𝟏n𝟎n=n.D(B_{n})=\max_{{\boldsymbol{x}},{\boldsymbol{y}}\in B_{n}}\|{\boldsymbol{x}}-{\boldsymbol{y}}\|=\|{\boldsymbol{1}}_{n}-{\boldsymbol{0}}_{n}\|=\sqrt{n}.

Since BnB_{n} can be represented as

Bn={𝒙n(InIn)𝒙(𝟏n𝟎n)},B_{n}=\left\{{\boldsymbol{x}}\in\mathbb{R}^{n}\mid\left(\begin{array}[]{c}I_{n}\\ -I_{n}\end{array}\right){\boldsymbol{x}}\leq\left(\begin{array}[]{c}{\boldsymbol{1}}_{n}\\ {\boldsymbol{0}}_{n}\end{array}\right)\right\},

it follows from the definition in Subsection 2.3 that ξ(Bn)=1\xi(B_{n})=1 and ψ(Bn)=1\psi(B_{n})=1. Thus, the quantity of BnB_{n} is

η(Bn)=ψ(Bn)D(Bn)/ξ(Bn)=n.\eta(B_{n})={\psi(B_{n})D(B_{n})}/{\xi(B_{n})}=\sqrt{n}.

1\ell_{1}-ball: The diameter of LnL_{n} is given by

D(Ln)=max𝒙,𝒚Ln𝒙𝒚=𝒆1(𝒆1)=2.D(L_{n})=\max_{{\boldsymbol{x}},{\boldsymbol{y}}\in L_{n}}\|{\boldsymbol{x}}-{\boldsymbol{y}}\|=\|{\boldsymbol{e}}_{1}-(-{\boldsymbol{e}}_{1})\|=2.

Note that LnL_{n} can be described by the linear inequalities system Ln={𝒙nA2𝒙1n𝟏2n}L_{n}=\{{\boldsymbol{x}}\in\mathbb{R}^{n}\mid A_{2}{\boldsymbol{x}}\leq\frac{1}{\sqrt{n}}{\boldsymbol{1}}_{2^{n}}\}, where A22n×nA_{2}\in\mathbb{R}^{2^{n}\times n} is a matrix whose entries are either ±1n\pm\frac{1}{\sqrt{n}} and whose rows all have unit 2\ell_{2} norm. Following the definition in Subsection 2.3, we have ξ(Ln)=2n\xi(L_{n})=\frac{2}{\sqrt{n}} and

1n=maxM𝔸(Ln)1nMFψ(Ln)=maxM𝔸(Ln)MmaxM𝔸(Ln)MF=n,\frac{1}{\sqrt{n}}=\max_{M\in\mathbb{A}(L_{n})}\frac{1}{\sqrt{n}}\|M\|_{F}\leq\psi(L_{n})=\max_{M\in\mathbb{A}(L_{n})}\|M\|\leq\max_{M\in\mathbb{A}(L_{n})}\|M\|_{F}=\sqrt{n},

where F\|\cdot\|_{F} denotes the Frobenius norm. Thus, the quantity of LnL_{n} can be estimated as

η(Ln)=ψ(Ln)D(Ln)/ξ(Ln)[1,n].\eta(L_{n})={\psi(L_{n})D(L_{n})}/{\xi(L_{n})}\in[1,n].

Flow Polytope: For every two extreme unit flows 𝒙1,𝒙2𝒱(s,t){\boldsymbol{x}}_{1},{\boldsymbol{x}}_{2}\in\mathcal{V}(\mathcal{F}_{s,t}), since 𝒱(s,t){0,1}n\mathcal{V}(\mathcal{F}_{s,t})\subseteq\{0,1\}^{n}, we have 𝒙1𝒙2n\|{\boldsymbol{x}}_{1}-{\boldsymbol{x}}_{2}\|\leq\sqrt{n}. Thus, the diameter of s,t\mathcal{F}_{s,t} can be estimated as

D(s,t)=max𝒙1,𝒙2𝒱(s,t)𝒙1𝒙2n.D(\mathcal{F}_{s,t})=\max_{{\boldsymbol{x}}_{1},{\boldsymbol{x}}_{2}\in\mathcal{V}(\mathcal{F}_{s,t})}\|{\boldsymbol{x}}_{1}-{\boldsymbol{x}}_{2}\|\leq\sqrt{n}.

When representing s,t\mathcal{F}_{s,t} using a system of linear equations and inequalities, the inequality constraints are given by In𝒙𝟎n-I_{n}{\boldsymbol{x}}\leq{\boldsymbol{0}}_{n}. Thus, by definition, we have ξ(s,t)=ψ(s,t)=1\xi(\mathcal{F}_{s,t})=\psi(\mathcal{F}_{s,t})=1, leading to

η(s,t)=D(s,t)n.\eta(\mathcal{F}_{s,t})=D(\mathcal{F}_{s,t})\leq\sqrt{n}.

It is worth noting that in the numerical experiment in Subsection 5.2, we set $\eta=D=\sqrt{66}$ while the dimension is $n=660$. This choice stems from the specific characteristics of the dataset used in [22, 15]. Specifically, we observe that each extreme unit flow contains exactly $33$ entries equal to $1$, with the remaining entries being $0$. Consequently, we can estimate $\eta=D\leq\sqrt{66}$.

Appendix D Backtracking Details

The routine for estimating the local parameters $L,\mu$ and the step size $\delta$ is given below. For rSFW with the simple step-size rule, one may use only the estimates of $L$ and $\mu$ without applying the corresponding step size.

Algorithm 7 Backtracking-Routine$({\boldsymbol{x}},{\boldsymbol{d}},L,\mu,\delta_{\text{max}})$
Require: iterate ${\boldsymbol{x}}\in\mathcal{P}$, update direction ${\boldsymbol{d}}$, previous estimates $L$ and $\mu$, maximum step size $\delta_{\text{max}}$.
1:  Choose $\tau_{1}>1$, $\tau_{2}\leq 1$.
2:  $L\leftarrow\tau_{2}L$, $\mu\leftarrow\mu/\tau_{2}$
3:  $\delta\leftarrow\min\left\{\frac{\langle-\nabla f({\boldsymbol{x}}),{\boldsymbol{d}}\rangle}{L\lVert{\boldsymbol{d}}\rVert^{2}},\delta_{\text{max}}\right\}$
4:  while $f({\boldsymbol{x}}+\delta{\boldsymbol{d}})>f({\boldsymbol{x}})+\delta\langle\nabla f({\boldsymbol{x}}),{\boldsymbol{d}}\rangle+\frac{\delta^{2}L}{2}\lVert{\boldsymbol{d}}\rVert^{2}$ do
5:   $L\leftarrow\tau_{1}L$
6:   $\mu\leftarrow\min\left\{\frac{2(f({\boldsymbol{x}}+\delta{\boldsymbol{d}})-f({\boldsymbol{x}})-\delta\langle\nabla f({\boldsymbol{x}}),{\boldsymbol{d}}\rangle)}{\delta^{2}\lVert{\boldsymbol{d}}\rVert^{2}},\mu\right\}$
7:   $\delta\leftarrow\min\left\{\frac{\langle-\nabla f({\boldsymbol{x}}),{\boldsymbol{d}}\rangle}{L\lVert{\boldsymbol{d}}\rVert^{2}},\delta_{\text{max}}\right\}$
8:  end while
Output: $\delta$, $L$, $\mu$.
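For reference, a minimal Python rendering of this routine is sketched below. The signature (objective $f$ and gradient grad_f passed as callables) and the default values of $\tau_{1},\tau_{2}$ are our own choices, and ${\boldsymbol{d}}$ is assumed to be a descent direction so that $\langle-\nabla f({\boldsymbol{x}}),{\boldsymbol{d}}\rangle\geq 0$.

```python
import numpy as np

def backtracking_routine(f, grad_f, x, d, L, mu, delta_max, tau1=2.0, tau2=0.9):
    """Sketch of the backtracking routine: adaptively estimates the local
    smoothness constant L, a local curvature estimate mu, and a step size
    delta along the update direction d."""
    L, mu = tau2 * L, mu / tau2                      # optimistic rescaling (line 2)
    fx, slope, d_sq = f(x), grad_f(x) @ d, d @ d     # slope = <grad f(x), d> <= 0
    delta = min(-slope / (L * d_sq), delta_max)      # line 3
    # line 4: backtrack until the quadratic upper bound along d holds
    while f(x + delta * d) > fx + delta * slope + 0.5 * delta**2 * L * d_sq:
        L = tau1 * L                                 # line 5: increase smoothness estimate
        curv = 2.0 * (f(x + delta * d) - fx - delta * slope) / (delta**2 * d_sq)
        mu = min(curv, mu)                           # line 6: shrink curvature estimate
        delta = min(-slope / (L * d_sq), delta_max)  # line 7: recompute step size
    return delta, L, mu
```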

References

  • [1] Beck, A., Teboulle, M.: A conditional gradient method with linear rate of convergence for solving convex linear systems. Mathematical Methods of Operations Research 59, 235–247 (2004)
  • [2] Gärtner, B., Jaggi, M.: Optimization for machine learning. Lecture Notes CS-439, ETH, Spring 2023 (2023)
  • [3] Bomze, I.M., Rinaldi, F., Zeffiro, D.: Frank–wolfe and friends: a journey into projection-free first-order optimization methods. 4OR 19(3), 313–345 (2021)
  • [4] Braun, G., Carderera, A., Combettes, C.W., Hassani, H., Karbasi, A., Mokhtari, A., Pokutta, S.: Conditional gradient methods. arXiv preprint arXiv:2211.14103 (2022)
  • [5] Chandrasekaran, V., Recht, B., Parrilo, P.A., Willsky, A.S.: The convex geometry of linear inverse problems. Foundations of Computational mathematics 12(6), 805–849 (2012)
  • [6] Clarkson, K.L.: Coresets, sparse greedy approximation, and the frank-wolfe algorithm. ACM Transactions on Algorithms (TALG) 6(4), 1–30 (2010)
  • [7] Combettes, C.W., Pokutta, S.: Complexity of linear minimization and projection on some sets. Operations Research Letters 49(4), 565–571 (2021)
  • [8] Condat, L.: Fast projection onto the simplex and the $\ell_{1}$ ball. Mathematical Programming 158(1), 575–585 (2016)
  • [9] Damla Ahipasaoglu, S., Sun, P., Todd, M.J.: Linear convergence of a modified frank–wolfe algorithm for computing minimum-volume enclosing ellipsoids. Optimisation Methods and Software 23(1), 5–19 (2008)
  • [10] Frank, M., Wolfe, P., et al.: An algorithm for quadratic programming. Naval research logistics quarterly 3(1-2), 95–110 (1956)
  • [11] Freund, R.M., Grigas, P.: New analysis and results for the frank–wolfe method. Mathematical Programming 155(1), 199–230 (2016)
  • [12] Garber, D.: Faster projection-free convex optimization over the spectrahedron. Advances in Neural Information Processing Systems 29 (2016)
  • [13] Garber, D., Hazan, E.: Playing non-linear games with linear oracles. In: 2013 IEEE 54th annual symposium on foundations of computer science, pp. 420–428. IEEE (2013)
  • [14] Garber, D., Hazan, E.: A linearly convergent variant of the conditional gradient algorithm under strong convexity, with applications to online and stochastic optimization. SIAM Journal on Optimization 26(3), 1493–1528 (2016)
  • [15] Garber, D., Wolf, N.: Frank-wolfe with a nearest extreme point oracle. In: Conference on Learning Theory, pp. 2103–2132. PMLR (2021)
  • [16] Guélat, J., Marcotte, P.: Some comments on wolfe’s ‘away step’. Mathematical Programming 35(1), 110–119 (1986)
  • [17] Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L.A.: Feature extraction: foundations and applications, vol. 207. Springer (2008)
  • [18] Hazan, E.: Sparse approximate solutions to semidefinite programs. In: Latin American symposium on theoretical informatics, pp. 306–316. Springer (2008)
  • [19] Jaggi, M.: Revisiting frank-wolfe: Projection-free sparse convex optimization. In: International conference on machine learning, pp. 427–435. PMLR (2013)
  • [20] Jaggi, M., Sulovský, M., et al.: A simple algorithm for nuclear norm regularized problems. In: Proceedings of the 27th international conference on machine learning (ICML-10), pp. 471–478 (2010)
  • [21] Joulin, A., Tang, K., Fei-Fei, L.: Efficient image and video co-localization with frank-wolfe algorithm. In: D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (eds.) Computer Vision – ECCV 2014, pp. 253–268. Springer International Publishing, Cham (2014)
  • [22] Lacoste-Julien, S., Jaggi, M.: On the global linear convergence of frank-wolfe optimization variants. Advances in neural information processing systems 28 (2015)
  • [23] Lan, G.: The complexity of large-scale convex programming under a linear optimization oracle. arXiv preprint arXiv:1309.5550 (2013)
  • [24] Lan, G.: First-order and stochastic optimization methods for machine learning, vol. 1. Springer (2020)
  • [25] Levitin, E.S., Polyak, B.T.: Constrained minimization methods. USSR Computational mathematics and mathematical physics 6(5), 1–50 (1966)
  • [26] Pedregosa, F., Negiar, G., Askari, A., Jaggi, M.: Linearly convergent frank-wolfe with backtracking line-search. In: International conference on artificial intelligence and statistics, pp. 1–10. PMLR (2020)
  • [27] Pokutta, S.: The frank-wolfe algorithm: a short introduction. Jahresbericht der Deutschen Mathematiker-Vereinigung 126(1), 3–35 (2024)
  • [28] Rockafellar, R.T.: Convex analysis, vol. 11. Princeton university press (1997)
  • [29] Végh, L.A.: Strongly polynomial algorithm for a class of minimum-cost flow problems with separable convex objectives. In: Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pp. 27–40 (2012)