Ringleader ASGD: The First Asynchronous SGD with Optimal Time Complexity under Data Heterogeneity

Artavazd Maranjyan, Peter Richtárik
King Abdullah University of Science and Technology (KAUST)
{arto.maranjyan,richtarik}@gmail.com
Abstract

Asynchronous stochastic gradient methods are central to scalable distributed optimization, particularly when devices differ in computational capabilities. Such settings arise naturally in federated learning, where training takes place on smartphones and other heterogeneous edge devices. In addition to varying computation speeds, these devices often hold data from different distributions. However, existing asynchronous SGD methods struggle in such heterogeneous settings and face two key limitations. First, many rely on unrealistic assumptions of similarity across workers’ data distributions. Second, methods that relax this assumption still fail to achieve theoretically optimal performance under heterogeneous computation times. We introduce Ringleader ASGD, the first asynchronous SGD algorithm that attains the theoretical lower bounds for parallel first-order stochastic methods in the smooth nonconvex regime, thereby achieving optimal time complexity under data heterogeneity and without restrictive similarity assumptions. Our analysis further establishes that Ringleader ASGD remains optimal under arbitrary and even time-varying worker computation speeds, closing a fundamental gap in the theory of asynchronous optimization.

1 Introduction

Modern machine learning increasingly depends on large-scale distributed training across clusters with hundreds or even thousands of GPUs (Shoeybi et al., 2019; Brown et al., 2020; Narayanan et al., 2021). However, classical synchronous training methods struggle to scale in these settings, as device failures, network instabilities, and synchronization overheads introduce significant inefficiencies (Chen et al., 2016; Grattafiori et al., 2024). These issues become even more pronounced in environments with heterogeneous computational power, such as Federated Learning (FL), where devices range from high-end datacenter GPUs to resource-constrained edge hardware (Konečný et al., 2016; McMahan et al., 2016; Li et al., 2020; Kairouz et al., 2021). Because synchronous methods are bottlenecked by the slowest participants, faster devices remain idle, leading to severe underutilization of computational resources when stragglers—nodes slowed down by computation or communication—lag significantly behind.

One way to reduce synchronization bottlenecks is to equip data centers with homogeneous GPUs. However, this approach is prohibitively expensive and difficult to scale: upgrading to faster GPUs would require replacing all devices simultaneously, since heterogeneous hardware cannot be combined efficiently. Even then, homogeneity does not eliminate synchronization issues, as hardware failures and device dropouts during training still cause stragglers and idle time. Moreover, this solution applies only to controlled datacenter environments and is infeasible in FL, where edge devices are outside the server’s control.

A more promising approach is to shift from hardware solutions to algorithmic ones by adopting asynchronous optimization methods. These methods remove the need for synchronization, allowing fast workers to contribute updates without waiting for slower ones (Tsitsiklis et al., 1986; Recht et al., 2011; Agarwal & Duchi, 2011; Dean et al., 2012; Li et al., 2014). Despite their appeal, asynchronous methods are more difficult to analyze. In particular, a meaningful analysis would require studying time to convergence, rather than iteration complexity only. While iteration complexity is the traditional metric in optimization, it does not necessarily reflect real-world training speed in parallel settings: a method that performs more iterations may finish faster in wall-clock time if those iterations can be computed without waiting for slow workers. This distinction raises a fundamental question: among all parallel methods, which ones are provably fastest in theory? To make this question precise, we restrict our attention to smooth nonconvex problems and to stochastic first-order methods, encompassing algorithms with or without synchronization. This will be the only setting considered in this paper.

Table 1: Comparison of time complexities for parallel first-order methods under the fixed computation time model, where each worker $i$ takes a fixed time $\tau_i$ to compute a stochastic gradient, with the times ordered so that $\tau_n$ is the largest (2). We denote by $\tau_{\mathrm{avg}} := \frac{1}{n}\sum_{i=1}^{n}\tau_i$ the average computation time across all workers. The table shows how the time complexity of each algorithm depends on key problem parameters: the initial function suboptimality $\Delta := f(x^0) - f^*$ (Section 2.3), the target stationarity $\varepsilon$, the variance bound of the stochastic gradients $\sigma^2$ (Section 2.3), and smoothness constants. Specifically, $L_f$ is the smoothness constant of $f$ (Section 2.3); $L_{\max} := \max_{i\in[n]} L_{f_i}$, where $L_{f_i}$ is the smoothness constant of $f_i$; and $L$ is the constant associated with our new smoothness-type assumption (Section 2.3). These constants satisfy $L_f \leq L \leq L_{\max}$ (Section 2.3). All stated time complexities hide universal constant factors.
Each column indicates whether a method satisfies the following desirable properties. Optimal: achieves the theoretical lower bound derived by Tyurin & Richtárik (2024) for parallel first-order stochastic methods in the heterogeneous data setting. No sync.: does not require synchronization and is therefore asynchronous. No idle workers: all workers remain busy without waiting, so computational resources are fully utilized. No discarded work: no computation is wasted, and no worker is stopped mid-computation. Our new method, Ringleader ASGD, is the first asynchronous method to achieve optimal time complexity while also ensuring full resource utilization (no idle workers) and no discarded work.
Method | Time Complexity | Optimal | No sync. | No idle workers | No discarded work
Naive Minibatch SGD (Section 3.1) | $\frac{L_{f}\Delta}{\varepsilon}\left(\tau_{n}+\tau_{n}\frac{\sigma^{2}}{n\varepsilon}\right)$ | ✗ | ✗ | ✗ | ✓
IA2SGD (Wang et al., 2025) (Appendix D) | $\frac{L_{\max}\Delta}{\varepsilon}\left(\tau_{n}+\tau_{n}\frac{\sigma^{2}}{n\varepsilon}\right)$ (†) | ✗ | ✓ | ✓ | ✓
Malenia SGD (Tyurin & Richtárik, 2024) | $\frac{L_{f}\Delta}{\varepsilon}\left(\tau_{n}+\tau_{\mathrm{avg}}\frac{\sigma^{2}}{n\varepsilon}\right)$ | ✓ | ✗ | ✓ | ✗
Ringleader ASGD (new) (Algorithm 1; Section 5.2) | $\frac{L\Delta}{\varepsilon}\left(\tau_{n}+\tau_{\mathrm{avg}}\frac{\sigma^{2}}{n\varepsilon}\right)$ (‡) | ✓ | ✓ | ✓ | ✓
(†) The analysis presented by Wang et al. (2025) is carried out under the assumption that each $f_i$ is smooth with the same smoothness constant, which is equivalent to requiring that each $f_i$ is $L_{f_i}$–smooth and then using the upper bound $L_{\max}$ for all $L_{f_i}$. However, this strict assumption is not necessary: they could instead use our relaxed smoothness-type assumption (Section 2.3), in which case the constant improves to $L$, and their analysis remains unchanged.

(‡) The time complexities of Ringleader ASGD and Malenia SGD differ only in the smoothness constant. Since Malenia SGD is optimal, Ringleader ASGD is also optimal whenever $L$ exceeds $L_f$ by at most a universal constant factor, that is, $L = \mathcal{O}(L_f)$.

Recently, Tyurin & Richtárik (2024) studied this very regime, where they derived lower bounds. They then proposed two algorithms: Rennala SGD, designed for the homogeneous data setting, where all workers draw samples from the same distribution, and Malenia SGD, for the heterogeneous data setting, where data distributions differ across workers. They showed that both methods are optimal—achieving the lower bounds—and, perhaps surprisingly, both are synchronous (they periodically synchronize the workers). The key idea in both is to fully utilize the available computational resources by keeping workers continuously busy: each worker computes independently, and synchronization occurs only after a sufficient number of gradient computations have been accumulated.

At first, the result of Tyurin & Richtárik (2024) suggested a rather pessimistic outlook for asynchronous methods: despite their practical appeal, they showed that existing asynchronous methods are not optimal and that the method achieving the lower bound is synchronous. This created the view that optimality is inherently tied to synchronization. However, this view was overturned by Maranjyan et al. (2025d), who, in the homogeneous data setting, introduced Ringmaster ASGD—the first asynchronous SGD method to achieve the same optimal time complexity as the synchronous Rennala SGD. Although both methods share the same theoretical guarantees, Ringmaster ASGD can be faster than Rennala SGD in practice, since it avoids synchronization and benefits from more frequent updates.

Nevertheless, the work of Maranjyan et al. (2025d) established optimality in the homogeneous data setting only. The question of whether some variant of a parallel method that does not rely on synchronization (i.e., is asynchronous) can also be optimal in the more general heterogeneous data setting remained open. In this work, we close this gap and answer the question affirmatively.

The heterogeneous data setting is both important and practically relevant. In FL, for instance, such heterogeneity arises naturally as participants hold distinct datasets (Zhao et al., 2018; Li et al., 2020; Tan et al., 2022). Yet this setting is significantly more challenging than the homogeneous one. The standard philosophy of asynchronous SGD—updating the model after every gradient computation—can be harmful here: fast workers contribute updates more frequently, causing the optimization process to become biased toward their local data. To mitigate this, most existing asynchronous methods assume similarity across client distributions (Mishchenko et al., 2022; Koloskova et al., 2022; Nguyen et al., 2022; Islamov et al., 2024). While this assumption simplifies the analysis, it is often unrealistic in practice, where clients may involve fundamentally different populations (e.g., hospitals with distinct demographics, mobile users in different countries, or financial institutions under varied regulations).

Recent work by Wang et al. (2025) took an important step toward removing these restrictive assumptions by proposing Incremental Aggregated Asynchronous SGD (IA2SGD), a method that provably converges without similarity assumptions. However, their method achieves the same time complexity as standard Naive Minibatch SGD (see the first two rows of Table 1)—the simplest synchronous SGD baseline, which waits to collect one gradient from each worker before every update—thus failing to provide the computational advantages that motivate asynchronous approaches in the first place.

To the best of our knowledge, the only method proven to be optimal in the heterogeneous data setting is the synchronous algorithm Malenia SGD of Tyurin & Richtárik (2024), which, notably, does not rely on similarity assumptions. However, synchronization is a major bottleneck in practice: although synchronous and asynchronous methods can share the same theoretical complexity, asynchronous methods are often faster in practice because they avoid costly synchronization and benefit from more frequent updates, as demonstrated in the homogeneous case by Maranjyan et al. (2025d).

This raises a fundamental question: Is it possible to design an asynchronous method that requires no similarity assumptions while still achieving optimal time complexity? In this paper, we answer this question affirmatively by introducing Ringleader ASGD, the first asynchronous SGD method that achieves optimal time complexity in the heterogeneous data setting. (Throughout the paper, we refer to our method as optimal. Formally, this holds whenever the constant $L$, associated with our new smoothness-type assumption (Section 2.3), is at most a constant factor larger than the smoothness constant $L_f$ appearing in the lower bounds of Tyurin & Richtárik (2024); see Table 1 for details.) Importantly, Ringleader ASGD attains this without relying on restrictive similarity assumptions.

1.1 Contributions

Our main contributions are the following:

  • Optimal asynchronous SGD under data heterogeneity. We prove that Ringleader ASGD (Algorithm 1) is, to the best of our knowledge, the first asynchronous method in the heterogeneous setting under the fixed computation model (2) that matches the lower bounds for parallel methods established by Tyurin & Richtárik (2024) (Table 1). Crucially, it does not rely on any similarity assumptions across clients’ datasets.

  • Additional useful properties. Beyond achieving optimal time complexity, our method Ringleader ASGD satisfies two additional desirable properties: (i) all workers remain continuously active (no idle workers), and (ii) every computed gradient is incorporated into the update (no discarded work). These properties are crucial in practice, as they ensure maximum resource utilization: all workers contribute at all times, and their computations are never wasted. Table 1 compares Ringleader ASGD against benchmark algorithms with respect to these properties.

  • Parameter-free design. In contrast to the optimal synchronous method Malenia SGD (Tyurin & Richtárik, 2024), which requires prior knowledge of the gradient variance bound and target accuracy, our method operates in the fixed computation time model without such parameters (except for the stepsize, needed only to match the optimal theoretical rate). This makes it far more practical for real-world deployments, where these quantities are typically unknown or difficult to estimate. The same parameter-free improvement can also be extended to Malenia SGD, as we discuss in Appendix E.

  • Universal computation model. In Appendix B, we extend our analysis beyond the fixed computation time model to the general setting of arbitrarily varying computation times, accommodating virtually any computational behavior, including stochastic or adversarial patterns, while retaining optimal time complexity.

  • Empirical validation. In Section 6, we evaluate Ringleader ASGD against benchmark methods on illustrative toy problems. The results validate our theoretical findings and demonstrate clear practical advantages over the baselines.

2 Problem Setup

We consider a distributed learning setting with $n$ workers, where each worker $i$ possesses its own local data distribution $\mathcal{D}_i$. Our goal is to solve the following distributed optimization problem:

\mathop{\mathrm{minimize}}_{x\in\mathbb{R}^{d}}\left\{f(x):=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x)\right\},\quad\text{where}\quad f_{i}(x):={\mathbb{E}}_{\xi_{i}\sim\mathcal{D}_{i}}\left[f_{i}(x;\xi_{i})\right]. \qquad (1)

Here $f_i\colon\mathbb{R}^d\to\mathbb{R}$ denotes the local objective of worker $i$, defined as the expectation of the sample loss $f_i(x;\xi_i)$ over data points $\xi_i$ drawn from its local distribution $\mathcal{D}_i$.

2.1 Worker Heterogeneity Model

We first focus on the case where workers have constant computation speeds, as this setting is more intuitive and serves as a foundational model for understanding the dynamics of asynchronous distributed optimization. The extension to arbitrary computation times is presented in Appendix B.

Following the fixed computation model (Mishchenko et al., 2022), we formalize:

Each worker $i$ requires $\tau_i$ seconds to compute one stochastic gradient $\nabla f_i(x;\xi_i)$. \qquad (2)
Without loss of generality, assume $0 < \tau_1 \leq \tau_2 \leq \cdots \leq \tau_n$.

(One could alternatively assume that each worker requires at most $\tau_i$ seconds. Under this formulation, all of our upper bounds would still hold; however, the lower bound would no longer be valid. For this reason, we adopt the assumption that each worker requires exactly $\tau_i$ seconds.)

We assume communication is infinitely fast (taking 0 seconds), both from workers to the server and from the server to workers. (Alternatively, one could define $\tau_i$ as the time required for a worker to both compute a gradient and communicate it to the server, while keeping server-to-worker communication infinitely fast. Our upper bounds would still hold under this formulation, but the lower bounds would no longer apply, so we use the simpler model.) This is a modeling choice—arguably the simplest one—and has been the standard assumption in prior work (Mishchenko et al., 2022; Koloskova et al., 2022; Tyurin & Richtárik, 2024; Maranjyan et al., 2025d), even if not always stated explicitly. A related study by Tyurin et al. (2024b) considers the case where communication is non-negligible and proposes techniques to address it.

Finally, we denote by $\tau_{\mathrm{avg}} := \frac{1}{n}\sum_{i=1}^{n}\tau_i$ the average computation time across all workers.

2.2 Notations

We denote the standard inner product in $\mathbb{R}^d$ by

\left\langle x,y\right\rangle=\sum_{i=1}^{d}x_{i}y_{i}~,

and the corresponding Euclidean norm by $\|x\| := \sqrt{\langle x,x\rangle}$. We use $[n] := \{1,2,\dots,n\}$ to denote the index set, and $\mathbb{E}[\cdot]$ for mathematical expectation. For functions $\phi,\psi\colon\mathcal{Z}\to\mathbb{R}$, we write $\phi=\mathcal{O}(\psi)$ if there exists a constant $C>0$ such that $\phi(z)\leq C\psi(z)$ for all $z\in\mathcal{Z}$.

2.3 Assumptions

We consider the following standard assumptions for the nonconvex setting.

Assumption (Unbiasedness and bounded variance). For each $i\in[n]$ and every $\xi$, the function $f_i(x;\xi)$ is differentiable with respect to its first argument $x$. Moreover, the stochastic gradients are unbiased and have bounded variance $\sigma^2 \geq 0$, that is,

{\mathbb{E}}_{\xi_{i}\sim\mathcal{D}_{i}}\left[\nabla f_{i}(x;\xi_{i})\right]=\nabla f_{i}(x),\quad\forall x\in\mathbb{R}^{d},\;\;\forall i\in[n]~,
{\mathbb{E}}_{\xi_{i}\sim\mathcal{D}_{i}}\left[\left\|\nabla f_{i}(x;\xi_{i})-\nabla f_{i}(x)\right\|^{2}\right]\leq\sigma^{2},\quad\forall x\in\mathbb{R}^{d},\;\;\forall i\in[n]~.

Assumption (Smoothness-type condition). Each function $f_i$ is differentiable. There exists a constant $L>0$ such that, for all $x\in\mathbb{R}^d$ and $y_1,\dots,y_n\in\mathbb{R}^d$,

\left\|\nabla f(x)-\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(y_{i})\right\|^{2}\leq\frac{L^{2}}{n}\sum_{i=1}^{n}\left\|x-y_{i}\right\|^{2}~.

Recall the standard definition of smoothness.

Definition (Smoothness). A differentiable function $\phi\colon\mathbb{R}^d\to\mathbb{R}$ is called $L_\phi$–smooth if

\left\|\nabla\phi(x)-\nabla\phi(y)\right\|\leq L_{\phi}\left\|x-y\right\|,\quad\forall x,y\in\mathbb{R}^{d}~.

By convention, $L_\phi$ denotes the smallest such constant. Note that our smoothness-type assumption is stronger than requiring $f$ itself to be $L_f$–smooth, yet weaker than requiring all $f_i$ to be $L_{f_i}$–smooth. The following lemma establishes the relation among the constants $L_f$, $L$, and $L_{f_i}$.

Lemma (Smoothness Bounds; Proof in Section C.1). Suppose the smoothness-type assumption holds with constant $L>0$. Then $f$ is $L_f$–smooth with $L_f \leq L$. Moreover, if each $f_i$ is $L_{f_i}$–smooth, then the assumption is satisfied, and we have

L_{f}\;\leq\;L\;\leq\;\sqrt{\frac{1}{n}\sum_{i=1}^{n}L_{f_{i}}^{2}}\;\leq\;\max_{i\in[n]}L_{f_{i}}=:L_{\max}~.
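To make the gap between these constants concrete, consider the following worked example, which we add for illustration (it is not part of the original analysis): take $n=2$ with $f_1(x) = x^2/2$ and $f_2(x) = 3x^2/2$ on $\mathbb{R}$. Then

L_{f_1}=1,\qquad L_{f_2}=3,\qquad L_f=2,\qquad L=\sqrt{\tfrac{1}{2}\left(1^2+3^2\right)}=\sqrt{5}\approx 2.24,\qquad L_{\max}=3,

so $L_f < L < L_{\max}$ with strict inequalities; one can check via the equality case of the Cauchy–Schwarz inequality that $\sqrt{5}$ is the smallest constant for which the smoothness-type condition holds in this example.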

Finally, if all $f_i$ are identical, i.e., $f_i = f$ for all $i\in[n]$, then $L = L_f$. To the best of our knowledge, prior work on asynchronous SGD under data heterogeneity has always assumed smoothness of each $f_i$, including the works of Koloskova et al. (2022), Nguyen et al. (2022), and Wang et al. (2025).

Assumption (Lower bound). There exists $f^* > -\infty$ such that $f(x) \geq f^*$ for all $x\in\mathbb{R}^d$. We define $\Delta := f(x^0) - f^*$, where $x^0$ is the starting point of the optimization methods.

Under these assumptions, our objective is to find an $\varepsilon$–stationary point—a (possibly random) vector $x$ satisfying $\mathbb{E}\left[\|\nabla f(x)\|^2\right] \leq \varepsilon$.

3 Background and Motivation

In this section, we review relevant existing methods in distributed optimization and discuss their limitations to motivate our algorithmic design. We begin with Naive Minibatch SGD as the canonical synchronous baseline, then consider Malenia SGD (Tyurin & Richtárik, 2024), the first synchronous SGD method to attain optimal time complexity. We then turn to the challenges of asynchronous approaches, focusing on the recent IA2SGD (Wang et al., 2025) and its limitations. Finally, we outline how these limitations can be addressed and introduce the core idea behind our algorithm.

3.1 Naive Minibatch SGD

Naive Minibatch SGD provides the most straightforward approach to solving problem (1) in a distributed setting.

Algorithm Description.

At each iteration $k$, the algorithm performs the following update:

x^{k+1}=x^{k}-\gamma\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k};\xi_{i}^{k}\right).

The algorithm operates synchronously: at each iteration, the server waits to receive one stochastic gradient from each of the $n$ workers, all of which are evaluated at the current iterate $x^k$. Once all gradients are collected, the server constructs an unbiased minibatch estimator of the full gradient by averaging these stochastic gradients and performs a standard gradient descent step.
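For concreteness, here is a minimal Python sketch of one such synchronous step under the fixed computation time model (2); the helper grad(i, x), which returns a stochastic gradient of $f_i$ at $x$, is our own illustrative assumption rather than part of the method's specification.

import numpy as np

def naive_minibatch_step(x, grad, taus, gamma):
    # One synchronous step: average one stochastic gradient per worker, then descend.
    n = len(taus)
    g = np.mean([grad(i, x) for i in range(n)], axis=0)
    elapsed = max(taus)  # the slowest worker dictates the duration of the iteration
    return x - gamma * g, elapsed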

Limitations.

This synchronous approach has a significant computational bottleneck. Each iteration requires waiting for the slowest worker to complete its gradient computation, resulting in an iteration time of $\tau_n := \max_{i\in[n]}\tau_i$ (2). Consequently, faster workers remain idle while waiting for stragglers, which leads to inefficient utilization of available computational resources.

Theoretical Performance.

From a convergence perspective, Naive Minibatch SGD achieves the iteration complexity $\mathcal{O}\left(\frac{L_f\Delta}{\varepsilon}\left(1+\frac{\sigma^2}{n\varepsilon}\right)\right)$ to reach an $\varepsilon$–stationary point (Cotter et al., 2011; Goyal et al., 2017; Gower et al., 2019). The corresponding time complexity becomes

\mathcal{O}\left(\frac{\tau_{n}L_{f}\Delta}{\varepsilon}\left(1+\frac{\sigma^{2}}{n\varepsilon}\right)\right).

This motivates the development of methods that can better utilize fast workers without waiting for stragglers.

3.2 Malenia SGD

Malenia SGD (Tyurin & Richtárik, 2024) addresses the straggler problem of Naive Minibatch SGD by ensuring continuous worker utilization, rather than waiting for the slowest worker in each iteration. However, it is fundamentally a Minibatch SGD algorithm: in every iteration, it constructs an unbiased gradient estimator from multiple gradients and then performs a synchronous update. The key distinction lies in how the batch is collected—Malenia SGD gathers gradients asynchronously, allowing potentially multiple contributions from the same worker within a single iteration, unlike Naive Minibatch SGD.

Algorithm Description.

At each iteration $k$, all workers continuously compute stochastic gradients at the current point $x^k$. The server accumulates these gradients until the stopping condition

\left(\frac{1}{n}\sum_{i=1}^{n}\frac{1}{b_{i}^{k}}\right)^{-1}\geq\max\left\{1,\frac{\sigma^{2}}{n\varepsilon}\right\} \qquad (3)

is satisfied, where $b_i^k$ denotes the number of gradients received from worker $i$. Once this condition is met, the server performs the update

x^{k+1}=x^{k}-\gamma\bar{g}^{k}:=x^{k}-\gamma\frac{1}{n}\sum_{i=1}^{n}\bar{g}_{i}^{k}~,

where $\bar{g}_i^k$ is the average of the stochastic gradients received from worker $i$:

\bar{g}_{i}^{k}=\frac{1}{b_{i}^{k}}\sum_{j=1}^{b_{i}^{k}}\nabla f_{i}\left(x^{k};\xi_{i}^{k,j}\right).

Because stochastic gradients are collected asynchronously, but the update is performed only after all required gradients are received, the algorithm can be regarded as semi-asynchronous—asynchronous in gradient collection but synchronous in parameter updates.
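The gradient-collection loop can be sketched in a few lines of Python (a sketch under our notation; the helpers grad(i, x) and next_arrival(), which returns the index of the next worker to finish a computation, are illustrative assumptions):

import numpy as np

def malenia_round(x, grad, n, gamma, sigma2, eps, next_arrival):
    # Accumulate per-worker gradient sums until the stopping condition (3) holds.
    G = [np.zeros_like(x) for _ in range(n)]
    b = [0] * n
    while True:
        i = next_arrival()                       # worker i finishes a gradient at x
        G[i] += grad(i, x)
        b[i] += 1
        if min(b) > 0 and n / sum(1.0 / c for c in b) >= max(1.0, sigma2 / (n * eps)):
            break                                # condition (3) is satisfied
    g_bar = sum(Gi / bi for Gi, bi in zip(G, b)) / n
    return x - gamma * g_bar                     # synchronous update, broadcast to all workers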

Stopping Condition Rationale.

The left-hand side of condition (3) appears in the variance of the gradient estimator $\bar{g}^k$. Ensuring that this quantity is sufficiently large allows the algorithm to control the variance at each step. Moreover, condition (3) guarantees that every worker computes at least one gradient, which in turn yields a smaller variance than that of the estimator in Naive Minibatch SGD.

Theoretical Performance.

This asynchronous gradient collection strategy achieves the optimal time complexity

\mathcal{O}\left(\frac{L_{f}\Delta}{\varepsilon}\left(\tau_{n}+\tau_{\mathrm{avg}}\frac{\sigma^{2}}{n\varepsilon}\right)\right),

as shown by Tyurin & Richtárik (2024). The key improvement over Naive Minibatch SGD is that the variance term $\sigma^2$ is now multiplied by $\tau_{\mathrm{avg}}$ instead of $\tau_n$. Since $\tau_{\mathrm{avg}} \ll \tau_n$ in computationally heterogeneous environments, Malenia SGD can provide a substantial speedup in such regimes.
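As a concrete illustration (an example we add; it does not appear in the original text), suppose $n-1$ workers each take $\tau_i = 1$ second while a single straggler takes $\tau_n = n$ seconds. Then

\tau_{\mathrm{avg}}=\frac{(n-1)\cdot 1+n}{n}=2-\frac{1}{n}<2,\qquad \tau_n=n,

so in the high-noise regime, where the $\sigma^2$ term dominates, Malenia SGD improves over Naive Minibatch SGD by a factor of roughly $n/2$.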

The algorithm’s benefits are most pronounced in high-noise settings (where $\sigma^2/(n\varepsilon)$ is large). In low-noise scenarios, or when $\sigma = 0$, Malenia SGD offers no advantage over Naive Minibatch SGD, since collecting multiple gradients per worker provides no benefit in terms of variance reduction.

Limitations.

The main limitation of Malenia SGD is its synchronous nature. After collecting the necessary gradients, the server must broadcast the updated model to all workers simultaneously. This all-at-once synchronization creates substantial communication overhead, which can become a major scalability bottleneck in large-scale or bandwidth-limited environments.

Moreover, the synchronization process forces the server to discard ongoing gradient computations from workers who are actively computing when the broadcast occurs. Since workers operate continuously during the gradient collection phase, they must abandon their current computations upon receiving the new model, wasting valuable computational resources that could otherwise contribute to convergence.

Additionally, Malenia SGD requires prior knowledge of the noise level $\sigma$ and the target accuracy $\varepsilon$ to evaluate the stopping condition (3). This dependence on problem-specific parameters, which are often unknown or difficult to estimate in practice, significantly limits its practical applicability.

These synchronization bottlenecks motivate the development of asynchronous optimal methods. Beyond avoiding coordination overhead, asynchronous approaches enable immediate model updates upon gradient arrival, which can accelerate convergence through more frequent parameter updates. This immediate processing is particularly beneficial for sparse models where gradients typically affect disjoint parameter subsets, allowing parallel updates without interference (Recht et al., 2011).

3.3 Toward Asynchronous Methods

A naive approach to making optimization asynchronous would be to update the model immediately upon receiving any gradient, using

x^{k+1}=x^{k}-\gamma\nabla f_{i_{k}}\left(x^{k-\delta^{k}};\xi_{i_{k}}^{k-\delta^{k}}\right),

where $i_k$ denotes the worker that sent the gradient at iteration $k$, and $\delta^k \geq 0$ is its delay, i.e., the number of server updates that occurred while the gradient was being computed. Delays arise naturally in asynchronous execution: fast workers return gradients quickly and proceed with updated models, while slower workers compute on stale iterates; by the time they return their gradients, several server updates may already have occurred. Consequently, $\delta^k$ is only determined when the server receives a gradient: at that point, the server knows both the iteration index at which the worker's model was issued and the current iteration index, and $\delta^k$ is simply the difference between the two.

Limitations.

This approach suffers from a fundamental bias problem when workers have heterogeneous data distributions. Faster workers send updates more frequently, causing their local objectives to dominate the optimization and pull the model toward their own minimizers. Slower workers, when their updates finally arrive, push the model back toward different solutions. As a result, the iterates may oscillate or stagnate, failing to converge to a stationary point.

This bias also makes theoretical analysis difficult. Classical SGD-style proofs rely on one-step progress toward minimizing the global function, but here each update direction reflects a different local objective. Without additional data similarity assumptions (Mishchenko et al., 2022; Koloskova et al., 2022; Nguyen et al., 2022; Islamov et al., 2024), it becomes impossible to extend the analysis to the global function—yet such assumptions are rarely realistic when data can be arbitrarily heterogeneous across machines or organizations.

The root cause is that each update uses a gradient from one worker only. A better strategy is to incorporate information from all workers, even if some gradients are computed at stale iterates. This idea motivates methods such as Incremental Aggregated Gradient (IAG) (Blatt et al., 2007; Gurbuzbalaban et al., 2017; Vanli et al., 2018) and Stochastic Averaged Gradient (SAG) (Roux et al., 2012; Schmidt et al., 2017), which maintain and aggregate gradients from all workers.

3.4 IA2SGD

As discussed above, the key insight is to construct a gradient estimator using information from all workers at each model update. IA2SGD (Wang et al., 2025) achieves this by maintaining a gradient table on the server, similar to SAG or IAG, but with asynchronous table updates.

Algorithm Overview.

The server maintains a gradient table $\{g_i\}_{i=1}^{n}$ that stores the most recent gradient received from each of the $n$ workers. The table is initialized by collecting one gradient from each worker at the starting point $x^0$. The server then computes the first model update

x^{1}=x^{0}-\gamma\bar{g}:=x^{0}-\gamma\frac{1}{n}\sum_{i=1}^{n}g_{i}

and broadcasts $x^1$ to all workers. Subsequently, the workers compute stochastic gradients in parallel, and their corresponding table entries are updated asynchronously as soon as the computations finish.

At each iteration $k$, the server performs the following steps:

  1. Receive the gradient $\nabla f_{i_k}\left(x^{k-\delta_{i_k}^{k}};\xi_{i_k}^{k-\delta_{i_k}^{k}}\right)$ from worker $i_k$.

  2. Update the gradient table entry: $g_{i_k} = \nabla f_{i_k}\left(x^{k-\delta_{i_k}^{k}};\xi_{i_k}^{k-\delta_{i_k}^{k}}\right)$.

  3. Perform the model update: $x^{k+1} = x^k - \gamma\bar{g} = x^k - \gamma\frac{1}{n}\sum_{i=1}^{n}g_i$.

  4. Send the updated iterate $x^{k+1}$ to worker $i_k$ for its next computation.

The gradient estimator $\bar{g}$ combines the most recent gradient from each worker, ensuring that every update reflects information from the entire set of workers despite asynchronous execution. In this way, the method retains the statistical benefits of using all workers’ data while allowing them to operate independently, thereby avoiding the synchronization bottlenecks that limit scalability.

Note that, due to asynchrony, the stochastic gradients stored in the table generally correspond to different iterates of the model. We therefore record each worker $i$’s delay $\delta_i^k$ to track the iterate at which its gradient was computed.
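A minimal Python sketch of this server loop (our illustration; grad(i, x) and next_arrival() are assumed helpers as before, and the delay bookkeeping is left implicit):

import numpy as np

def ia2sgd(x0, grad, n, gamma, num_iters, next_arrival):
    x = np.asarray(x0, dtype=float).copy()
    point = [x.copy() for _ in range(n)]             # iterate each worker computes at
    table = [grad(i, point[i]) for i in range(n)]    # one (possibly stale) gradient per worker
    x = x - gamma * np.mean(table, axis=0)           # first update, broadcast to all workers
    point = [x.copy() for _ in range(n)]
    for _ in range(num_iters):
        j = next_arrival()                           # worker j returns a gradient
        table[j] = grad(j, point[j])                 # overwrite worker j's table entry
        x = x - gamma * np.mean(table, axis=0)       # update using the whole table
        point[j] = x.copy()                          # only worker j receives the new iterate
    return x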

Theoretical Performance.

The iteration complexity of this algorithm was established by Wang et al. (2025), and by a straightforward conversion to the fixed computation time model (see Appendix D), we obtain the corresponding time complexity

\mathcal{O}\left(\frac{\tau_{n}L_{\max}\Delta}{\varepsilon}\left(1+\frac{\sigma^{2}}{n\varepsilon}\right)\right),

which matches the complexity of Naive Minibatch SGD. This indicates that asynchronous execution alone does not provide computational advantages over synchronous approaches. Thus, a fundamental challenge lies in how gradient delay affects convergence, which we address next.

3.5 Motivation

The primary limitation of asynchronous algorithms stems from gradient delay, which can significantly degrade convergence performance. Large delays can cause the optimization steps to follow suboptimal trajectories, which disrupts convergence.

This delay problem occurs even in homogeneous settings where all functions $f_i$ are equal ($f_i \equiv f$ for all $i \in [n]$). The state-of-the-art solution for this case, Ringmaster ASGD (Maranjyan et al., 2025d), achieves optimal time complexity by explicitly controlling the delay to prevent it from becoming large. Ringmaster ASGD accomplishes this by discarding gradients that arrive with large delays.

Unfortunately, this gradient-discarding strategy fails in the heterogeneous setting of IA2SGD. The fundamental issue is that slow workers inevitably experience large delays due to their computational constraints. If we ignore or discard their delayed gradients, the corresponding table entries remain outdated and may never be updated again if subsequent gradients also arrive late and are discarded. This creates a persistent information bottleneck that degrades the quality of the gradient estimator and harms convergence.

This issue suggests we should prevent such situations from occurring by controlling the number of updates performed using fast workers. The simplest approach would be to ignore some updates from fast workers, but this contradicts the core principle of asynchronous methods whereby all workers compute gradients continuously.

Instead, our approach buffers the gradients produced by fast workers rather than applying them immediately, similar to the strategy in Malenia SGD. By buffering gradients and performing a model update only once a sufficient number have been collected, we control the delays induced by slow workers while keeping all workers continuously active. This buffering mechanism provides an additional advantage: when multiple gradients computed at the same iterate are aggregated, they yield lower-variance estimates, thereby improving convergence.
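To see why this helps, note that if a worker contributes $b$ independent stochastic gradients computed at the same point $x$, their average satisfies the standard identity (spelled out here for completeness)

\mathbb{E}\left[\left\|\frac{1}{b}\sum_{j=1}^{b}\nabla f_i\left(x;\xi_i^{j}\right)-\nabla f_i(x)\right\|^{2}\right]=\frac{1}{b}\,\mathbb{E}\left[\left\|\nabla f_i\left(x;\xi_i^{1}\right)-\nabla f_i(x)\right\|^{2}\right]\leq\frac{\sigma^{2}}{b},

so buffered gradients from fast workers directly reduce that worker's contribution to the variance of the update.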

4 Ringleader ASGD

We now formally introduce our algorithm, Ringleader ASGD.

Ringleader ASGD builds upon three key insights from existing methods. First, inspired by IA2SGD, we maintain a gradient table to ensure that information from all workers is incorporated into every update, which eliminates the need for data similarity assumptions between workers. (It is not strictly necessary to maintain a table on the server. An alternative, as in IA2SGD (Wang et al., 2025), is to store one vector on each worker along with an additional vector on the server. However, this can be problematic when workers have limited memory. For clarity and simplicity, we adopt the server-side table formulation in our description.) Second, following Ringmaster ASGD, we recognize that controlling gradient delay is essential for efficient asynchronous optimization. Third, drawing from Malenia SGD, we use buffering of stochastic gradients—rather than discarding delayed ones—to control delays while preserving valuable computations, enabling continuous utilization of all resources.

An important property of the algorithm is that all workers remain continuously active, computing stochastic gradients. As soon as a worker finishes computing a gradient, it immediately sends it to the server. Recall that we assumed communication is instantaneous, i.e., takes zero time (Section 2.1). When the server receives a gradient, it either buffers it for later use or applies it immediately to perform a model update. If the gradient is buffered, no further action is taken and the worker simply continues computing and sending new gradients. If the server decides to perform an update, it updates the model and sends the updated model back to the worker that provided the gradient. This server-to-worker communication is also assumed instantaneous, after which the worker resumes computing gradients at the new model point, ensuring that workers are never idle.

The algorithm proceeds in rounds. In each round, exactly $n$ model updates are performed—one for each worker. Specifically, when a worker sends a stochastic gradient, the server may apply an update and return the updated model to that worker, but it ensures that each worker receives an updated model at most once per round. Repeating this procedure $n$ times ensures that every worker obtains exactly one fresh model per round, which in turn keeps delays bounded. To avoid discarding the computations of fast workers, the server buffers their gradients and applies them only at the appropriate moment, thereby guaranteeing that each round consists of exactly $n$ updates.

Each round consists of two phases:

  • Phase 1: Buffer stochastic gradients in a table until at least one gradient from each worker is available.

  • Phase 2: Perform exactly $n$ updates (one for each worker), then discard the old stochastic gradients from the table and return to Phase 1 to start collecting fresh ones.

The complete algorithm is formalized in Algorithm 1.

Algorithm 1 Ringleader ASGD (server algorithm)
1: Input: stepsize $\gamma > 0$, initial point $x^0 \in \mathbb{R}^d$
2: Initialization: broadcast $x^0$ to all workers, which then start running Algorithm 2 in parallel
3: Set $k = 0$, $S = \emptyset$; initialize $G_i = 0$, $b_i = 0$ for all $i \in [n]$
4: while True do
5:   — Phase 1: await stochastic gradients from all workers —
6:   while $S \neq [n]$ do
7:     Receive stochastic gradient $g_j^k$ (computed at $x^{k-\delta_j^k}$) from some worker $j \in [n]$
8:     $G_j = G_j + g_j^k$;  $b_j = b_j + 1$;  $S = S \cup \{j\}$
9:   end while
10:   — Phase 2: perform exactly one update for every worker —
11:   $x^{k+1} = x^k - \gamma \frac{1}{n}\sum_{i=1}^{n} G_i/b_i$ \diamond Update using averaged gradients from all workers
12:   Broadcast $x^{k+1}$ to worker $j$ \diamond Last worker to complete Phase 1
13:   $k = k + 1$;  $S = S \setminus \{j\}$
14:   $G_i^+ = 0$, $b_i^+ = 0$ for all $i \in [n]$;  $S^+ = \emptyset$ \diamond Initialize temporary buffer for the next round
15:   while $S \neq \emptyset$ do
16:     Receive stochastic gradient $g_j^k$ (computed at $x^{k-\delta_j^k}$) from some worker $j \in [n]$
17:     if $j \in S$ then
18:       $G_j = G_j + g_j^k$;  $b_j = b_j + 1$
19:       $x^{k+1} = x^k - \gamma \frac{1}{n}\sum_{i=1}^{n} G_i/b_i$
20:       Broadcast $x^{k+1}$ to worker $j$
21:       $k = k + 1$;  $S = S \setminus \{j\}$
22:     else
23:       $G_j^+ = G_j^+ + g_j^k$;  $b_j^+ = b_j^+ + 1$;  $S^+ = S^+ \cup \{j\}$ \diamond Buffer for next round
24:     end if
25:   end while
26:   $G_i = G_i^+$;  $b_i = b_i^+$ for all $i \in [n]$;  $S = S^+$ \diamond Transfer buffered gradients to main table
27: end while
Algorithm 2 Worker $i$'s subroutine
1: Input: model $x$
2: while True do
3:   Compute $g_i = \nabla f_i(x;\xi_i)$ using a freshly sampled data point $\xi_i \sim \mathcal{D}_i$
4:   Send $g_i$ to the server
5: end while
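To make the event-driven behavior of Algorithms 1 and 2 concrete, the following self-contained Python sketch simulates the method under the fixed computation time model (2) on a toy quadratic problem. It is an illustrative simulation under our own assumptions (the grad oracle, the toy objective, and all variable names are ours), not the implementation used in the experiments; it also tracks the largest delay encountered, which should respect the bound $\delta_i^k \leq 2n-2$ established in Section 4.2.

import heapq
import numpy as np

def ringleader_asgd_sim(taus, grad, x0, gamma, num_rounds=100):
    # Discrete-event simulation of Algorithm 1 under the fixed computation time model (2).
    # grad(i, x) returns a stochastic gradient of f_i at x; taus[i] is worker i's fixed time.
    n, k = len(taus), 0
    x = np.asarray(x0, dtype=float).copy()
    G  = [np.zeros_like(x) for _ in range(n)]; b  = [0] * n; S  = set()   # main table
    Gp = [np.zeros_like(x) for _ in range(n)]; bp = [0] * n; Sp = set()   # buffer (G^+, b^+, S^+)
    point = [x.copy() for _ in range(n)]   # model each worker is currently computing at
    ver   = [0] * n                        # iteration index of that model
    tver  = [0] * n                        # iteration index behind each main-table entry
    heap = [(taus[i], i) for i in range(n)]
    heapq.heapify(heap)
    phase2, max_delay = False, 0
    while k < num_rounds * n:
        t, j = heapq.heappop(heap)                  # next worker to finish a gradient
        g = grad(j, point[j])
        do_update = False
        if not phase2:                              # Phase 1 (Lines 6-9): buffer until S = [n]
            G[j] += g; b[j] += 1; S.add(j); tver[j] = ver[j]
            if len(S) == n:                         # table complete: first update of the round
                phase2, do_update = True, True
        elif j in S:                                # Phase 2, worker j not yet updated (Lines 18-21)
            G[j] += g; b[j] += 1; tver[j] = ver[j]
            do_update = True
        else:                                       # already updated this round: buffer (Line 23)
            Gp[j] += g; bp[j] += 1; Sp.add(j)
        if do_update:
            max_delay = max(max_delay, max(k - tver[i] for i in range(n)))
            x = x - gamma * np.mean([G[i] / b[i] for i in range(n)], axis=0)   # Lines 11 / 19
            k += 1; S.discard(j)
            point[j], ver[j] = x.copy(), k          # broadcast the new model to worker j only
            if not S:                               # round over: transfer the buffer (Line 26)
                G, b, S = Gp, bp, Sp
                Gp = [np.zeros_like(x) for _ in range(n)]; bp = [0] * n; Sp = set()
                tver = list(ver)
                phase2 = False
        heapq.heappush(heap, (t + taus[j], j))      # the worker immediately starts a new gradient
    return x, max_delay

# Toy usage: two quadratics f_i(x) = a_i ||x||^2 / 2 with noisy gradients and a slow worker.
rng = np.random.default_rng(0)
a = [1.0, 3.0]
grad = lambda i, x: a[i] * x + rng.normal(scale=0.1, size=x.shape)
x_final, max_delay = ringleader_asgd_sim(taus=[1.0, 5.0], grad=grad,
                                         x0=np.ones(10), gamma=0.05, num_rounds=200)
print(np.linalg.norm(x_final), max_delay)   # max_delay never exceeds 2*n - 2 = 2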

4.1 Detailed Description

Initialization.

The algorithm begins by broadcasting the initial point $x^0 \in \mathbb{R}^d$ to all workers, which then start executing the worker subroutine (Algorithm 2). Each worker continuously computes stochastic gradients at its current point and sends them to the server until instructed to stop, at which point the server provides a new point to resume from. This design ensures that workers remain fully utilized and never idle.

The server maintains a gradient table with entries $\{(G_i, b_i)\}_{i=1}^{n}$, all initialized to zero. Here, $G_i$ accumulates gradients, while $b_i$ tracks the number of stochastic gradients received from worker $i$ to form proper minibatch estimators, with $b_i = 0$ for all $i \in [n]$ at the start.

Before Phase 1 begins, we also initialize the set $S = \emptyset$, which tracks which table entries contain at least one stochastic gradient. Since no gradients have yet arrived, $S$ is empty.

Phase 1 — Gradient Collection.

In this phase, the server continuously receives stochastic gradients from the workers and stores them in the gradient table $\{(G_i, b_i)\}_{i=1}^{n}$. We denote by $g_j^k$ the stochastic gradient sent by worker $j$ at iteration $k$, which is computed at the point $x^{k-\delta_j^k}$ using an i.i.d. sample $\xi_j \sim \mathcal{D}_j$.

We do not specify how the delays $\delta_j^k$ evolve, since this information is not needed to run the algorithm: whenever necessary, a delay can be obtained as the difference between the current iteration index and the index of the iterate at which the stochastic gradient was computed. The delays play a role only in the analysis, not in the execution of the method.

Upon receiving $g_j^k$ from worker $j$ (Line 7), the server updates the corresponding table entry and the stochastic gradient counter as follows (Line 8):

G_{j}=G_{j}+g_{j}^{k}~,\quad b_{j}=b_{j}+1~,\quad S=S\cup\{j\}~.

This process continues until $S = [n]$, i.e., the server has collected at least one gradient from every worker. No model updates are performed during this phase, and workers do not receive new points; hence, all stochastic gradients from a given worker are computed at the same point.

Phase 2 — Sequential Updates.

In this phase, the server performs exactly one model update for each worker $i$, for a total of $n$ updates. Phase 2 starts with the last worker that completed Phase 1, i.e., the worker whose gradient made the table complete. The server first computes an update by averaging the accumulated stochastic gradients in the table $\{(G_i, b_i)\}_{i=1}^{n}$ and taking a descent step with this estimate (Line 11). The updated model is then sent to this worker (Line 12), the worker is removed from the set $S$, and the iteration counter is incremented (Line 13).

Next, the server must update the remaining $n-1$ workers. These updates are performed sequentially as soon as each worker finishes its current computation. During this waiting period, new gradients may arrive from workers not in $S$—for example, the last updated worker may send another stochastic gradient before the other workers complete their computations. Since discarding these gradients would waste information, they are instead buffered for the next round.

Temporary Table Management.

To achieve this, the server maintains a temporary table $\{(G_i^+, b_i^+)\}_{i=1}^{n}$, initialized to zero (Line 14), together with a set $S^+$ that records which workers have contributed to it. Whenever a gradient arrives from a worker not in $S$, it is stored in the temporary table (Line 23).

If instead the gradient comes from a worker $j \in S$—i.e., one of the workers whose model we still need to update—the server first updates the main table $\{(G_i, b_i)\}_{i=1}^{n}$ with this new stochastic gradient (Line 18). It then performs a model update by again averaging the accumulated stochastic gradients in the table (Line 19), broadcasts the new model to worker $j$ (Line 20), increments the iteration counter, and removes $j$ from the set $S$ (Line 21).

Preparing for the Next Round.

Once all workers in $S$ have been updated and Phase 2 is complete ($S = \emptyset$), the server prepares for the next round by copying the contents of the temporary table $\{(G_i^+, b_i^+)\}_{i=1}^{n}$ into the main table $\{(G_i, b_i)\}_{i=1}^{n}$ (Line 26). The set $S$ is also reset to $S = S^+$, since these workers have already contributed gradients at their updated models. Entries in the main table corresponding to workers not in $S^+$ remain zero, as the temporary table was initialized with zeros at the start of Phase 2 (Line 14).

The server can now proceed to the next round by returning to Phase 1 and beginning a new gradient collection phase.

4.2 Delay Control Analysis

The structure of Ringleader ASGD, with its two phases, is specifically designed to prevent the unbounded delays that arise in standard asynchronous methods. To understand why this works, consider that in each round we perform exactly $n$ updates—one per worker—before moving to the next round. This ensures that no worker can fall more than one full round behind the current iteration. The precise bound on the delays is given in the following lemma.

Lemma (Bounded Delay). In Ringleader ASGD (Algorithm 1), the delays $\delta_i^k$ satisfy

\delta_{i}^{k}\leq 2n-2~,

for any worker $i \in [n]$ and any iteration $k \geq 0$.

Proof.

We prove this by analyzing the structure of Ringleader ASGD. The algorithm operates in rounds, where each round consists of Phase 1 (gradient collection) followed by Phase 2 (sequential updates). In each Phase 2, the server performs exactly $n$ updates, one for each worker. Phase 2 begins at iterations $0, n, 2n, 3n, \ldots$, i.e., at multiples of $n$.

First round (iterations $0$ to $n-1$):

Initially, all workers compute gradients at the point $x^0$, so during iterations $0, 1, \ldots, n-1$, the server receives gradients computed at $x^0$. For any iteration $k$ in this range, the server processes stochastic gradients computed at the point $x^{k-\delta_i^k} = x^0$, so $\delta_i^k = k \leq n-1$. Thus, delays simply increment during this first Phase 2.

At the end of this round, each worker $i$ has received a new model $x^j$ for some $j \in \{1, 2, \ldots, n\}$, and these update iterations are distinct across workers.

Second round (iterations $n$ to $2n-1$):

At the start of the second Phase 2 (at iteration $n$), the gradient table contains, for worker $i$, gradients computed at the point $x^{n-\delta_i^n}$, where $\delta_i^n \in \{0, 1, \ldots, n-1\}$. These delay values are distinct across workers, since each worker received its update at a different iteration in the previous round.

During iterations $n$ to $2n-1$, these delays increase by $1$ at each iteration, for the same reason as in the first Phase 2, giving $\delta_i^{2n-1} \in \{n-1, n, \ldots, 2n-2\}$ at the end of this round. At the same time, all workers receive new points to compute gradients from, so during the next Phase 2 the delays will again be distinct across workers and contained in $\{0, 1, \ldots, n-1\}$.

General pattern:

By induction, at the beginning of each round starting at iteration $cn$ (for integer $c \geq 1$), the delays $\delta_i^{cn}$ take distinct values in $\{0, 1, \ldots, n-1\}$. During each Phase 2, these delays increase by at most $n-1$, giving the bound

\delta_{i}^{k}\leq(n-1)+(n-1)=2n-2~.

4.3 Comparison to IA2SGD

Our method is a delay-controlled version of IA2SGD. IA2SGD can be recovered by removing Phase 1 (gradient collection) and the round structure of Phase 2, thus performing updates naively—immediately upon gradient arrival. In contrast, our algorithm operates in structured rounds, performing exactly one update per worker in each round, which provides the crucial delay control that IA2SGD lacks.

In IA2SGD, delays for slow workers can grow unboundedly because the server continuously updates the model using gradients from fast workers, causing slow workers to fall increasingly behind. Our method prevents this issue by buffering the gradients from fast workers rather than applying them immediately, which ensures that every worker receives an updated model once in each round of $n$ updates.

4.4 Comparison to Malenia SGD

Malenia SGD also operates as an algorithm with two phases. In Phase 1, Malenia SGD collects gradients in a manner similar to our approach, but uses a different termination condition (3), which requires knowledge of the noise parameter $\sigma$ and the target stationarity level $\varepsilon$ and is therefore impractical. In Phase 2, Malenia SGD performs a synchronous update by averaging all collected gradients and broadcasting the new model to all workers simultaneously before returning to Phase 1. This synchronization forces Malenia SGD to discard ongoing gradient computations from workers that are active during the broadcast.

In contrast, our method performs Phase 2 asynchronously: we update workers sequentially as they complete their current gradient computations, which ensures that no computational work is wasted.

Regarding Malenia SGD’s termination condition (3), in Appendix E we demonstrate that this condition can be replaced with our simpler requirement of obtaining at least one gradient from every worker. With this modification, Malenia SGD remains optimal in the fixed-time regime (2) while becoming parameter-free, which eliminates the need for prior knowledge of $\sigma$ and $\varepsilon$. Under this parameter-free variant, the only difference between Malenia SGD and Ringleader ASGD lies in Phase 2: we perform updates asynchronously without discarding gradients, while Malenia SGD operates synchronously.

5 Theoretical Results

Before presenting the theoretical results, we first write the algorithm in a compact form. The gradients for each worker in the table are all computed at the same point; for worker $i$ at iteration $k$, this point is $x^{k-\delta_i^k}$. The update rule can be written compactly as

x^{k+1}=x^{k}-\gamma\bar{g}^{k}~,

where the gradient estimator $\bar{g}^k$ is defined by

\bar{g}^{k}:=\frac{1}{n}\sum_{i=1}^{n}\bar{g}_{i}^{k}:=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{b_{i}^{k}}\sum_{j=1}^{b_{i}^{k}}g_{i}^{k,j}~.

Since multiple gradients may be received from the same worker, we denote by $g_i^{k,j}$ the $j$-th gradient from worker $i$ at iteration $k$. Here the index $j$ corresponds to the i.i.d. sampled data point; more concretely,

g_{i}^{k,j}:=\nabla f_{i}\left(x^{k-\delta_{i}^{k}};\xi_{i}^{k-\delta_{i}^{k},j}\right)~.

The quantity $b_i^k$ denotes the number of gradients from worker $i$ stored in the table at iteration $k$, i.e., the value of $b_i$ in Lines 11 and 19. Thus, the pair $(b_i^k, \delta_i^k)$ fully determines the method's behavior at iteration $k$.

Note that the sequence $\{b_i^k\}$ depends only on the computation times $\{\tau_i\}$ and the algorithm design (i.e., the stopping rule for collecting gradients). Once these are fixed, the values $b_i^k$ for every $i \in [n]$ and every iteration $k$ are determined. Crucially, the $b_i^k$ do not depend on the method's hyperparameters $\gamma$ and $x^0$, nor on the variance parameter $\sigma$ or the stationarity level $\varepsilon$.

5.1 Iteration Complexity

Our convergence analysis follows the structured approach employed by Maranjyan et al. (2025d), which decomposes the proof into two key components: a descent lemma that tracks the progress of the objective function and a residual estimation lemma that controls the accumulated delays in the system.

We begin by establishing notation for the harmonic means of the batch sizes across rounds:

B^{k}=\left(\frac{1}{n}\sum_{i=1}^{n}\frac{1}{b_{i}^{k}}\right)^{-1},\quad\text{and}\quad B=\inf_{k\geq 0}B^{k}.

Note that $B \geq 1$, since by the algorithm’s design each $b_i^k \geq 1$. A sharper bound on $B$ will be established later in Section 5.2.
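For instance (a numerical illustration we add), if $n=3$ and at some iteration the table holds $b_1^k=4$, $b_2^k=2$, and $b_3^k=1$ gradients, then

B^{k}=\left(\frac{1}{3}\left(\frac{1}{4}+\frac{1}{2}+1\right)\right)^{-1}=\frac{12}{7}\approx 1.71,

so $B^k$ is governed by the least-represented worker rather than by the total number of collected gradients (here $7$).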

The first lemma quantifies how much the objective function decreases at each iteration, accounting for both the standard gradient descent progress and the additional complexities introduced by asynchronous updates.

Lemma (Descent Lemma; Proof in Section C.3). Under the bounded-variance and smoothness-type assumptions (Section 2.3), if the stepsize in Algorithm 1 satisfies $\gamma \leq \frac{1}{4L}$, then the following inequality holds:

\mathbb{E}\left[f\left(x^{k+1}\right)\right] \leq \mathbb{E}\left[f\left(x^{k}\right)\right]-\frac{\gamma}{2}\mathbb{E}\left[\left\|\nabla f\left(x^{k}\right)\right\|^{2}\right]-\frac{\gamma}{4}\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|^{2}\right]
\quad+\frac{\gamma L^{2}}{2n}\sum_{i=1}^{n}\mathbb{E}\left[\left\|x^{k}-x^{k-\delta_{i}^{k}}\right\|^{2}\right]+\frac{3\gamma^{2}L\sigma^{2}}{2B}
\quad+\gamma^{2}L\sum_{\ell=k-(k\bmod n)}^{k-1}\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{\ell-\delta_{i}^{\ell}}\right)\right\|^{2}\right].

This descent lemma shares a similar structure with its counterpart in the homogeneous setting analyzed by Maranjyan et al. (2025d), but with a crucial additional term. The final summation term in the upper bound captures the effect of using stale gradients from the gradient table—a phenomenon we refer to as “table delay”. This term is absent in the homogeneous case because no gradient table is maintained. Indeed, when $n = 1$, our setting reduces to the homogeneous case, the gradient table becomes unnecessary, and this additional term vanishes, recovering the original descent lemma established by Maranjyan et al. (2025d).

Next, similarly to Maranjyan et al. (2025d), we derive a lemma that bounds the term involving the difference between current and old points.

Lemma (Residual Estimation; Proof in Section C.4). Under the smoothness-type assumption (Section 2.3), the iterates of Ringleader ASGD (Algorithm 1) with stepsize $\gamma \leq \frac{1}{4nL}$ satisfy the following bound:

\frac{1}{K}\sum_{k=0}^{K-1}\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\left[\left\|x^{k}-x^{k-\delta_{i}^{k}}\right\|^{2}\right]\leq\frac{2\gamma n}{LK}\sum_{k=0}^{K-1}\mathbb{E}\left[\left\|\frac{1}{n}\sum_{j=1}^{n}\nabla f_{j}\left(x^{k-\delta_{j}^{k}}\right)\right\|^{2}\right]+\frac{2\gamma\sigma^{2}}{LB}~.

Finally, combining these two lemmas, we obtain the iteration complexity.

Theorem (Iteration Complexity). Under the assumptions of Section 2.3, let the stepsize in Ringleader ASGD (Algorithm 1) be

\gamma=\min\left\{\frac{1}{8nL},\frac{\varepsilon B}{10L\sigma^{2}}\right\}.

Then,

\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\left[\left\|\nabla f\left(x^{k}\right)\right\|^{2}\right]\leq\varepsilon~,

for

K\geq\frac{32nL\Delta}{\varepsilon}+\frac{40L\Delta\sigma^{2}}{B\varepsilon^{2}}=\mathcal{O}\left(\frac{nL\Delta}{\varepsilon}\left(1+\frac{\sigma^{2}}{Bn\varepsilon}\right)\right).
Proof.

We start by averaging the inequality of the Descent Lemma over $K$ iterations and dividing both sides by $\gamma/2$:

\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\left[\left\|\nabla f\left(x^{k}\right)\right\|^{2}\right]+\frac{1}{2K}\sum_{k=0}^{K-1}\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|^{2}\right]
\leq\frac{2\Delta}{\gamma K}+\frac{3\gamma L\sigma^{2}}{B}
\quad+\frac{L^{2}}{n}\frac{1}{K}\sum_{k=0}^{K-1}\sum_{i=1}^{n}\mathbb{E}\left[\left\|x^{k}-x^{k-\delta_{i}^{k}}\right\|^{2}\right]
\quad+2\gamma L\frac{1}{K}\sum_{k=0}^{K-1}\sum_{\ell=k-(k\bmod n)}^{k-1}\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{\ell-\delta_{i}^{\ell}}\right)\right\|^{2}\right].

We now bound the third term on the right-hand side using the Residual Estimation lemma:

\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\left[\left\|\nabla f\left(x^{k}\right)\right\|^{2}\right]+\frac{1}{2K}\sum_{k=0}^{K-1}\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|^{2}\right]
\leq\frac{2\Delta}{\gamma K}+\frac{3\gamma L\sigma^{2}}{B}+\frac{2\gamma L\sigma^{2}}{B}
\quad+2\gamma Ln\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\left[\left\|\frac{1}{n}\sum_{j=1}^{n}\nabla f_{j}\left(x^{k-\delta_{j}^{k}}\right)\right\|^{2}\right]
\quad+2\gamma L\frac{1}{K}\sum_{k=0}^{K-1}\sum_{\ell=k-(k\bmod n)}^{k-1}\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{\ell-\delta_{i}^{\ell}}\right)\right\|^{2}\right]
\leq\frac{2\Delta}{\gamma K}+\frac{5\gamma L\sigma^{2}}{B}
\quad+2\gamma Ln\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\left[\left\|\frac{1}{n}\sum_{j=1}^{n}\nabla f_{j}\left(x^{k-\delta_{j}^{k}}\right)\right\|^{2}\right]
\quad+2\gamma Ln\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|^{2}\right].

Now, using the bound $\gamma \leq \frac{1}{8nL}$, we obtain

\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\left[\left\|\nabla f\left(x^{k}\right)\right\|^{2}\right]\leq\frac{2\Delta}{\gamma K}+\frac{5\gamma L\sigma^{2}}{B}~.

Finally, plugging in the stepsize and the expression for $K$ ensures that the right-hand side is bounded by $\varepsilon$. ∎

For parallel and asynchronous methods, iteration complexity is less important than time complexity. What truly matters is how quickly we can finish training. We are willing to perform more iterations and extra computation if it means completing the process faster. Having established the iteration complexity, we now turn to the time complexity.

5.2 Time Complexity

Since the algorithm operates in rounds with n steps per round, and its iteration complexity is already known, it remains to determine the duration of each round. We have the following lemma. {boxedlemma} Each block of n consecutive iterations (one round) of Algorithm 1 takes at most 2τ_n seconds. Moreover, we have

B\geq\frac{\tau_{n}}{2}\left(\frac{1}{n}\sum_{i=1}^{n}\tau_{i}\right)^{-1}=\frac{\tau_{n}}{2\tau_{\mathrm{avg}}}.
Proof.

The upper bound of 2τ_n follows from the structure of the algorithm, which consists of two phases. In the first phase, the server waits until all workers complete at least one gradient computation, which takes at most τ_n seconds. In the second phase, the server applies the received gradients and waits for all ongoing computations to finish—which again takes at most τ_n seconds. Thus, the total time for n iterations is bounded by 2τ_n.

We now prove the second part of the lemma. Recall that

B=\inf_{k\geq 0}B^{k}=\inf_{k\geq 0}\left(\frac{1}{n}\sum_{i=1}^{n}\frac{1}{b_{i}^{k}}\right)^{-1}\geq\left(\frac{1}{n}\sum_{i=1}^{n}\frac{1}{b_{i}}\right)^{-1},

where we define

b_{i}=\inf_{k\geq 0}b_{i}^{k}.

We are interested in the number of gradients stored in the table at iteration k. This count includes gradients computed during Phase 1 plus one additional gradient from Phase 2 (except for the worker that finished Phase 1 last).

Since every worker needs to compute at least one gradient during Phase 1, the slowest worker takes τ_n seconds to complete a single gradient computation. During this τ_n-second interval, the faster workers i < n may still be finishing gradients from the previous round's Phase 2 before starting their Phase 1 computations for the current round.

After completing any remaining Phase 2 work (which takes at most τ_i seconds), worker i has at least τ_n − τ_i seconds remaining to compute additional gradients for the current round's Phase 1. The number of gradients that worker i can compute therefore satisfies

b_{i}\geq\max\left\{1,\left\lceil\frac{\tau_{n}-\tau_{i}}{\tau_{i}}\right\rceil\right\}\geq\max\left\{1,\frac{\tau_{n}}{\tau_{i}}-1\right\}.

For workers i with τ_n ≥ 2τ_i, we have

\frac{\tau_{n}}{\tau_{i}}-1\geq\frac{\tau_{n}}{2\tau_{i}},

and hence

b_{i}\geq\frac{\tau_{n}}{2\tau_{i}}.

For the remaining workers, τ_n < 2τ_i implies τ_n/(2τ_i) < 1 ≤ b_i, so this bound in fact holds for all i ∈ [n]. Plugging it into the expression for B gives the claimed result. ∎
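As a quick numerical sanity check of this bound (not part of the algorithm), the following Python sketch plugs hypothetical compute times τ_1 ≤ … ≤ τ_n into the per-worker lower bound b_i = max{1, ⌈(τ_n − τ_i)/τ_i⌉} from the proof and verifies that the harmonic mean B stays above τ_n/(2τ_avg); all variable names are ours.

import math

# Hypothetical per-worker compute times (seconds per gradient), sorted so tau[-1] = tau_n.
tau = sorted([1.0, 1.5, 4.0, 10.0, 25.0])
n = len(tau)
tau_n = tau[-1]
tau_avg = sum(tau) / n

# Lower bound on the number of gradients worker i contributes in one round (from the proof).
b = [max(1, math.ceil((tau_n - t) / t)) for t in tau]

# Harmonic-mean batch size B and the claimed lower bound tau_n / (2 * tau_avg).
B = n / sum(1.0 / bi for bi in b)
lower_bound = tau_n / (2 * tau_avg)

print(f"b = {b}, B = {B:.3f}, tau_n/(2*tau_avg) = {lower_bound:.3f}")
assert B >= lower_bound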

Based on this lemma, we derive the time complexity guarantee of our algorithm. {boxedtheorem} Let Assumptions 2.3, 2.3, and 2.3 hold, and let the stepsize in Ringleader ASGD (Algorithm 1) be γ = min{1/(8nL), εB/(10Lσ²)}. Then, under the fixed computation model (2), Ringleader ASGD achieves the optimal time complexity

\mathcal{O}\left(\frac{L\Delta}{\varepsilon}\left(\tau_{n}+\tau_{\mathrm{avg}}\frac{\sigma^{2}}{n\varepsilon}\right)\right).
Proof.

We start with the iteration complexity from Section 5.1:

K\geq\frac{32nL\Delta}{\varepsilon}+\frac{40L\Delta\sigma^{2}}{B\varepsilon^{2}}=\mathcal{O}\left(\frac{nL\Delta}{\varepsilon}\left(1+\frac{\sigma^{2}}{Bn\varepsilon}\right)\right).

By Section 5.2, performing n steps takes at most 2τ_n seconds. The time complexity is therefore

2\tau_{n}\times\frac{K}{n}=\mathcal{O}\left(\tau_{n}\frac{L\Delta}{\varepsilon}\left(1+\frac{\sigma^{2}}{Bn\varepsilon}\right)\right).

It remains to plug in the bound B ≥ τ_n/(2τ_avg) from Section 5.2. ∎

The obtained time complexity consists of two key terms that illuminate the algorithm's behavior. The first term depends on the slowest device, which is fundamental since all devices must contribute to solving the problem. The second term, however, involves τ_avg rather than τ_n as in Naive Minibatch SGD (see Table 1)—this substitution captures the core benefit of asynchronous execution. Specifically, the advantage becomes pronounced when σ is relatively large. Intuitively, in high-noise regimes, collecting many gradients from workers is essential for convergence, and asynchronous methods can leverage faster workers more effectively. Conversely, in low-noise settings, fewer gradient evaluations suffice for good performance, making Naive Minibatch SGD already quite effective and rendering the additional complexity of asynchrony unnecessary.

This time complexity matches the lower bound derived by Tyurin & Richtárik (2024), making Ringleader ASGD the first asynchronous algorithm to achieve optimality under data heterogeneity.

6 Experiments

To validate our theoretical results, we perform a toy simulation.

We consider image classification on MNIST (LeCun et al., 2010) and Fashion-MNIST (Xiao et al., 2017), with standard normalization for both datasets. To enable equal client sizes, we first trim the dataset so that the total number of examples is divisible by the number of clients n = 100. To obtain heterogeneous datasets across clients, we then partition the trimmed set using an equal-size Dirichlet procedure with concentration parameter α = 0.1 (Yurochkin et al., 2019). For each client j ∈ [n], we draw proportions p_j ∼ Dirichlet(α, …, α) over the classes and compute a rounded class-allocation vector whose entries sum exactly to N/n, where N is the trimmed dataset size. This creates non-IID data in which each client has a skewed distribution over classes (with α = 0.1, a client typically sees only one or two classes frequently).

When assigning samples, we take the requested number from each class pool for client jj. If a class pool does not have enough remaining examples to meet the requested amount, the client receives whatever is left from that class and the shortfall is topped up using samples from other classes that still have available examples.
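For concreteness, here is a minimal Python sketch of the equal-size Dirichlet partitioning described above; the function name and the tie-breaking details of the rounding step are ours and may differ from the exact implementation used in the experiments.

import numpy as np

def equal_size_dirichlet_partition(labels, n_clients=100, alpha=0.1, seed=0):
    """Split sample indices into n_clients equal-size, label-skewed shards."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    n_classes = int(labels.max()) + 1
    per_client = len(labels) // n_clients          # dataset assumed trimmed to be divisible
    pools = [list(rng.permutation(np.where(labels == c)[0])) for c in range(n_classes)]

    shards = []
    for _ in range(n_clients):
        p = rng.dirichlet(alpha * np.ones(n_classes))        # skewed class proportions
        counts = np.floor(p * per_client).astype(int)
        counts[np.argmax(p)] += per_client - counts.sum()    # make the counts sum to per_client
        shard = []
        for c in range(n_classes):
            take = min(counts[c], len(pools[c]))             # take what the class pool still has
            shard += [pools[c].pop() for _ in range(take)]
        for c in range(n_classes):                           # top up any shortfall from other classes
            while len(shard) < per_client and pools[c]:
                shard.append(pools[c].pop())
        shards.append(np.array(shard))
    return shards

# Example with random labels standing in for (Fashion-)MNIST targets.
labels = np.random.default_rng(1).integers(0, 10, size=60_000)
shards = equal_size_dirichlet_partition(labels)
print(len(shards), len(shards[0]))  # 100 clients, 600 samples each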

Our model is a two-layer MLP, Linear(d, 128) → ReLU → Linear(128, 10), trained with mean cross-entropy. Stochastic gradients at the clients use a minibatch of size 4, while reported gradient norms are computed on the full dataset.
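A minimal PyTorch sketch of this model and loss, assuming input dimension d = 784 for the flattened 28×28 images (the variable names are ours):

import torch
import torch.nn as nn

d = 28 * 28  # flattened MNIST / Fashion-MNIST input dimension

model = nn.Sequential(
    nn.Flatten(),          # (batch, 1, 28, 28) -> (batch, 784)
    nn.Linear(d, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
criterion = nn.CrossEntropyLoss()  # mean cross-entropy over the minibatch

# One client-side stochastic gradient on a minibatch of size 4:
x, y = torch.randn(4, 1, 28, 28), torch.randint(0, 10, (4,))
loss = criterion(model(x), y)
loss.backward()            # gradients now sit in p.grad for p in model.parameters()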

We emulate heterogeneous compute by assigning each client i a base delay and a random jitter:

\tau_{i}=i+|\eta_{i}|,\qquad \eta_{i}\sim\mathcal{N}(0,i),\qquad\text{for all }i\in[n].
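A small sketch of this delay model in Python, under the assumption that N(0, i) denotes a Gaussian with variance i:

import numpy as np

def sample_delays(n=100, seed=0):
    rng = np.random.default_rng(seed)
    i = np.arange(1, n + 1)
    # Base delay i plus absolute Gaussian jitter; we assume N(0, i) means variance i.
    eta = rng.normal(loc=0.0, scale=np.sqrt(i))
    return i + np.abs(eta)

tau = sample_delays()
print(tau[:5], tau[-1])  # slower clients have larger index and larger delay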

For each method we tune the stepsize γ within a fixed wall-clock budget. We sweep

\gamma\in\{0.001,\,0.005,\,0.01,\,0.02,\,0.05,\,0.1,\,0.2,\,0.5,\,1.0,\,2.0\},

and then fix the best γ per method for evaluation.

We report the full-batch squared gradient norm ‖∇f(x^k)‖² versus wall-clock time. Each method is run 30 times with different seeds, and we report the median together with the interquartile range (IQR). To reduce high-frequency noise, we apply a centered moving-average smoothing to the aggregated curves (post-aggregation), keeping the initial point unchanged.
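The aggregation step can be summarized by the following sketch (our own helper names); it computes the per-time-point median and IQR across seeds and then applies a centered moving average to the aggregated curves, leaving the first point untouched.

import numpy as np

def aggregate_runs(curves, window=5):
    """curves: (n_seeds, n_points) array of squared gradient norms on a shared time grid."""
    curves = np.asarray(curves)
    med = np.median(curves, axis=0)
    q25, q75 = np.percentile(curves, [25, 75], axis=0)

    def smooth(y):
        # Centered moving average (post-aggregation); keep the initial point unchanged.
        out = np.convolve(y, np.ones(window) / window, mode="same")
        out[0] = y[0]
        return out

    return smooth(med), smooth(q25), smooth(q75)

# Example: 30 noisy, decaying runs on 200 time points.
rng = np.random.default_rng(0)
runs = np.exp(-np.linspace(0, 5, 200))[None, :] * rng.lognormal(0.0, 0.3, size=(30, 200))
med, lo, hi = aggregate_runs(runs)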

Figure 1 shows the results. We observe that Ringleader ASGD converges faster than both Malenia SGD and IA2SGD. Although theory suggests that Ringleader ASGD and Malenia SGD have the same time complexity, in practice Ringleader ASGD benefits from the n updates performed in Phase 2 instead of a single synchronous update. This design enables more optimization steps within the same wall-clock budget, which is especially advantageous when updates are sparse.

[Figure 1: two panels, (a) MNIST and (b) Fashion-MNIST.]
Figure 1: Convergence comparison showing the median squared gradient norm ‖∇f(x^k)‖² (solid lines) with interquartile ranges (shaded regions) versus wall-clock time, computed over 30 random seeds. Setup: two-layer MLP with architecture Linear(d, 128) → ReLU → Linear(128, 10) trained on (a) MNIST and (b) Fashion-MNIST. Client delays: heterogeneous delays simulated as τ_i = i + |η_i| with η_i ∼ N(0, i) for client i ∈ [n], where we choose n = 100. Results: with optimally tuned stepsizes, Ringleader ASGD achieves faster convergence than both Malenia SGD and IA2SGD, despite Ringleader ASGD and Malenia SGD having equivalent time complexity guarantees.

7 Conclusion

We have introduced Ringleader ASGD, the first asynchronous stochastic gradient method to achieve optimal time complexity under arbitrary data heterogeneity and arbitrarily heterogeneous computation times in distributed learning, without requiring similarity assumptions between workers’ datasets.

Its core innovation is a two-phase structure within each round: the model is updated once per worker (for a total of n updates), while a buffering mechanism manages gradient delays and preserves the efficiency of asynchronous execution. By maintaining a gradient table and alternating between gradient collection and sequential updates, Ringleader ASGD prevents the unbounded delays common in naive asynchronous methods. Every gradient received by the server is either used in the current round or stored for future use, ensuring no computation is wasted.

Our analysis shows that Ringleader ASGD matches the optimal time complexity bounds established by Tyurin & Richtárik (2024). In contrast to the optimal but synchronous Malenia SGD method, Ringleader ASGD is asynchronous and requires no prior knowledge of problem parameters in the algorithm design, making it practical for real-world deployments.

Finally, with a minor modification, Ringleader ASGD also achieves optimality in the more general setting of arbitrarily varying computation times (Appendix B).

Acknowledgments

The research reported in this publication was supported by funding from King Abdullah University of Science and Technology (KAUST): i) KAUST Baseline Research Scheme, ii) CRG Grant ORFS-CRG12-2024-6460, and iii) Center of Excellence for Generative AI, under award number 5940.

References

  • Agarwal & Duchi (2011) Alekh Agarwal and John C Duchi. Distributed delayed stochastic optimization. Advances in Neural Information Processing Systems, 24, 2011.
  • Alahyane et al. (2025) Abdelkrim Alahyane, Céline Comte, Matthieu Jonckheere, and Éric Moulines. Optimizing asynchronous federated learning: A delicate trade-off between model-parameter staleness and update frequency. arXiv preprint arXiv:2502.08206, 2025.
  • Arjevani et al. (2020) Yossi Arjevani, Ohad Shamir, and Nathan Srebro. A tight convergence analysis for stochastic gradient descent with delayed updates. In Algorithmic Learning Theory, pp. 111–132. PMLR, 2020.
  • Assran et al. (2020) Mahmoud Assran, Arda Aytekin, Hamid Reza Feyzmahdavian, Mikael Johansson, and Michael G Rabbat. Advances in asynchronous parallel and distributed optimization. Proceedings of the IEEE, 108(11):2013–2031, 2020.
  • Blatt et al. (2007) Doron Blatt, Alfred O Hero, and Hillel Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29–51, 2007.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
  • Chen et al. (2016) Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981, 2016.
  • Cohen et al. (2021) Alon Cohen, Amit Daniely, Yoel Drori, Tomer Koren, and Mariano Schain. Asynchronous stochastic optimization robust to arbitrary delays. Advances in Neural Information Processing Systems, 34:9024–9035, 2021.
  • Cotter et al. (2011) Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. Advances in Neural Information Processing Systems, 24, 2011.
  • Dean et al. (2012) Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc Le, and Andrew Ng. Large scale distributed deep networks. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper_files/paper/2012/file/6aca97005c68f1206823815f66102863-Paper.pdf.
  • Dutta et al. (2018) Sanghamitra Dutta, Gauri Joshi, Soumyadip Ghosh, Parijat Dube, and Priya Nagpurkar. Slow and stale gradients can win the race: Error-runtime trade-offs in distributed SGD. In International Conference on Artificial Intelligence and Statistics, pp. 803–812. PMLR, 2018.
  • Feyzmahdavian & Johansson (2023) Hamid Reza Feyzmahdavian and Mikael Johansson. Asynchronous iterations in optimization: New sequence results and sharper algorithmic guarantees. Journal of Machine Learning Research, 24(158):1–75, 2023.
  • Feyzmahdavian et al. (2016) Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson. An asynchronous mini-batch algorithm for regularized stochastic optimization. IEEE Transactions on Automatic Control, 61(12):3740–3754, 2016.
  • Fraboni et al. (2023) Yann Fraboni, Richard Vidal, Laetitia Kameni, and Marco Lorenzi. A general theory for federated optimization with asynchronous and heterogeneous clients updates. Journal of Machine Learning Research, 24(110):1–43, 2023.
  • Glasgow & Wootters (2022) Margalit R Glasgow and Mary Wootters. Asynchronous distributed optimization with stochastic delays. In International Conference on Artificial Intelligence and Statistics, pp. 9247–9279. PMLR, 2022.
  • Gower et al. (2019) Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtárik. SGD: General analysis and improved rates. In International Conference on Machine Learning, pp. 5200–5209. PMLR, 2019.
  • Goyal et al. (2017) Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  • Grattafiori et al. (2024) Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  • Gurbuzbalaban et al. (2017) Mert Gurbuzbalaban, Asuman Ozdaglar, and Pablo A Parrilo. On the convergence rate of incremental aggregated gradient algorithms. SIAM Journal on Optimization, 27(2):1035–1048, 2017.
  • Islamov et al. (2024) Rustem Islamov, Mher Safaryan, and Dan Alistarh. AsGrad: A sharp unified analysis of asynchronous-SGD algorithms. In International Conference on Artificial Intelligence and Statistics, pp. 649–657. PMLR, 2024.
  • J Reddi et al. (2015) Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alexander J Smola. On variance reduction in stochastic gradient descent and its asynchronous variants. Advances in Neural Information Processing Systems, 28, 2015.
  • Kairouz et al. (2021) Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
  • Koloskova et al. (2022) Anastasiia Koloskova, Sebastian U Stich, and Martin Jaggi. Sharper convergence guarantees for asynchronous SGD for distributed and federated learning. Advances in Neural Information Processing Systems, 35:17202–17215, 2022.
  • Konečný et al. (2016) Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. In NIPS Workshop on Private Multi-Party Machine Learning, 2016.
  • Leblond et al. (2018) Remi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. Improved asynchronous parallel optimization analysis for stochastic incremental methods. Journal of Machine Learning Research, 19(81):1–68, 2018. URL http://jmlr.org/papers/v19/17-650.html.
  • LeCun et al. (2010) Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
  • Li et al. (2014) Mu Li, David G Andersen, Alexander Smola, and Kai Yu. Communication efficient distributed machine learning with the parameter server. Advances in Neural Information Processing Systems, 27, 2014.
  • Li et al. (2020) Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429–450, 2020.
  • Lian et al. (2015) Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. Advances in Neural Information Processing Systems, 28, 2015.
  • Mania et al. (2017) Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, and Michael I Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. SIAM Journal on Optimization, 27(4):2202–2229, 2017.
  • Maranjyan et al. (2025a) Artavazd Maranjyan, Omar Shaikh Omar, and Peter Richtárik. Mindflayer SGD: Efficient parallel SGD in the presence of heterogeneous and random worker compute times. In The 41st Conference on Uncertainty in Artificial Intelligence, 2025a. URL https://openreview.net/forum?id=RNpvu3MSvm.
  • Maranjyan et al. (2025b) Artavazd Maranjyan, El Mehdi Saad, Peter Richtárik, and Francesco Orabona. ATA: Adaptive task allocation for efficient resource management in distributed machine learning. In International Conference on Machine Learning, 2025b.
  • Maranjyan et al. (2025c) Artavazd Maranjyan, Mher Safaryan, and Peter Richtárik. GradSkip: Communication-accelerated local gradient methods with better computational complexity. Transactions on Machine Learning Research, 2025c. ISSN 2835-8856. URL https://openreview.net/forum?id=6R3fRqFfhn.
  • Maranjyan et al. (2025d) Artavazd Maranjyan, Alexander Tyurin, and Peter Richtárik. Ringmaster ASGD: The first Asynchronous SGD with optimal time complexity. In International Conference on Machine Learning, 2025d.
  • McMahan et al. (2017) Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. PMLR, 2017.
  • McMahan et al. (2016) H Brendan McMahan, Eider Moore, Daniel Ramage, and Blaise Agüera y Arcas. Federated learning of deep networks using model averaging. arXiv preprint arXiv:1602.05629, 2:2, 2016.
  • Mishchenko et al. (2022) Konstantin Mishchenko, Francis Bach, Mathieu Even, and Blake E Woodworth. Asynchronous SGD beats minibatch SGD under arbitrary delays. Advances in Neural Information Processing Systems, 35:420–433, 2022.
  • Narayanan et al. (2021) Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–15, 2021.
  • Nesterov (2018) Yurii Nesterov. Lectures on Convex Optimization, volume 137. Springer, 2018.
  • Nguyen et al. (2022) John Nguyen, Kshitiz Malik, Hongyuan Zhan, Ashkan Yousefpour, Mike Rabbat, Mani Malek, and Dzmitry Huba. Federated learning with buffered asynchronous aggregation. In International Conference on Artificial Intelligence and Statistics, pp. 3581–3607. PMLR, 2022.
  • Nguyen et al. (2018) Lam Nguyen, Phuong Ha Nguyen, Marten Dijk, Peter Richtárik, Katya Scheinberg, and Martin Takác. SGD and Hogwild! convergence without the bounded gradients assumption. In International Conference on Machine Learning, pp. 3750–3758. PMLR, 2018.
  • Recht et al. (2011) Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. HOGWILD!: A lock-free approach to parallelizing stochastic gradient descent. Advances in Neural Information Processing Systems, 24, 2011.
  • Roux et al. (2012) Nicolas Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. Advances in Neural Information Processing Systems, 25, 2012.
  • Schmidt et al. (2017) Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162:83–112, 2017.
  • Shoeybi et al. (2019) Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
  • Tan et al. (2022) Alysa Ziying Tan, Han Yu, Lizhen Cui, and Qiang Yang. Towards personalized federated learning. IEEE Transactions on Neural Networks and Learning Systems, 34(12):9587–9603, 2022.
  • Tsitsiklis et al. (1986) John Tsitsiklis, Dimitri Bertsekas, and Michael Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, 1986.
  • Tyurin (2024) Alexander Tyurin. Tight time complexities in parallel stochastic optimization with arbitrary computation dynamics. arXiv preprint arXiv:2408.04929, 2024.
  • Tyurin & Richtárik (2024) Alexander Tyurin and Peter Richtárik. Optimal time complexities of parallel stochastic optimization methods under a fixed computation model. Advances in Neural Information Processing Systems, 36, 2024.
  • Tyurin & Richtárik (2024) Alexander Tyurin and Peter Richtárik. On the optimal time complexities in decentralized stochastic asynchronous optimization. Advances in Neural Information Processing Systems, 37, 2024.
  • Tyurin et al. (2024a) Alexander Tyurin, Kaja Gruntkowska, and Peter Richtárik. Freya PAGE: First optimal time complexity for large-scale nonconvex finite-sum optimization with heterogeneous asynchronous computations. Advances in Neural Information Processing Systems, 37, 2024a.
  • Tyurin et al. (2024b) Alexander Tyurin, Marta Pozzi, Ivan Ilin, and Peter Richtárik. Shadowheart SGD: Distributed asynchronous SGD with optimal time complexity under arbitrary computation and communication heterogeneity. Advances in Neural Information Processing Systems, 37, 2024b.
  • Vanli et al. (2018) N Denizcan Vanli, Mert Gurbuzbalaban, and Asuman Ozdaglar. Global convergence rate of proximal incremental aggregated gradient methods. SIAM Journal on Optimization, 28(2):1282–1300, 2018.
  • Wang et al. (2022a) Qiyuan Wang, Qianqian Yang, Shibo He, Zhiguo Shi, and Jiming Chen. AsyncFedED: Asynchronous federated learning with euclidean distance based adaptive weight aggregation. arXiv preprint arXiv:2205.13797, 2022a.
  • Wang et al. (2024) Xiaolu Wang, Zijian Li, Shi Jin, and Jun Zhang. Achieving linear speedup in asynchronous federated learning with heterogeneous clients. IEEE Transactions on Mobile Computing, 2024.
  • Wang et al. (2025) Xiaolu Wang, Yuchang Sun, Hoi To Wai, and Jun Zhang. Incremental aggregated asynchronous SGD for arbitrarily heterogeneous data, 2025. URL https://openreview.net/forum?id=m3x4kDbYAK.
  • Wang et al. (2022b) Zhongyu Wang, Zhaoyang Zhang, Yuqing Tian, Qianqian Yang, Hangguan Shan, Wei Wang, and Tony QS Quek. Asynchronous federated learning over wireless communication networks. IEEE Transactions on Wireless Communications, 21(9):6961–6978, 2022b.
  • Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • Xie et al. (2019) Cong Xie, Sanmi Koyejo, and Indranil Gupta. Asynchronous federated optimization. arXiv preprint arXiv:1903.03934, 2019.
  • Yurochkin et al. (2019) Mikhail Yurochkin, Mayank Agarwal, Soumya Ghosh, Kristjan Greenewald, Nghia Hoang, and Yasaman Khazaeni. Bayesian nonparametric federated learning of neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 7252–7261. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/yurochkin19a.html.
  • Zhang et al. (2023) Feilong Zhang, Xianming Liu, Shiyi Lin, Gang Wu, Xiong Zhou, Junjun Jiang, and Xiangyang Ji. No one idles: Efficient heterogeneous federated learning with parallel edge and server computation. In International Conference on Machine Learning, pp. 41399–41413. PMLR, 2023.
  • Zhao & Li (2016) Shen-Yi Zhao and Wu-Jun Li. Fast asynchronous parallel stochastic gradient descent: A lock-free approach with convergence guarantee. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
  • Zhao et al. (2018) Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.
  • Zheng et al. (2017) Shuxin Zheng, Qi Meng, Taifeng Wang, Wei Chen, Nenghai Yu, Zhi-Ming Ma, and Tie-Yan Liu. Asynchronous stochastic gradient descent with delay compensation. In International Conference on Machine Learning, pp. 4120–4129. PMLR, 2017.
  • Zhou et al. (2018) Kaiwen Zhou, Fanhua Shang, and James Cheng. A simple stochastic variance reduced algorithm with fast convergence rates. In International Conference on Machine Learning, pp. 5980–5989. PMLR, 2018.

Appendix A Related Work

Research on asynchronous stochastic gradient methods dates back to the seminal work of Tsitsiklis et al. (1986), and gained renewed momentum with the introduction of the HOGWILD! algorithm (Recht et al., 2011). HOGWILD! is fundamentally an asynchronous coordinate descent method: updates are performed lock-free with inconsistent reads and writes, and its convergence guarantees rely on sparsity assumptions that are rarely realistic in modern large-scale machine learning. Subsequent refinements of this paradigm include works by J Reddi et al. (2015); Zhao & Li (2016); Mania et al. (2017); Leblond et al. (2018); Nguyen et al. (2018); Zhou et al. (2018), but these works remain tied to the coordinate descent setting with inconsistent memory accesses, and thus differ substantially from our focus.

Closer to our setting are works where updates are based on gradients that are applied consistently. Early contributions, typically under the homogeneous-data assumption (all workers sample from the same distribution), include the work of Agarwal & Duchi (2011), who studied convex objectives, as well as later extensions to the non-convex case such as the work of Lian et al. (2015) and Dutta et al. (2018), the latter analyzing exponentially distributed computation times. Other relevant results in this line include those of Feyzmahdavian et al. (2016); Zheng et al. (2017); Arjevani et al. (2020); Feyzmahdavian & Johansson (2023), all of which assume fixed delays. More recently, delay-adaptive methods have been proposed, aiming to improve performance by down-weighting very stale gradients (Cohen et al., 2021; Koloskova et al., 2022; Mishchenko et al., 2022).

Particularly relevant to our work are asynchronous variants of SAGA. Leblond et al. (2018) developed a shared-parameter version in the spirit of HOGWILD!, while Glasgow & Wootters (2022) studied a distributed setting that employs full gradients, in contrast to our stochastic-gradient perspective.

A large body of recent work investigates asynchronous methods in federated learning (FL), where clients hold data from heterogeneous distributions. Notable contributions include works by Xie et al. (2019); Mishchenko et al. (2022); Koloskova et al. (2022); Wang et al. (2022a; b); Glasgow & Wootters (2022); Fraboni et al. (2023); Zhang et al. (2023); Wang et al. (2024); Islamov et al. (2024); Alahyane et al. (2025).

More broadly, Assran et al. (2020) provide a comprehensive survey of asynchronous optimization methods.

There is another line of work that began with Tyurin & Richtárik (2024), who established lower bounds for parallel methods and proposed optimal synchronous algorithms together with an asynchronous counterpart. Several follow-up papers extended this semi-asynchronous framework to other settings (Tyurin et al., 2024a; b; Tyurin & Richtárik, 2024; Maranjyan et al., 2025a).

Finally, beyond asynchronous approaches, several synchronous methods address system heterogeneity by adapting local training to worker speeds. The canonical method, FedAvg (McMahan et al., 2017), performs multiple local steps on each worker. Variants have adapted the number of local steps to match workers’ computation speeds (Li et al., 2020; Maranjyan et al., 2025c), effectively balancing task assignments across heterogeneous systems. More recently, Maranjyan et al. (2025b) proposed adapting the number of local steps dynamically, without requiring prior knowledge of worker speeds.

Appendix B Arbitrarily Changing Computation Times

In practice, the fixed computation model (2) is often not satisfied. The compute power of devices can vary over time due to temporary disconnections, hardware or network delays, fluctuations in processing capacity, or other transient effects (Maranjyan et al., 2025a).

In this section we extend our theory to the more general setting of arbitrarily varying computation times.

B.1 Universal Computation Model

To formalize this setting, we adopt the universal computation model introduced by Tyurin (2024).

For each worker i ∈ [n], we define a compute power function

p_{i}:\mathbb{R}_{+}\to\mathbb{R}_{+},

assumed nonnegative and continuous almost everywhere (countably many jumps allowed). For any T^2 ≥ T^1 ≥ 0, the number of stochastic gradients completed by worker i on [T^1, T^2] is

\#\text{gradients in }[T^{1},T^{2}]\;=\;\left\lfloor\int_{T^{1}}^{T^{2}}p_{i}(t)\,dt\right\rfloor.

Here, p_i(t) models the worker's time-varying computational ability: smaller values over an interval yield fewer completed gradients, and larger values yield more.

For instance, if worker i remains idle for the first T seconds and then becomes active, this corresponds to p_i(t) = 0 for t ≤ T and p_i(t) > 0 for t > T. More generally, p_i(t) may follow periodic or irregular patterns, leading to bursts of activity, pauses, or chaotic changes in compute power. The process p_i(t) may even be random, and all results hold conditional on the realized sample paths of {p_i}.

The universal computation model reduces to the fixed computation model (2) when p_i(t) = 1/τ_i for all t ≥ 0 and i ∈ [n]. In this case,

\#\text{gradients in }[T^{1},T^{2}]=\left\lfloor\frac{T^{2}-T^{1}}{\tau_{i}}\right\rfloor,

meaning that worker i computes one stochastic gradient after T^1 + τ_i seconds, two gradients after T^1 + 2τ_i seconds, and so on.
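As an illustration, the following Python sketch implements this counting rule with simple numerical integration (the helper is ours); with a constant p_i(t) = 1/τ_i it reproduces the fixed computation model.

import numpy as np

def completed_gradients(p, t1, t2, grid=100_000):
    """Number of gradients worker i finishes on [t1, t2]: floor of the integral of p(t)."""
    t = np.linspace(t1, t2, grid, endpoint=False)
    dt = (t2 - t1) / grid
    return int(np.floor(np.sum(p(t)) * dt))

tau_i = 3.0
fixed = lambda t: np.full_like(t, 1.0 / tau_i)             # fixed computation model p_i(t) = 1/tau_i
bursty = lambda t: np.where(t % 10 < 5, 0.0, 2.0 / tau_i)  # idle for 5 s, then twice as fast

print(completed_gradients(fixed, 0.0, 10.0))   # floor(10/3) = 3
print(completed_gradients(bursty, 0.0, 10.0))  # floor(5 * 2/3) = 3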

B.2 Toward an Optimal Method

In the general setting of arbitrarily varying computation times, Algorithm 1 is not optimal. To see why, consider the following adversarial timing pattern.

Suppose there are two workers. During one gradient computation by the slower worker, the faster worker computes s gradients. Immediately afterwards, they switch roles: the previously fast worker slows down by a factor of s, while the previously slow one speeds up by the same factor. This pattern repeats each time the slower worker finishes a gradient computation.

In this setting, if we run Algorithm 1, the server waits in each Phase 1 for a single gradient from every worker. Thus, the slower worker always contributes only one gradient, and the harmonic mean of the batch sizes satisfies

1\;\leq\;B^{k}\;\leq\;2.

From Section 5.1, the iteration complexity is

\mathcal{O}\left(\frac{nL\Delta}{\varepsilon}\left(1+\frac{\sigma^{2}}{Bn\varepsilon}\right)\right).

When σ²/(nε) is much larger than B, this dependence can be highly suboptimal.

Instead, suppose the server waits until one full round of the above process completes, collecting s + 1 gradients from each worker. Then the harmonic mean satisfies B^k ≥ s + 1, which can be arbitrarily larger than 2. Since in practice both s and σ²/(nε) can be very large, the naive strategy of waiting for only one gradient per worker (as in Algorithm 1) cannot be optimal in the arbitrary-time setting.

B.3 An Optimal Method

The solution is simple and follows directly from the iteration complexity bound. From

\mathcal{O}\left(\frac{nL\Delta}{\varepsilon}\left(1+\frac{\sigma^{2}}{Bn\varepsilon}\right)\right),

we see that to balance the terms it suffices to ensure

B\;\geq\;\frac{\sigma^{2}}{n\varepsilon}.

Accordingly, we modify the stopping condition in Phase 1 of Algorithm 1. Instead of requiring the server to receive at least one gradient from each worker, we require the stronger condition used in Malenia SGD, namely

\left(\frac{1}{n}\sum_{i=1}^{n}\frac{1}{b_{i}}\right)^{-1}\;\geq\;\max\left\{1,\frac{\sigma^{2}}{n\varepsilon}\right\}, \qquad (3)

where b_i is the number of gradients received from worker i.

In the low-noise regime, where σ²/(nε) ≤ 1, the condition reduces to requiring b_i ≥ 1 for all i, so the algorithm coincides with the original Algorithm 1. In the high-noise regime, the algorithm collects more gradients in Phase 1, ensuring that B is sufficiently large for optimal convergence.

With this change, Phase 1 of our algorithm matches that of Malenia SGD. The difference lies in Phase 2: our algorithm continues to use the ongoing gradient computations from all workers to perform nn updates, while Malenia SGD discards any unfinished gradients, performs a single update, and then proceeds to the next round.
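A minimal sketch of the modified Phase 1 stopping rule, using our own variable names and a plain counter of gradients received per worker:

def phase1_done(received, sigma2, n, eps):
    """Check condition (3): harmonic mean of per-worker gradient counts vs. max{1, sigma^2/(n*eps)}.

    received: list with received[i] = number of gradients obtained from worker i so far.
    """
    if any(b == 0 for b in received):
        return False  # harmonic mean is zero until every worker has contributed
    harmonic_mean = n / sum(1.0 / b for b in received)
    return harmonic_mean >= max(1.0, sigma2 / (n * eps))

# Example: n = 4 workers, high-noise regime where sigma^2/(n*eps) = 3.
print(phase1_done([1, 1, 1, 1], sigma2=12.0, n=4, eps=1.0))  # False: harmonic mean 1 < 3
print(phase1_done([3, 3, 4, 6], sigma2=12.0, n=4, eps=1.0))  # True: harmonic mean ~3.7 >= 3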

The following theorem establishes the time complexity of our algorithm under the universal computation model. {boxedtheorem} Under Assumptions 2.3, 2.3, and 2.3, let the stepsize in Ringleader ASGD be

\gamma=\frac{1}{10nL}.

Then, under the universal computation model, Ringleader ASGD finds an ε-stationary point within at most T^K seconds, where

K:=\left\lceil\frac{160L\Delta}{\varepsilon}\right\rceil,

and T^K denotes the K-th element of the recursively defined sequence

T^{k}=\min\left\{T\geq 0:\left(\frac{1}{n}\sum_{i=1}^{n}\left\lfloor\int_{T^{k-1}}^{T}p_{i}(t)\,dt\right\rfloor^{-1}\right)^{-1}\;\geq\;\max\left\{1,\frac{\sigma^{2}}{n\varepsilon}\right\}\right\},

for all k ≥ 1, with initialization T^0 = 0. This result matches the lower bound derived by Tyurin (2024), and therefore the proposed method is optimal.

Proof.

Under the condition in (3), each gradient-type step of the algorithm satisfies

B^{k}=\left(\frac{1}{n}\sum_{i=1}^{n}\frac{1}{b_{i}^{k}}\right)^{-1}\;\geq\;\max\left\{1,\frac{\sigma^{2}}{n\varepsilon}\right\}.

In Section 5.1, instead of using B we can substitute any valid lower bound. Here we choose

B=\max\left\{1,\frac{\sigma^{2}}{n\varepsilon}\right\}.

With this substitution, the iteration complexity becomes

K=\frac{80nL\Delta}{\varepsilon}.

To derive the time complexity, consider the time required to perform n iterations. Each block of n updates occurs in Phase 2 following the Phase 1 gradient collection. Starting from time T = 0, Phase 1 ends once the accumulated number of gradients satisfies condition (3), which occurs at time

T_{+}^{1}=\min\left\{T\geq 0:\left(\frac{1}{n}\sum_{i=1}^{n}\left\lfloor\int_{0}^{T}p_{i}(t)\,dt\right\rfloor^{-1}\right)^{-1}\;\geq\;\max\left\{1,\frac{\sigma^{2}}{n\varepsilon}\right\}\right\}.

After Phase 1, to complete n updates in Phase 2 we must wait for the ongoing computations to finish. This requires at most

T^{1}=\min\left\{T\geq 0:\left(\frac{1}{n}\sum_{i=1}^{n}\left\lfloor\int_{T_{+}^{1}}^{T}p_{i}(t)\,dt\right\rfloor^{-1}\right)^{-1}\;\geq\;1\right\}.

Thus, the total time to complete all K iterations is bounded by

T^{\left\lceil 2K/n\right\rceil},

where the sequence \{T^{k}\}_{k\geq 0} is defined recursively as

T^{k}=\min\left\{T\geq 0:\left(\frac{1}{n}\sum_{i=1}^{n}\left\lfloor\int_{T^{k-1}}^{T}p_{i}(t)\,dt\right\rfloor^{-1}\right)^{-1}\;\geq\;\max\left\{1,\frac{\sigma^{2}}{n\varepsilon}\right\}\right\},\qquad T^{0}=0. ∎
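To make the recursion concrete, here is a small self-contained Python sketch (our own helpers, with a crude forward time scan and numerical integration) that computes successive round boundaries T^1, T^2, … for given compute power functions; it illustrates the definition, not the actual server implementation.

import numpy as np

def completed_gradients(p, t1, t2, grid=2_000):
    """floor of the integral of p over [t1, t2]; the 1e-9 guards against float error at boundaries."""
    t = np.linspace(t1, t2, grid, endpoint=False)
    dt = (t2 - t1) / grid
    return int(np.floor(np.sum(p(t)) * dt + 1e-9))

def next_round_end(p_list, t_start, sigma2, n, eps, dt=0.25, t_max=1e4):
    """Smallest T >= t_start at which condition (3) holds for gradients accumulated on [t_start, T]."""
    target = max(1.0, sigma2 / (n * eps))
    t = t_start
    while t < t_max:
        t += dt
        counts = [completed_gradients(p, t_start, t) for p in p_list]
        if all(c > 0 for c in counts) and n / sum(1.0 / c for c in counts) >= target:
            return t
    raise RuntimeError("no round end found before t_max")

# Two workers under the fixed model p_i(t) = 1/tau_i with tau = (1, 4); here sigma^2/(n*eps) = 4.
p_list = [lambda t: np.full_like(t, 1.0), lambda t: np.full_like(t, 0.25)]
T = [0.0]
for _ in range(3):
    T.append(next_round_end(p_list, T[-1], sigma2=8.0, n=2, eps=1.0))
print(T)  # approximately [0, 12, 24, 36] for this configuration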

Appendix C Auxiliary Lemmas

Here we provide proofs of lemmas omitted from the main text, along with auxiliary results that will be used later.

C.1 Proof of Section 2.3

We begin with a lemma relating the different smoothness constants. {myboxed}

Lemma 2.3 (Smoothness Bounds).

Let L_f denote the smoothness constant of f, L_{f_i} the smoothness constant of f_i, and L the constant from Section 2.3. We have

L_{f}\;\leq\;L\;\leq\;\sqrt{\frac{1}{n}\sum_{i=1}^{n}L_{f_{i}}^{2}}\;\leq\;\max_{i\in[n]}L_{f_{i}}=:L_{\max}.

Moreover, if all f_i are identical, i.e., f_i = f for all i ∈ [n], then L = L_f.

Recall from Section 2.3 that we assumed the following generalized smoothness condition: for some constant L > 0 and for all x ∈ ℝ^d and y_1, …, y_n ∈ ℝ^d,

\left\|\nabla f(x)-\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(y_{i})\right\|^{2}\;\leq\;\frac{L^{2}}{n}\sum_{i=1}^{n}\left\|x-y_{i}\right\|^{2}. \qquad (4)

Recall that a function φ is called L_φ-smooth if

\left\|\nabla\phi(x)-\nabla\phi(y)\right\|\leq L_{\phi}\left\|x-y\right\|,\quad\forall x,y\in\mathbb{R}^{d}.

Here L_φ denotes the minimal such constant. We are ready to prove the lemma.

Proof.

For the first inequality, take y_1 = ⋯ = y_n = y. Then (4) reduces to

\left\|\nabla f(x)-\nabla f(y)\right\|^{2}\leq L^{2}\left\|x-y\right\|^{2},

so f is L-smooth. By definition of L_f as the minimal smoothness constant, this implies L_f ≤ L.

For the second inequality, by the triangle inequality, then by the smoothness of each f_i, and finally by Cauchy–Schwarz,

\left\|\nabla f(x)-\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(y_{i})\right\| \leq \frac{1}{n}\sum_{i=1}^{n}\left\|\nabla f_{i}(x)-\nabla f_{i}(y_{i})\right\| \leq \frac{1}{n}\sum_{i=1}^{n}L_{f_{i}}\left\|x-y_{i}\right\| \leq \sqrt{\frac{1}{n}\sum_{i=1}^{n}L_{f_{i}}^{2}}\,\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left\|x-y_{i}\right\|^{2}}.

Squaring both sides shows that (4) holds with L=\sqrt{\frac{1}{n}\sum_{i=1}^{n}L_{f_{i}}^{2}}.

Finally, suppose all f_i are identical: f_i ≡ f for all i. Then

\left\|\nabla f(x)-\frac{1}{n}\sum_{i=1}^{n}\nabla f(y_{i})\right\| \leq \frac{1}{n}\sum_{i=1}^{n}\left\|\nabla f(x)-\nabla f(y_{i})\right\| \leq \frac{L_{f}}{n}\sum_{i=1}^{n}\left\|x-y_{i}\right\| \leq L_{f}\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left\|x-y_{i}\right\|^{2}},

where the last step uses Cauchy–Schwarz. Squaring both sides yields

\left\|\nabla f(x)-\frac{1}{n}\sum_{i=1}^{n}\nabla f(y_{i})\right\|^{2}\leq\frac{L_{f}^{2}}{n}\sum_{i=1}^{n}\left\|x-y_{i}\right\|^{2},

i.e., (4) holds with L ≤ L_f. Combined with L_f ≤ L, we conclude L = L_f. ∎
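These bounds are easy to check numerically on quadratics, where smoothness constants are spectral norms. The sketch below assumes f_i(x) = ½ xᵀA_i x with symmetric positive semidefinite A_i, so that L_{f_i} = ‖A_i‖ and L_f = ‖(1/n)Σ_i A_i‖; it verifies the outer chain L_f ≤ sqrt((1/n)Σ_i L_{f_i}²) ≤ max_i L_{f_i} and does not attempt to compute the intermediate constant L.

import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 10

# Random symmetric PSD matrices A_i; for f_i(x) = 0.5 * x^T A_i x, L_{f_i} = ||A_i||_2.
A = [(lambda M: M @ M.T)(rng.standard_normal((d, d))) for _ in range(n)]
L_fi = np.array([np.linalg.norm(Ai, 2) for Ai in A])
L_f = np.linalg.norm(sum(A) / n, 2)      # smoothness constant of the average f

rms = np.sqrt(np.mean(L_fi ** 2))
print(f"L_f = {L_f:.2f} <= sqrt(mean L_fi^2) = {rms:.2f} <= max L_fi = {L_fi.max():.2f}")
assert L_f <= rms + 1e-9 and rms <= L_fi.max() + 1e-9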

C.2 Variance Term

The following lemma bounds the variance of the gradient estimator in Ringleader ASGD. {boxedlemma}[Variance Bound] Under Section 2.3, the following variance-type inequality holds for the gradient estimator used in Algorithm 1:

\mathbb{E}\left[\left\|\bar{g}^{k}-\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|^{2}\right]\leq\frac{\sigma^{2}}{B^{k}n}.
Proof.

Recall that the gradient estimator is defined as

\bar{g}^{k}=\frac{1}{n}\sum_{i=1}^{n}\bar{g}_{i}^{k}=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{b_{i}^{k}}\sum_{j=1}^{b_{i}^{k}}g_{i}^{k,j},\qquad\text{where}\quad g_{i}^{k,j}=\nabla f_{i}\left(x^{k-\delta_{i}^{k}};\xi_{i}^{k-\delta_{i}^{k},j}\right).

Let \mathcal{F}^{k} denote the sigma-field containing all randomness up to the start of the current round, i.e., up to iteration k − (k mod n). Conditioning on \mathcal{F}^{k}, the evaluation points x^{k-\delta_{i}^{k}} are deterministic, and the stochastic gradients g_{i}^{k,j} are independent across both workers i and samples j.

Using the law of total expectation and the independence of stochastic gradients, we have

𝔼[g¯k1ni=1nfi(xkδik)2]\displaystyle\mathbb{E}\left[\left\|\bar{g}^{k}-\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|^{2}\right] =𝔼[𝔼[g¯k1ni=1nfi(xkδik)2|k]]\displaystyle=\mathbb{E}\left[{\mathbb{E}}\left[\left.\left\|\bar{g}^{k}-\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|^{2}\ \right|\ \mathcal{F}^{k}\right]\right]
=𝔼[1n2i=1n𝔼[g¯ikfi(xkδik)2|k]].\displaystyle=\mathbb{E}\left[\frac{1}{n^{2}}\sum_{i=1}^{n}{\mathbb{E}}\left[\left.\left\|\bar{g}_{i}^{k}-\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|^{2}\ \right|\ \mathcal{F}^{k}\right]\right].

For each worker ii, the conditional variance of the minibatch gradient estimator is

𝔼[g¯ikfi(xkδik)2|k]\displaystyle{\mathbb{E}}\left[\left.\left\|\bar{g}_{i}^{k}-\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|^{2}\ \right|\ \mathcal{F}^{k}\right] =𝔼[1bikj=1bikgik,jfi(xkδik)2|k]\displaystyle={\mathbb{E}}\left[\left.\left\|\frac{1}{b_{i}^{k}}\sum_{j=1}^{b_{i}^{k}}g_{i}^{k,j}-\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|^{2}\ \right|\ \mathcal{F}^{k}\right]
=1bik𝔼[gik,1fi(xkδik)2|k]σ2bik,\displaystyle=\frac{1}{b_{i}^{k}}{\mathbb{E}}\left[\left.\left\|g_{i}^{k,1}-\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|^{2}\ \right|\ \mathcal{F}^{k}\right]\leq\frac{\sigma^{2}}{b_{i}^{k}}~,

where the last inequality follows from Section 2.3.

Combining these results, we get

𝔼[g¯k1ni=1nfi(xkδik)2]\displaystyle\mathbb{E}\left[\left\|\bar{g}^{k}-\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|^{2}\right] 1n2i=1nσ2bik=σ2n1ni=1n1bik=σ2Bkn,\displaystyle\leq\frac{1}{n^{2}}\sum_{i=1}^{n}\frac{\sigma^{2}}{b_{i}^{k}}=\frac{\sigma^{2}}{n}\frac{1}{n}\sum_{i=1}^{n}\frac{1}{b_{i}^{k}}=\frac{\sigma^{2}}{B^{k}n}~,

where the last equality uses the definition of the harmonic mean Bk=(1ni=1n1bik)1B^{k}=\left(\frac{1}{n}\sum_{i=1}^{n}\frac{1}{b_{i}^{k}}\right)^{-1}. ∎
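A quick Monte Carlo illustration of this bound with synthetic scalar Gaussian noise of variance σ² per sample (all names are ours): the empirical variance of the averaged per-worker minibatch means matches σ²/(B^k n), where B^k is the harmonic mean of the minibatch sizes.

import numpy as np

rng = np.random.default_rng(1)
n, sigma2, trials = 4, 2.0, 200_000
b = np.array([1, 2, 4, 8])                      # minibatch sizes b_i^k collected in the table
B = n / np.sum(1.0 / b)                         # harmonic mean B^k

# Each worker's estimator error is the mean of b_i zero-mean samples with variance sigma^2;
# the error of g_bar is the average of these n per-worker errors.
errors = np.stack([rng.normal(0.0, np.sqrt(sigma2), size=(trials, bi)).mean(axis=1) for bi in b])
g_bar_err = errors.mean(axis=0)

print(f"empirical variance  = {g_bar_err.var():.5f}")
print(f"sigma^2 / (B^k * n) = {sigma2 / (B * n):.5f}")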

C.3 Proof of Section 5.1

We now prove the descent lemma. {myboxed}

Lemma 5.1 (Descent Lemma).

Under Assumptions 2.3 and 2.3, if the stepsize in Algorithm 1 satisfies γ ≤ 1/(4L), then the following inequality holds:

\mathbb{E}\left[f\left(x^{k+1}\right)\right] \leq \mathbb{E}\left[f\left(x^{k}\right)\right] - \frac{\gamma}{2}\mathbb{E}\left[\left\|\nabla f\left(x^{k}\right)\right\|^{2}\right] - \frac{\gamma}{4}\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|^{2}\right]
\quad + \frac{\gamma L^{2}}{2n}\sum_{i=1}^{n}\mathbb{E}\left[\left\|x^{k}-x^{k-\delta_{i}^{k}}\right\|^{2}\right] + \frac{3\gamma^{2}L\sigma^{2}}{2B}
\quad + \gamma^{2}L\sum_{\ell=k-(k\bmod n)}^{k-1}\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{\ell-\delta_{i}^{\ell}}\right)\right\|^{2}\right].
Proof.

Some proof techniques are adapted from the works of Maranjyan et al. (2025d) and Wang et al. (2025).

From Section 2.3 and Section 2.3, we know that f is L-smooth. Therefore, the following standard inequality holds (Nesterov, 2018):

\mathbb{E}\left[f(x^{k+1})\right]\leq\mathbb{E}\left[f(x^{k})-\gamma\left\langle\nabla f(x^{k}),\bar{g}^{k}\right\rangle+\frac{L\gamma^{2}}{2}\left\|\bar{g}^{k}\right\|^{2}\right]. \qquad (5)

Recall that the gradient estimator is defined as

\bar{g}^{k}=\frac{1}{n}\sum_{i=1}^{n}\bar{g}_{i}^{k}=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{b_{i}^{k}}\sum_{j=1}^{b_{i}^{k}}g_{i}^{k,j}.

Let \mathcal{F}^{k} denote the sigma-field containing all randomness up to the start of the current Phase 2, i.e., up to iteration k − (k mod n). A key observation is that all gradients in the current gradient table were computed and received during the current round. Since these gradients were computed at points from previous iterations within the current round, we have k − δ_i^k ≤ k − (k mod n) for all i ∈ [n]. Conditioning on \mathcal{F}^{k}, the points x^{k-\delta_{i}^{k}} are deterministic. Therefore, we can compute the conditional expectation of the gradient estimator:

𝔼[g¯k|k]=1ni=1n1bikj=1bik𝔼[gik,j|k]=1ni=1nfi(xkδik).{\mathbb{E}}\left[\left.\bar{g}^{k}\ \right|\ \mathcal{F}^{k}\right]=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{b_{i}^{k}}\sum_{j=1}^{b_{i}^{k}}{\mathbb{E}}\left[\left.g_{i}^{k,j}\ \right|\ \mathcal{F}^{k}\right]=\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right).

The last equality follows from the unbiasedness of the stochastic gradient estimator (Section 2.3).

Using this conditional expectation and the law of total expectation, we can now simplify the inner product term in (5):

𝔼[f(xk),g¯k]\displaystyle\mathbb{E}\left[\left\langle\nabla f\left(x^{k}\right),\bar{g}^{k}\right\rangle\right] =𝔼[f(xk),1ni=1nfi(xkδik)]\displaystyle=\mathbb{E}\left[\left\langle\nabla f\left(x^{k}\right),\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\rangle\right]
+𝔼[f(xk),g¯k1ni=1nfi(xkδik)]\displaystyle\quad+\mathbb{E}\left[\left\langle\nabla f\left(x^{k}\right),\bar{g}^{k}-\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\rangle\right]
=𝔼[f(xk),1ni=1nfi(xkδik)]\displaystyle=\mathbb{E}\left[\left\langle\nabla f\left(x^{k}\right),\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\rangle\right]
+𝔼[f(xk)f(xk(kmodn)),g¯k1ni=1nfi(xkδik)]\displaystyle\quad+\mathbb{E}\left[\left\langle\nabla f\left(x^{k}\right)-\nabla f\left(x^{k-(k\bmod n)}\right),\bar{g}^{k}-\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\rangle\right]
+𝔼[f(xk(kmodn)),g¯k1ni=1nfi(xkδik)]\displaystyle\qquad+\mathbb{E}\left[\left\langle\nabla f\left(x^{k-(k\bmod n)}\right),\bar{g}^{k}-\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\rangle\right]
=𝔼[f(xk),1ni=1nfi(xkδik)]T1\displaystyle=\underbrace{\mathbb{E}\left[\left\langle\nabla f\left(x^{k}\right),\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\rangle\right]}_{T_{1}}
+𝔼[f(xk)f(xk(kmodn)),g¯k1ni=1nfi(xkδik)]T2.\displaystyle\quad+\underbrace{\mathbb{E}\left[\left\langle\nabla f\left(x^{k}\right)-\nabla f\left(x^{k-(k\bmod n)}\right),\bar{g}^{k}-\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\rangle\right]}_{T_{2}}.

Next, using Section 2.3, we have

2T1\displaystyle 2T_{1} =𝔼[2f(xk),1ni=1nfi(xkδik)]\displaystyle=\mathbb{E}\left[2\left\langle\nabla f\left(x^{k}\right),\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\rangle\right]
=𝔼[f(xk)2+1ni=1nfi(xkδik)2]𝔼[f(xk)1ni=1nfi(xkδik)2]\displaystyle=\mathbb{E}\left[\left\|\nabla f\left(x^{k}\right)\right\|^{2}+\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|^{2}\right]-\mathbb{E}\left[\left\|\nabla f\left(x^{k}\right)-\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|^{2}\right]
𝔼[f(xk)2]+𝔼[1ni=1nfi(xkδik)2]L2ni=1n𝔼[xkxkδik2].\displaystyle\geq\mathbb{E}\left[\left\|\nabla f\left(x^{k}\right)\right\|^{2}\right]+\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|^{2}\right]-\frac{L^{2}}{n}\sum_{i=1}^{n}\mathbb{E}\left[\left\|x^{k}-x^{k-\delta_{i}^{k}}\right\|^{2}\right].

Next, we analyze T2T_{2}

T2\displaystyle T_{2} =𝔼[f(xk)f(xk(kmodn)),g¯k1ni=1nfi(xkδik)]\displaystyle=\mathbb{E}\left[\left\langle\nabla f\left(x^{k}\right)-\nabla f\left(x^{k-(k\bmod n)}\right),\bar{g}^{k}-\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\rangle\right]
𝔼[f(xk)f(xk(kmodn))g¯k1ni=1nfi(xkδik)]\displaystyle\geq-\mathbb{E}\left[\left\|\nabla f\left(x^{k}\right)-\nabla f\left(x^{k-(k\bmod n)}\right)\right\|\left\|\bar{g}^{k}-\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|\right]
L𝔼[xkxk(kmodn)g¯k1ni=1nfi(xkδik)]\displaystyle\geq-L\mathbb{E}\left[\left\|x^{k}-x^{k-(k\bmod n)}\right\|\left\|\bar{g}^{k}-\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|\right]
=L𝔼[γ=k(kmodn)k1g¯g¯k1ni=1nfi(xkδik)]\displaystyle=-L\mathbb{E}\left[\left\|\gamma\sum_{\ell=k-(k\bmod n)}^{k-1}\bar{g}^{\ell}\right\|\left\|\bar{g}^{k}-\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|\right]
Lγ=k(kmodn)k1𝔼[g¯g¯k1ni=1nfi(xkδik)]\displaystyle\geq-L\gamma\sum_{\ell=k-(k\bmod n)}^{k-1}\mathbb{E}\left[\left\|\bar{g}^{\ell}\right\|\left\|\bar{g}^{k}-\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|\right]
Lγ=k(kmodn)k112(𝔼[g¯2]+𝔼[g¯k1ni=1nfi(xkδik)2])\displaystyle\geq-L\gamma\sum_{\ell=k-(k\bmod n)}^{k-1}\frac{1}{2}\left(\mathbb{E}\left[\left\|\bar{g}^{\ell}\right\|^{2}\right]+\mathbb{E}\left[\left\|\bar{g}^{k}-\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|^{2}\right]\right)
Lγ2=k(kmodn)k1𝔼[g¯2](kmodn)Lγσ22Bkn\displaystyle\geq-\frac{L\gamma}{2}\sum_{\ell=k-(k\bmod n)}^{k-1}\mathbb{E}\left[\left\|\bar{g}^{\ell}\right\|^{2}\right]-(k\bmod n)\frac{L\gamma\sigma^{2}}{2B^{k}n}
Lγ2=k(kmodn)k1𝔼[g¯2]Lγσ22Bk.\displaystyle\geq-\frac{L\gamma}{2}\sum_{\ell=k-(k\bmod n)}^{k-1}\mathbb{E}\left[\left\|\bar{g}^{\ell}\right\|^{2}\right]-\frac{L\gamma\sigma^{2}}{2B^{k}}~.

The inequalities follow from the Cauchy–Schwarz inequality, the L-smoothness of f, the triangle inequality, Young's inequality, Section C.2, and finally (k mod n) ≤ n − 1 < n.

It remains to bound the term \mathbb{E}\left[\left\|\bar{g}^{k}\right\|^{2}\right]. Using Young's inequality, we have

𝔼[g¯k2]\displaystyle\mathbb{E}\left[\left\|\bar{g}^{k}\right\|^{2}\right] 2𝔼[1ni=1nfi(xkδik)2]+2𝔼[g¯k1ni=1nfi(xkδik)2]\displaystyle\leq 2\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|^{2}\right]+2\mathbb{E}\left[\left\|\bar{g}^{k}-\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|^{2}\right]
2𝔼[1ni=1nfi(xkδik)2]+2σ2Bkn,\displaystyle\leq 2\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|^{2}\right]+\frac{2\sigma^{2}}{B^{k}n}~,

where in the last step we used Section C.2.

Now, by combining all terms in (5), we obtain

𝔼[f(xk+1)]\displaystyle\mathbb{E}\left[f\left(x^{k+1}\right)\right] 𝔼[f(xk)]γ2𝔼[f(xk)2]γ2𝔼[1ni=1nfi(xkδik)2]\displaystyle\leq\mathbb{E}\left[f\left(x^{k}\right)\right]-\frac{\gamma}{2}\mathbb{E}\left[\left\|\nabla f\left(x^{k}\right)\right\|^{2}\right]-\frac{\gamma}{2}\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|^{2}\right]
+γL22ni=1n𝔼[xkxkδik2]\displaystyle\quad+\frac{\gamma L^{2}}{2n}\sum_{i=1}^{n}\mathbb{E}\left[\left\|x^{k}-x^{k-\delta_{i}^{k}}\right\|^{2}\right]
+γ2L2=k(kmodn)k1𝔼[g¯2]+γ2Lσ22Bk+γ2L2𝔼[g¯k2]\displaystyle\quad+\frac{\gamma^{2}L}{2}\sum_{\ell=k-(k\bmod n)}^{k-1}\mathbb{E}\left[\left\|\bar{g}^{\ell}\right\|^{2}\right]+\frac{\gamma^{2}L\sigma^{2}}{2B^{k}}+\frac{\gamma^{2}L}{2}\mathbb{E}\left[\left\|\bar{g}^{k}\right\|^{2}\right]
𝔼[f(xk)]γ2𝔼[f(xk)2]γ2𝔼[1ni=1nfi(xkδik)2]\displaystyle\leq\mathbb{E}\left[f\left(x^{k}\right)\right]-\frac{\gamma}{2}\mathbb{E}\left[\left\|\nabla f\left(x^{k}\right)\right\|^{2}\right]-\frac{\gamma}{2}\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|^{2}\right]
+γL22ni=1n𝔼[xkxkδik2]\displaystyle\quad+\frac{\gamma L^{2}}{2n}\sum_{i=1}^{n}\mathbb{E}\left[\left\|x^{k}-x^{k-\delta_{i}^{k}}\right\|^{2}\right]
+γ2L=k(kmodn)k1𝔼[1ni=1nfi(xδi)2]\displaystyle\quad+\gamma^{2}L\sum_{\ell=k-(k\bmod n)}^{k-1}\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{\ell-\delta_{i}^{\ell}}\right)\right\|^{2}\right]
+γ2L𝔼[1ni=1nfi(xkδik)2]\displaystyle\quad+\gamma^{2}L\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}\left(x^{k-\delta_{i}^{k}}\right)\right\|^{2}\right]
+γ2L=k(kmodn)k1σ2Bn+γ2Lσ22Bk+γ2Lσ2Bkn.\displaystyle\quad+\gamma^{2}L\sum_{\ell=k-(k\bmod n)}^{k-1}\frac{\sigma^{2}}{B^{\ell}n}+\frac{\gamma^{2}L\sigma^{2}}{2B^{k}}+\frac{\gamma^{2}L\sigma^{2}}{B^{k}n}~.

This completes the proof, using the stepsize condition γ ≤ 1/(4L) and the definition B := inf_{k≥0} B^k. ∎

C.4 Proof of Section 5.1

The following lemma provides an upper bound on the residual error due to delays.

Lemma 5.1 (Residual Estimation).

Under the assumptions of Section 2.3, the iterates of Ringleader ASGD (Algorithm 1) with stepsize $\gamma\leq\nicefrac{1}{4nL}$ satisfy the following bound:

\[
\frac{1}{K}\sum_{k=0}^{K-1}\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\left[\left\|x^{k}-x^{k-\delta_{i}^{k}}\right\|^{2}\right]\leq\frac{2\gamma n}{LK}\sum_{k=0}^{K-1}\mathbb{E}\left[\left\|\frac{1}{n}\sum_{j=1}^{n}\nabla f_{j}\left(x^{k-\delta_{j}^{k}}\right)\right\|^{2}\right]+\frac{2\gamma\sigma^{2}}{LB}~.
\]
Proof.

By Young’s inequality, we have

\begin{align*}
\mathbb{E}\left[\left\|x^{k}-x^{k-\delta_{i}^{k}}\right\|^{2}\right]
&=\mathbb{E}\left[\left\|\gamma\sum_{\ell=k-\delta_{i}^{k}}^{k-1}\bar{g}^{\ell}\right\|^{2}\right]\\
&\leq 2\gamma^{2}\mathbb{E}\left[\left\|\sum_{\ell=k-\delta_{i}^{k}}^{k-1}\frac{1}{n}\sum_{j=1}^{n}\nabla f_{j}\left(x^{\ell-\delta_{j}^{\ell}}\right)\right\|^{2}\right]\\
&\quad+2\gamma^{2}\mathbb{E}\left[\left\|\sum_{\ell=k-\delta_{i}^{k}}^{k-1}\left(\bar{g}^{\ell}-\frac{1}{n}\sum_{j=1}^{n}\nabla f_{j}\left(x^{\ell-\delta_{j}^{\ell}}\right)\right)\right\|^{2}\right]\\
&\leq 2\gamma^{2}\underbrace{\delta_{i}^{k}\sum_{\ell=k-\delta_{i}^{k}}^{k-1}\mathbb{E}\left[\left\|\frac{1}{n}\sum_{j=1}^{n}\nabla f_{j}\left(x^{\ell-\delta_{j}^{\ell}}\right)\right\|^{2}\right]}_{T_{ik}}+2\gamma^{2}(\delta_{i}^{k})^{2}\frac{\sigma^{2}}{Bn}~.
\end{align*}

In the last inequality, we used Jensen’s inequality and Section C.2.
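Concretely, the Jensen step is
\[
\left\|\sum_{\ell=1}^{m}a_{\ell}\right\|^{2}\;\leq\;m\sum_{\ell=1}^{m}\left\|a_{\ell}\right\|^{2},
\]
applied with $m=\delta_{i}^{k}$ to each of the two sums; Section C.2 then bounds each summand of the second term by $\nicefrac{\sigma^{2}}{B^{\ell}n}\leq\nicefrac{\sigma^{2}}{Bn}$, giving the factor $(\delta_{i}^{k})^{2}\,\nicefrac{\sigma^{2}}{Bn}$.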

Next, we estimate the sum of $T_{ik}$:

\begin{align*}
\sum_{k=0}^{K-1}\frac{1}{n}\sum_{i=1}^{n}T_{ik}
&=\sum_{k=0}^{K-1}\frac{1}{n}\sum_{i=1}^{n}\delta_{i}^{k}\sum_{\ell=k-\delta_{i}^{k}}^{k-1}\mathbb{E}\left[\left\|\frac{1}{n}\sum_{j=1}^{n}\nabla f_{j}\left(x^{\ell-\delta_{j}^{\ell}}\right)\right\|^{2}\right]\\
&=\frac{1}{n}\sum_{i=1}^{n}\sum_{k=0}^{K-1}\delta_{i}^{k}\sum_{\ell=k-\delta_{i}^{k}}^{k-1}\mathbb{E}\left[\left\|\frac{1}{n}\sum_{j=1}^{n}\nabla f_{j}\left(x^{\ell-\delta_{j}^{\ell}}\right)\right\|^{2}\right]\\
&\leq 2\sum_{i=1}^{n}\sum_{k=0}^{K-1}\sum_{\ell=k-\delta_{i}^{k}}^{k-1}\mathbb{E}\left[\left\|\frac{1}{n}\sum_{j=1}^{n}\nabla f_{j}\left(x^{\ell-\delta_{j}^{\ell}}\right)\right\|^{2}\right]\\
&\leq 2\sum_{i=1}^{n}\delta_{i}^{\max}\sum_{k=0}^{K-1}\mathbb{E}\left[\left\|\frac{1}{n}\sum_{j=1}^{n}\nabla f_{j}\left(x^{k-\delta_{j}^{k}}\right)\right\|^{2}\right]\\
&\leq 4n^{2}\sum_{k=0}^{K-1}\mathbb{E}\left[\left\|\frac{1}{n}\sum_{j=1}^{n}\nabla f_{j}\left(x^{k-\delta_{j}^{k}}\right)\right\|^{2}\right].
\end{align*}

In the first and last inequalities, we used the bound $\delta_{i}^{\max}\leq 2n$ from Section 4.2. Finally, applying the stepsize condition $\gamma\leq\nicefrac{1}{4nL}$ yields the result. ∎
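For completeness, the final step can be spelled out as follows. Averaging the bound on $\mathbb{E}\left[\left\|x^{k}-x^{k-\delta_{i}^{k}}\right\|^{2}\right]$ over $i$ and $k$ and using $\delta_{i}^{k}\leq 2n$ gives
\begin{align*}
\frac{1}{K}\sum_{k=0}^{K-1}\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\left[\left\|x^{k}-x^{k-\delta_{i}^{k}}\right\|^{2}\right]
&\leq\frac{2\gamma^{2}}{K}\sum_{k=0}^{K-1}\frac{1}{n}\sum_{i=1}^{n}T_{ik}+2\gamma^{2}(2n)^{2}\frac{\sigma^{2}}{Bn}\\
&\leq\frac{8\gamma^{2}n^{2}}{K}\sum_{k=0}^{K-1}\mathbb{E}\left[\left\|\frac{1}{n}\sum_{j=1}^{n}\nabla f_{j}\left(x^{k-\delta_{j}^{k}}\right)\right\|^{2}\right]+\frac{8\gamma^{2}n\sigma^{2}}{B},
\end{align*}
and $\gamma\leq\nicefrac{1}{4nL}$ implies $8\gamma^{2}n^{2}\leq\nicefrac{2\gamma n}{L}$ and $8\gamma^{2}n\leq\nicefrac{2\gamma}{L}$, which matches the constants in Lemma 5.1.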

Appendix D Time Complexity of IA2SGD

The iteration complexity of IA2SGD (Wang et al., 2025) is

\[
K=\mathcal{O}\!\left(\frac{\delta^{\max}L\Delta}{\varepsilon}\left(1+\frac{\sigma^{2}}{n\varepsilon}\right)\right).
\]

We now analyze the corresponding wall-clock time under the fixed computation model (2). Since the algorithm performs an update whenever a single worker finishes a computation, we seek the minimal time $T$ such that

\[
\sum_{i=1}^{n}\left\lfloor\frac{T}{\tau_{i}}\right\rfloor\;\geq\;K~.
\]

Observe that

\[
\sum_{i=1}^{n}\frac{T}{\tau_{i}}\;\geq\;\sum_{i=1}^{n}\left\lfloor\frac{T}{\tau_{i}}\right\rfloor.
\]

Hence, if we define $T^{\prime}$ by

\[
\sum_{i=1}^{n}\frac{T^{\prime}}{\tau_{i}}\;=\;K~,
\]

then

\[
T^{\prime}=\left(\sum_{i=1}^{n}\frac{1}{\tau_{i}}\right)^{-1}K~.
\]

It follows that the minimal time $T$ satisfies $T\geq T^{\prime}$.
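As a quick numerical sanity check of this reasoning, the following sketch (with hypothetical values for the $\tau_{i}$ and $K$) computes the exact minimal $T$ satisfying the floor-sum condition and compares it with the estimate $T^{\prime}$.
\begin{verbatim}
def min_time(taus, K):
    """Smallest T with sum_i floor(T / tau_i) >= K, i.e. the first
    wall-clock time at which the workers have jointly produced K
    gradients. The floor-sum only changes at multiples of the tau_i,
    so scanning those event times in increasing order finds it."""
    events = sorted({m * t for t in taus for m in range(1, K + 1)})
    for T in events:
        if sum(int(T // t) for t in taus) >= K:
            return T

taus = [1.0, 2.0, 5.0]   # hypothetical per-gradient times
K = 100                  # hypothetical number of required updates
T = min_time(taus, K)
T_prime = K / sum(1.0 / t for t in taus)  # T' = (sum_i 1/tau_i)^(-1) K
print(T, T_prime)        # e.g. 60.0 and ~58.82; in particular T >= T'
\end{verbatim}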

It remains to bound $\delta^{\max}$. At initialization, all workers start computing their first gradients simultaneously. By the time the slowest worker completes its first gradient (at time $\tau_{n}$), the other workers may each have completed multiple gradients. In particular,

\[
\delta^{\max}\;\geq\;\sum_{i=1}^{n}\left\lfloor\frac{\tau_{n}}{\tau_{i}}\right\rfloor.
\]

Combining this with the iteration complexity bound, we obtain that the total runtime satisfies

\[
T\;\geq\;c\times\frac{\tau_{n}L\Delta}{\varepsilon}\left(1+\frac{\sigma^{2}}{n\varepsilon}\right),
\]

for some universal constant $c>0$.
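For completeness, one way to see how the bounds on $T^{\prime}$, $K$, and $\delta^{\max}$ combine into this estimate, treating the iteration-complexity expression as the number of updates actually performed and using $\lfloor x\rfloor\geq\nicefrac{x}{2}$ for $x\geq 1$, is
\[
T\;\geq\;T^{\prime}\;=\;\left(\sum_{i=1}^{n}\frac{1}{\tau_{i}}\right)^{-1}K
\;\gtrsim\;\left(\sum_{i=1}^{n}\frac{1}{\tau_{i}}\right)^{-1}\delta^{\max}\frac{L\Delta}{\varepsilon}\left(1+\frac{\sigma^{2}}{n\varepsilon}\right)
\;\geq\;\left(\sum_{i=1}^{n}\frac{1}{\tau_{i}}\right)^{-1}\left(\sum_{i=1}^{n}\frac{\tau_{n}}{2\tau_{i}}\right)\frac{L\Delta}{\varepsilon}\left(1+\frac{\sigma^{2}}{n\varepsilon}\right)
\;=\;\frac{\tau_{n}L\Delta}{2\varepsilon}\left(1+\frac{\sigma^{2}}{n\varepsilon}\right).
\]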

Note that the expression above should not be viewed as an exact upper bound on the runtime. It is better understood as a simplified estimate of $T$, which is sufficient for our purposes and provides a cleaner basis for comparison.

Appendix E Improved Malenia SGD

Malenia SGD has the following iteration complexity (Tyurin & Richtárik, 2024)

\[
K\;\geq\;\frac{12\Delta L_{f}}{\varepsilon}\;+\;\frac{12\Delta L_{f}\sigma^{2}}{\varepsilon^{2}nS}~,
\]

where $S$ is a lower bound on the harmonic mean of the batch sizes, i.e.,

\[
\left(\frac{1}{n}\sum_{i=1}^{n}\frac{1}{b_{i}^{k}}\right)^{-1}\;\geq\;S~,
\]

for all iterations $k$. In the original Malenia SGD analysis (Tyurin & Richtárik, 2024), this bound follows from the condition in (3), which fixes the same value of $S$ across all iterations.

In the fixed-time regime (2), however, this condition is no longer necessary. By adopting the same strategy as Ringleader ASGD (Algorithm 1), namely waiting for at least one gradient from each worker, we effectively replace $S$ with $B$ in the rate. This yields the following time complexity:

\[
\tau_{n}K\;=\;\frac{12\tau_{n}\Delta L_{f}}{\varepsilon}\left(1+\frac{\sigma^{2}}{\varepsilon nB}\right).
\]

Substituting the expression for $B$ from Section 5.2 and proceeding as in the proof of Section 5.2, we obtain the same overall time complexity as before, this time without requiring condition (3), which depends on knowing $\sigma$ and fixing $\varepsilon$ in advance.
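To illustrate the batch accounting behind this replacement, the following sketch (with hypothetical $\tau_{i}$ under the fixed computation-time model) computes the per-worker batch sizes collected while waiting for at least one gradient from every worker, together with their harmonic mean, the quantity that $S$, and in this variant $B$, lower-bounds.
\begin{verbatim}
def round_batches(taus):
    """Simplified accounting for one collection round under the fixed-time
    model: the round lasts tau_n (the slowest worker's time), during which
    worker i is assumed to compute continuously and hence contributes
    floor(tau_n / tau_i) >= 1 gradients."""
    tau_n = max(taus)
    return [int(tau_n // t) for t in taus]

def harmonic_mean(batches):
    """(1/n * sum_i 1/b_i)^(-1), the harmonic mean of the batch sizes."""
    return len(batches) / sum(1.0 / b for b in batches)

taus = [1.0, 1.5, 4.0, 10.0]   # hypothetical per-gradient times
b = round_batches(taus)        # [10, 6, 2, 1]
print(b, harmonic_mean(b))     # harmonic mean is about 2.26 here
\end{verbatim}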

Finally, note that this improvement is only valid in the fixed-time regime. In the setting with arbitrarily varying computation times, the same optimization cannot be applied, for the same reasons discussed for Ringleader ASGD in Appendix B.