Bridging Discrete and Continuous RL: Stable Deterministic Policy Gradient with Martingale Characterization

Ziheng Cheng University of California, Berkeley. Email: ziheng_cheng@berkeley.edu    Xin Guo University of California, Berkeley. Email: xinguo@berkeley.edu    Yufei Zhang Imperial College London. Email: yufei.zhang@imperial.ac.uk
Abstract

The theory of discrete-time reinforcement learning (RL) has advanced rapidly over the past decades. Although primarily designed for discrete environments, many real-world RL applications are inherently continuous and complex. A major challenge in extending discrete-time algorithms to continuous-time settings is their sensitivity to time discretization, often leading to poor stability and slow convergence. In this paper, we investigate deterministic policy gradient methods for continuous-time RL. We derive a continuous-time policy gradient formula based on an analogue of the advantage function and establish its martingale characterization. This theoretical foundation leads to our proposed algorithm, CT-DDPG, which enables stable learning with deterministic policies in continuous-time environments. Numerical experiments show that the proposed CT-DDPG algorithm offers improved stability and faster convergence compared to existing discrete-time and continuous-time methods, across a wide range of control tasks with varying time discretizations and noise levels.

1 Introduction

Deep reinforcement learning (RL) has achieved remarkable success over the past decade, powered by theoretical advances and the success of algorithms in discrete-time systems such as Atari, Go, and large language models [Mnih et al., 2013; Silver et al., 2016; Guo et al., 2025]. However, many real-world problems, such as robotic control, autonomous driving, and financial trading, are inherently continuous in time. In these domains, agents need to interact with the environment at an ultra-high frequency, underscoring the need for continuous-time RL approaches [Wang et al., 2020].

One major challenge in applying discrete-time RL to continuous-time environments is the sensitivity to the discretization step size. As the step size decreases, standard algorithms often degrade, resulting in exploding variance, poor stability, and slow convergence. While several works have attempted to resolve this issue with discretization-invariant algorithms [Tallec et al., 2019; Park et al., 2021], their underlying design principles are rooted in discrete-time RL. As a result, these methods are not robust when applied to complex, stochastic, and continuous real-world environments.

Recently, there has been a fast-growing body of research on continuous-time RL [Yildiz et al., 2021; Jia and Zhou, 2022a, b, 2023; Zhao et al., 2023; Giegrich et al., 2024], including rigorous mathematical formulations and various algorithmic designs. However, most existing methods either rely on model-based assumptions, or consider stochastic policies, which are difficult to sample in continuous time, state, and action spaces [Jia et al., 2025], and which impose Bellman equation constraints that are not feasible to implement within deep RL frameworks. These challenges hinder the application of the continuous-time RL framework in practice, leading to an important research question:

Can we develop a theoretically grounded algorithm that achieves stability and efficiency for deep RL in continuous-time environments?

In this paper, we address this question by investigating deterministic policy gradient (DPG) methods. We consider general continuous-time dynamics driven by a stochastic differential equation over a finite horizon. Our main contributions are summarized as follows:

  • In Sec. 3, we develop a rigorous mathematical framework for model-free DPG methods in continuous-time RL. Specifically, Thm. 3.1 derives the DPG formula based on the advantage rate function. Thm. 3.2 further utilizes a martingale criterion to characterize the advantage rate function, laying the foundation for subsequent algorithm design. We also provide detailed comparisons against existing continuous-time RL algorithms with stochastic policies and discuss their major flaws and impracticality in deep RL frameworks.

  • In Sec. 4, we propose CT-DDPG, a novel and practical actor-critic algorithm with provable stability and efficiency in continuous-time environments. Notably, we utilize a multi-step TD objective and prove its robustness to time discretization and stochastic noise in Sec. 4.2. For the first time, we provide theoretical insights into the failure of standard discrete-time deep RL algorithms in continuous and stochastic settings.

  • Through extensive experiments in Sec. 5, we verify that existing discrete-time and continuous-time algorithms lack robustness to time discretization and dynamic noise, while our method exhibits consistently stable performance.

2 Problem Formulation

This section formulates the continuous RL problem, where the agent learns an optimal parametrized policy to control an unknown continuous-time stochastic system to maximize a reward functional over a finite time horizon.

Let the state space be $\mathbb{R}^{n}$ and the action space be an open set $\mathcal{A}\subseteq\mathbb{R}^{d}$. For each non-anticipative $\mathcal{A}$-valued control (action) process $\bm{a}=(a_{t})_{t\geq 0}$, consider the associated state process governed by the following dynamics:

\mathrm{d}X_{t}^{\bm{a}}=b(t,X_{t}^{\bm{a}},a_{t})\mathrm{d}t+\sigma(t,X_{t}^{\bm{a}},a_{t})\mathrm{d}W_{t},\;t\in[0,T];\quad X_{0}^{\bm{a}}=x_{0}\sim\nu, (2.1)

where $\nu$ is the initial distribution, $(W_{t})_{t\geq 0}$ is an $m$-dimensional Brownian motion on a filtered probability space $(\Omega,\mathcal{F},\mathbb{F}=(\mathcal{F}_{t})_{t\geq 0},\mathbb{P})$, and $b:[0,T]\times\mathbb{R}^{n}\times\mathcal{A}\to\mathbb{R}^{n}$, $\sigma:[0,T]\times\mathbb{R}^{n}\times\mathcal{A}\to\mathbb{R}^{n\times m}$ are continuous functions. The reward functional of $\bm{a}$ is given by

\mathbb{E}\left[\int_{0}^{T}e^{-\beta t}r(t,X_{t}^{\bm{a}},a_{t})\mathrm{d}t+e^{-\beta T}g(X_{T}^{\bm{a}})\right], (2.2)

where $\beta\geq 0$ is a discount factor, and $r:[0,T]\times\mathbb{R}^{n}\times\mathcal{A}\to\mathbb{R}$ and $g:\mathbb{R}^{n}\to\mathbb{R}$ are continuous functions, representing the running and terminal rewards, respectively.

It is well-known that under mild regularity conditions, it suffices to optimize (2.2) over control processes generated by Markov policies [Kurtz and Stockbridge, 1998]. Given a Markov policy $\mu:[0,T]\times{\mathbb{R}}^{n}\to\mathcal{A}$, the associated state process $(X^{\mu}_{t})_{t\geq 0}$ evolves according to the dynamics:

\mathrm{d}X_{t}^{\mu}=b(t,X_{t}^{\mu},\mu(t,X_{t}^{\mu}))\mathrm{d}t+\sigma(t,X_{t}^{\mu},\mu(t,X_{t}^{\mu}))\mathrm{d}W_{t},\;t\in[0,T];\quad X^{\mu}_{0}=x_{0}\sim\nu. (2.3)

The agent aims to maximize the following reward

\mathbb{E}\left[\int_{0}^{T}e^{-\beta t}r(t,X_{t}^{\mu},\mu(t,X^{\mu}_{t}))\mathrm{d}t+e^{-\beta T}g(X_{T}^{\mu})\right] (2.4)

over all admissible policies $\mu$. Importantly, the agent does not have access to the coefficients $b$, $\sigma$, $r$ and $g$. Instead, the agent directly interacts with Eq. 2.3 with different actions, and refines her strategy based on observed state and reward trajectories. We emphasize that in this paper, we directly optimize (2.4) over deterministic policies, which map the state space directly to the action space, rather than over stochastic policies as studied in Jia and Zhou [2022b, 2023]; Zhao et al. [2023], which map the state space to probability measures over the action space (see Sec. 3.3).

To solve Eq. 2.4, a practical approach is to restrict the optimization problem to a sufficiently rich class of parameterized policies. More precisely, given a class of policies $\{\mu_{\phi}:[0,T]\times{\mathbb{R}}^{n}\to\mathcal{A}\mid\phi\in\mathbb{R}^{k}\}$ parameterized by $\phi$, we consider the following maximization problem:

\max_{\phi\in\mathbb{R}^{k}}J(\phi),\quad\textnormal{with}\quad J(\phi)\coloneqq\mathbb{E}\left[\int_{0}^{T}e^{-\beta t}r(t,X_{t}^{\phi},\mu_{\phi}(t,X^{\phi}_{t}))\mathrm{d}t+e^{-\beta T}g(X_{T}^{\phi})\right], (2.5)

where $X^{\phi}$ denotes the state process controlled by $\mu_{\phi}$. Throughout this paper, we assume the initial state distribution $\nu$ has a finite second moment, and impose the following regularity conditions on the policy class and model coefficients.

Assumption 1.

There exists $C\geq 0$ such that for all $t\in[0,T]$, $a,a^{\prime}\in\mathcal{A}$ and $x,x^{\prime}\in{\mathbb{R}}^{n}$,

\begin{aligned}
|b(t,x,a)-b(t,x^{\prime},a^{\prime})|+|\sigma(t,x,a)-\sigma(t,x^{\prime},a^{\prime})|&\leq C(|x-x^{\prime}|+|a-a^{\prime}|),\\
|b(t,0,0)|+|\sigma(t,0,0)|\leq C,\quad|r(t,x,a)|+|g(x)|&\leq C(1+|x|^{2}+|a|^{2}),
\end{aligned}

and there exists a locally bounded function $\rho_{1}:[0,\infty)\to[0,\infty)$ such that for all $\phi\in{\mathbb{R}}^{k}$, $t\in[0,T]$, and $x,x^{\prime}\in{\mathbb{R}}^{n}$, $|\mu_{\phi}(t,x)-\mu_{\phi}(t,x^{\prime})|\leq\rho_{1}(|\phi|)|x-x^{\prime}|$ and $|\mu_{\phi}(t,0)|\leq\rho_{1}(|\phi|)$.

Asp. 1 holds for all policies parameterized by feedforward neural networks with Lipschitz activations. It ensures that the state dynamics and the objective function are well defined for any $\phi\in\mathbb{R}^{k}$.

3 Main Theoretical Results

We will first characterize the gradient of the objective functional Eq. 2.5 with respect to the policy parameter $\phi$, using a continuous-time analogue of the discrete-time advantage function. We will then derive a martingale characterization of this continuous-time advantage function and value function, which serves as the foundation of our algorithm design under deterministic policies. All detailed proofs can be found in Appendix B.

3.1 Deterministic policy gradient (DPG) formula

We first introduce a dynamic version of the objective function $J(\phi)$. For each $(t,x)\in[0,T]\times{\mathbb{R}}^{n}$, define the value function

V^{\phi}(t,x)\coloneqq\mathbb{E}\left[\int_{t}^{T}e^{-\beta(s-t)}r(s,X_{s}^{\phi},\mu_{\phi}(s,X^{\phi}_{s}))\mathrm{d}s+e^{-\beta(T-t)}g(X_{T}^{\phi})\,\bigg|\,X_{t}^{\phi}=x\right]. (3.1)

Note that $J(\phi)=\mathbb{E}_{x\sim\nu}[V^{\phi}(0,x)]$. We additionally impose the following differentiability conditions on the model coefficients and the policies with respect to the parameter.

Assumption 2.

For all $(t,x)\in[0,T]\times{\mathbb{R}}^{n}$, $a\mapsto(b,\sigma\sigma^{\top},r)(t,x,a)$ and $\phi\mapsto\mu_{\phi}(t,x)$ are continuously differentiable. There exists a locally bounded function $\rho_{2}:[0,\infty)\to[0,\infty)$ such that for all $\phi\in{\mathbb{R}}^{k}$ and $(t,x)\in[0,T]\times{\mathbb{R}}^{n}$,

\frac{|\partial_{\phi}b(t,x,\mu_{\phi}(t,x))|}{1+|x|}+\frac{|\partial_{\phi}(\sigma\sigma^{\top})(t,x,\mu_{\phi}(t,x))|+|\partial_{\phi}r(t,x,\mu_{\phi}(t,x))|}{1+|x|^{2}}\leq\rho_{2}(|\phi|).

Moreover, $V^{\phi}\in C^{1,2}([0,T]\times{\mathbb{R}}^{n})$ for all $\phi\in{\mathbb{R}}^{k}$.

Under Asp. 1, by Itô’s formula, for any given $\phi\in{\mathbb{R}}^{k}$, $V^{\phi}\in C^{1,2}([0,T]\times{\mathbb{R}}^{n})$ satisfies the following linear Bellman equation: for all $(t,x)\in[0,T]\times{\mathbb{R}}^{n}$,

\mathcal{L}[V^{\phi}](t,x,\mu_{\phi}(t,x))+r(t,x,\mu_{\phi}(t,x))=0,\quad V^{\phi}(T,x)=g(x), (3.2)

where $\mathcal{L}$ is the generator of (2.3) such that for all $\varphi\in C^{1,2}([0,T]\times{\mathbb{R}}^{n})$,

\mathcal{L}[\varphi](t,x,a)\coloneqq\partial_{t}\varphi(t,x)-\beta\varphi(t,x)+b(t,x,a)^{\top}\partial_{x}\varphi(t,x)+\frac{1}{2}\textrm{Tr}\big(\Sigma(t,x,a)\partial^{2}_{xx}\varphi(t,x)\big), (3.3)

with $\Sigma\coloneqq\sigma\sigma^{\top}$. The following theorem presents the DPG formula for the continuous RL problem.

Theorem 3.1.

Suppose Asps. 1 and 2 hold. For all $(t,x)\in[0,T]\times{\mathbb{R}}^{n}$ and $\phi\in{\mathbb{R}}^{k}$,

\partial_{\phi}V^{\phi}(t,x)=\mathbb{E}\left[\int_{t}^{T}e^{-\beta(s-t)}\partial_{\phi}\mu_{\phi}(s,X^{\phi}_{s})^{\top}\partial_{a}A^{\phi}(s,X^{\phi}_{s},\mu_{\phi}(s,X^{\phi}_{s}))\mathrm{d}s\,\bigg|\,X^{\phi}_{t}=x\right],

where $A^{\phi}(t,x,a)\coloneqq\mathcal{L}[V^{\phi}](t,x,a)+r(t,x,a)$.

The proof of Thm. 3.1 follows by quantifying the difference between the value functions corresponding to two policies, and then applying Vitali’s convergence theorem. A similar formula was established in Gobet and Munos [2005] under stronger conditions, namely that the running reward $r$ is zero, the diffusion coefficient is uniformly elliptic, and the coefficients are four times continuously differentiable.
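In practice, the formula in Thm. 3.1 is used by differentiating a learned approximation of the advantage rate through the policy via the chain rule, exactly as in discrete-time DPG. The following is a minimal PyTorch-style sketch of this step (our own illustration, not the paper's code); `policy_net` and `q_net` are hypothetical modules standing in for $\mu_{\phi}$ and a learned approximation of $A^{\phi}$:

```python
import torch

def dpg_surrogate_loss(policy_net, q_net, tx_batch):
    """Monte Carlo surrogate for the DPG formula of Thm. 3.1 (sketch).

    Minimizing this loss over the policy parameters performs gradient ascent
    along (1/B) * sum_i  d_phi mu_phi(tx_i)^T  d_a q_hat(tx_i, mu_phi(tx_i)),
    which autograd assembles via the chain rule.  `policy_net` and `q_net`
    are hypothetical torch.nn.Module's taking the concatenated (t, x) input.
    """
    actions = policy_net(tx_batch)                 # mu_phi(t, x)
    # In the actor step only the policy parameters receive gradients;
    # the critic is treated as fixed (e.g. its parameters are frozen).
    return -q_net(tx_batch, actions).mean()
```

Up to the time-step factor $h$, this is the actor loss used later in Alg. 1.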

Remark 1.

Thm. 3.1 is analogous to the DPG formula for discrete-time Markov decision processes [Silver et al., 2014]. The function $A^{\phi}$ plays the role of the advantage function used in discrete-time DPG, and has been referred to as the advantage rate function in Zhao et al. [2023]. To see this, assume $\beta=0$, and for any given $N\in\mathbb{N}$, consider the discrete-time version of Eq. 2.5:

J_{\Delta t}(\phi)\coloneqq\mathbb{E}\big[\sum_{i=0}^{N-1}r(t_{i},X_{t_{i}}^{\Delta t,\phi},\mu_{\phi}(t_{i},X^{\Delta t,\phi}_{t_{i}}))\Delta t+g(X_{T}^{\Delta t,\phi})\big], (3.4)

where $\Delta t={T}/{N}$, $t_{i}=i\Delta t$, and $X^{\Delta t,\phi}$ satisfies the following time discretization of Eq. 2.3:

X_{t_{i+1}}^{\Delta t,\phi}=X_{t_{i}}^{\Delta t,\phi}+b(t_{i},X_{t_{i}}^{\Delta t,\phi},\mu_{\phi}(t_{i},X_{t_{i}}^{\Delta t,\phi}))\Delta t+\sigma(t_{i},X_{t_{i}}^{\Delta t,\phi},\mu_{\phi}(t_{i},X_{t_{i}}^{\Delta t,\phi}))\sqrt{\Delta t}\,\omega_{t_{i}},

and $(\omega_{t_{i}})_{i=0}^{N-1}$ are independent standard normal random variables. By the deterministic policy gradient formula [Silver et al., 2014],

\partial_{\phi}J_{\Delta t}(\phi)=\mathbb{E}\Big[\sum_{i=0}^{N-1}\partial_{\phi}\mu_{\phi}(t_{i},X_{t_{i}}^{\Delta t,\phi})^{\top}\partial_{a}A^{\Delta t,\phi}(t_{i},X_{t_{i}}^{\Delta t,\phi},\mu_{\phi}(t_{i},X_{t_{i}}^{\Delta t,\phi}))\Delta t\Big], (3.5)

where $A^{\Delta t,\phi}(t,x,a)\coloneqq\frac{Q^{\Delta t,\phi}(t,x,a)-V^{\Delta t,\phi}(t,x)}{\Delta t}$ is the advantage function for Eq. 3.4, normalized by the time stepsize. As $N\to\infty$, $A^{\Delta t,\phi}$ converges to $A^{\phi}$, as shown in Jia and Zhou [2023]. Sending $\Delta t\to 0$ in Eq. 3.5 yields the continuous-time DPG formula in Thm. 3.1.
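To make the discretization in Remark 1 concrete, the following NumPy sketch (our own illustration, not the paper's code) rolls out the Euler–Maruyama scheme above and estimates $J_{\Delta t}(\phi)$ by Monte Carlo; the callables `b`, `sigma`, `r`, `g`, `policy`, and `x0_sampler` are assumed to follow the signatures of Sec. 2, with $\beta=0$:

```python
import numpy as np

def estimate_return(policy, b, sigma, r, g, x0_sampler, T=1.0, N=100,
                    n_paths=1_000, rng=None):
    """Monte Carlo estimate of the discretized objective J_{Delta t}(phi), Eq. (3.4).

    Assumed signatures (matching Sec. 2, with beta = 0):
        b(t, x, a) -> (n,),  sigma(t, x, a) -> (n, m),  r(t, x, a) -> float,
        g(x) -> float,  policy(t, x) -> (d,),  x0_sampler(rng) -> (n,).
    """
    rng = rng or np.random.default_rng(0)
    dt = T / N
    total = 0.0
    for _ in range(n_paths):
        x = x0_sampler(rng)
        ret = 0.0
        for i in range(N):
            t = i * dt
            a = policy(t, x)
            ret += r(t, x, a) * dt                    # running reward
            diffusion = sigma(t, x, a)
            dW = np.sqrt(dt) * rng.standard_normal(diffusion.shape[1])
            x = x + b(t, x, a) * dt + diffusion @ dW  # Euler-Maruyama step
        total += ret + g(x)                           # terminal reward
    return total / n_paths
```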

3.2 Martingale characterization of continuous-time advantage rate function

By Thm. 3.1, implementing the DPG requires computing the advantage rate function $A^{\phi}$ in a neighborhood of the policy $\mu_{\phi}$. The following theorem characterizes the advantage rate function through a martingale criterion.

Theorem 3.2.

Suppose Asps. 1 and 2 hold. Let $\phi\in{\mathbb{R}}^{k}$, $\hat{V}\in{C}^{1,2}([0,T]\times\mathbb{R}^{n})$ and $\hat{q}\in{C}([0,T]\times\mathbb{R}^{n}\times\mathcal{A})$ satisfy the following conditions for all $(t,x)\in[0,T]\times\mathbb{R}^{n}$:

\hat{V}(T,x)=g(x),\quad\hat{q}(t,x,\mu_{\phi}(t,x))=0, (3.6)

and there exists a neighborhood $\mathcal{O}_{\mu_{\phi}(t,x)}\subset\mathcal{A}$ of $\mu_{\phi}(t,x)$ such that for all $a\in\mathcal{O}_{\mu_{\phi}(t,x)}$,

\left(e^{-\beta(s-t)}\hat{V}(s,X_{s}^{t,x,a})+\int_{t}^{s}e^{-\beta(u-t)}(r-\hat{q})(u,X_{u}^{t,x,a},\alpha_{u})\mathrm{d}u\right)_{s\in[t,T]} (3.7)

is an $\mathbb{F}$-martingale, where $X^{t,x,a}$ satisfies for all $s\in[t,T]$,

\mathrm{d}X^{t,x,a}_{s}=b(s,X^{t,x,a}_{s},\alpha_{s})\mathrm{d}s+\sigma(s,X^{t,x,a}_{s},\alpha_{s})\mathrm{d}W_{s},\quad X^{t,x,a}_{t}=x, (3.8)

and $(\alpha_{s})_{s\geq t}$ is a square-integrable $\mathcal{A}$-valued adapted process with $\lim_{s\searrow t}\alpha_{s}=a$ almost surely. Then $\hat{V}(t,x)=V^{\phi}(t,x)$ and $\hat{q}(t,x,a)=A^{\phi}(t,x,a)$ for all $(t,x,a)\in[0,T]\times\mathbb{R}^{n}\times\mathcal{O}_{\mu_{\phi}(t,x)}$.

Thm. 3.2 establishes sufficient conditions ensuring that the functions $\hat{V}$ and $\hat{q}$ coincide with the value function and the advantage rate function of a given policy $\mu_{\phi}$, respectively. Eq. 3.6 requires that $\hat{V}$ agrees with the terminal condition $g$ at time $T$, and that $\hat{q}$ satisfies the linear Bellman equation Eq. 3.2, as the true advantage rate $A^{\phi}$ does. The martingale constraint Eq. 3.7 ensures that $\hat{q}$ is the advantage rate function associated with $\hat{V}$, for all actions in a neighborhood of the policy $\mu_{\phi}$.

To ensure exploration of the action space, Thm. 3.2 requires that the martingale condition Eq. 3.7 holds for state processes initialized with any action $a\in\mathcal{O}_{\mu_{\phi}(t,x)}$. In practice, one can use an exploration policy to generate these exploratory actions, which are then employed to learn the gradient of the target deterministic policy. This parallels the central role of off-policy algorithms in discrete-time DPG methods [Lillicrap et al., 2015; Haarnoja et al., 2018a].

3.3 Improved efficiency and stability of deterministic policies over stochastic policies

Thm. 3.2 implies that DPG can be estimated both more efficiently and more stably than stochastic policy gradients, since it avoids costly integrations over the action space.

Recall that Jia and Zhou [2022b, 2023]; Zhao et al. [2023] study continuous-time RL with stochastic policies $\pi:[0,T]\times{\mathbb{R}}^{n}\to\mathcal{P}(\mathcal{A})$ and establish an analogous policy gradient formula based on the corresponding advantage rate function. By incorporating an additional entropy term into the objective, Jia and Zhou [2023] characterizes the advantage rate function analogously to Thm. 3.2, replacing the Bellman condition Eq. 3.6 with

\mathbb{E}_{a\sim\pi(\mathrm{d}a|t,x)}[\hat{q}(t,x,a)-\gamma\log\pi(a|t,x)]=0,\quad\forall(t,x)\in[0,T]\times{\mathbb{R}}^{n}, (3.9)

where $\gamma>0$ is the entropy regularization coefficient, and requiring the martingale constraint Eq. 3.7 to hold for all state dynamics starting at state $x$ at time $t$, with actions sampled randomly from $\pi$ on any time partition of $[0,T]$. Implementing the criterion Eq. 3.9 requires sampling random actions from the policy $\pi$ to compute the expectation over the action space. This makes policy evaluation substantially more challenging in deep RL, particularly with high-dimensional action spaces or non-Gaussian policies, often resulting in training instability and slow convergence, as observed in our experiments in Sec. 5. In contrast, the Bellman condition Eq. 3.6 for DPG can be straightforwardly implemented using a simple re-parameterization (see Eq. 4.2), as sketched below.
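To make the contrast concrete, the following hedged sketch shows how the two constraints are typically enforced: the stochastic-policy condition Eq. 3.9 needs Monte Carlo averaging over sampled actions, whereas the deterministic condition in Eq. 3.6 reduces to one extra forward pass through the advantage-rate network (cf. Eq. 4.2). The modules `q_net` and `policy_dist` are hypothetical placeholders:

```python
import torch

def entropy_bellman_residual(q_net, policy_dist, tx, gamma, n_samples=32):
    """Monte Carlo estimate of the left-hand side of Eq. (3.9) (sketch).

    `policy_dist(tx)` is assumed to return a torch.distributions object over
    actions (e.g. an Independent Normal), and `q_net(tx, a)` a batch of
    q-values; both are hypothetical modules.  Enforcing Eq. (3.9) requires
    driving this noisy estimate to zero for every (t, x) in the batch.
    """
    dist = policy_dist(tx)
    a = dist.rsample((n_samples,))                 # (n_samples, B, action_dim)
    tx_rep = tx.unsqueeze(0).expand(n_samples, *tx.shape)
    vals = q_net(tx_rep, a) - gamma * dist.log_prob(a)
    return vals.mean(dim=0)                        # one residual per (t, x)

# The deterministic counterpart, Eq. (3.6), needs no sampling at all:
# q_psi(tx, a) = q_bar(tx, a) - q_bar(tx, mu_phi(tx)), see Eq. (4.2) below.
```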

4 Algorithm and Analysis

4.1 Algorithm design

Given the martingale characterization (Thm. 3.2), we now discuss the implementation details within a continuous-time RL framework via deep neural networks. We use $V_{\theta},q_{\psi},\mu_{\phi}$ to denote the neural networks for the value function, the advantage rate function, and the policy, respectively.

Martingale loss.

To enforce the martingale condition Eq. 3.7, let $M_{t}=e^{-\beta t}V_{\theta}(t,x_{t})+\int_{0}^{t}e^{-\beta s}[r(s,x_{s},a_{s})-q_{\psi}(s,x_{s},a_{s})]\mathrm{d}s$. We adopt the martingale orthogonality conditions (also known as the generalized method of moments) $\mathbb{E}\left[\int_{0}^{T}\zeta_{t}\,\mathrm{d}M_{t}\right]=0$, where $\bm{\zeta}=(\zeta_{t})_{t\in[0,T]}$ is a test process. Requiring these conditions for all $\mathbb{F}$-adapted and square-integrable processes $\bm{\zeta}$ is both necessary and sufficient for the martingale condition [Jia and Zhou, 2022a].

In theory, one should consider all possible test functions, which leads to infinitely many equations. For practical implementation, however, it suffices to select a finite number of test functions with special structure. A natural choice is to set $\zeta_{t}=\partial_{\theta}V_{\theta}(t,x_{t})$ or $\zeta_{t}=\partial_{\psi}q_{\psi}(t,x_{t},a_{t})$, in which case the martingale orthogonality condition becomes a vector-valued equation. The classic stochastic approximation method [Robbins and Monro, 1951] can be applied to solve it:

\theta\leftarrow\theta-\eta\,\partial_{\theta}V_{\theta}(t,x_{t})\cdot\Big(V_{\theta}(t,x_{t})-\int_{t}^{t+\delta}e^{-\beta(s-t)}[r(s,x_{s},a_{s})-q_{\psi}(s,x_{s},a_{s})]\mathrm{d}s-e^{-\beta\delta}V_{\theta}(t+\delta,x_{t+\delta})\Big),
\psi\leftarrow\psi-\eta\,\partial_{\psi}q_{\psi}(t,x_{t},a_{t})\cdot\Big(V_{\theta}(t,x_{t})-\int_{t}^{t+\delta}e^{-\beta(s-t)}[r(s,x_{s},a_{s})-q_{\psi}(s,x_{s},a_{s})]\mathrm{d}s-e^{-\beta\delta}V_{\theta}(t+\delta,x_{t+\delta})\Big),

where $\delta>0$ is the integration interval and the trajectory is sampled from collected data. Note that the update formulas above are also referred to as the semi-gradient TD method in RL [Sutton et al., 1998].
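Before turning to the full algorithm, the following PyTorch-style sketch illustrates one such stochastic-approximation step; it is an illustration under simplifying assumptions (one test function per network and a precomputed batch), not the paper's implementation. The batch fields `tx_t`, `a_t`, `tx_next`, `integ_r`, and `integ_q` are hypothetical names for the quantities appearing in the update above:

```python
import math
import torch

def semi_gradient_td_step(V, q, batch, beta, delta, lr):
    """One stochastic-approximation step for the updates displayed above (sketch).

    `V` and `q` are hypothetical torch.nn.Module critics, and `batch` is assumed
    to carry tensors tx_t, a_t, tx_next (states delta apart in time) together with
    integ_r and integ_q, discretized approximations of the integrals of
    e^{-beta(s-t)} r and e^{-beta(s-t)} q_psi over [t, t + delta].
    """
    # Semi-gradient: the TD residual in parentheses is treated as a constant,
    # so only d_theta V and d_psi q multiply it, matching the updates above.
    resid = (V(batch.tx_t) - (batch.integ_r - batch.integ_q)
             - math.exp(-beta * delta) * V(batch.tx_next)).detach()
    loss = (V(batch.tx_t) * resid).mean() + (q(batch.tx_t, batch.a_t) * resid).mean()

    for net in (V, q):
        net.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in list(V.parameters()) + list(q.parameters()):
            if p.grad is not None:
                p.add_(p.grad, alpha=-lr)   # theta <- theta - lr * grad, etc.
                p.grad.zero_()
```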

Algorithm 1 Continuous Time Deep Deterministic Policy Gradient
Inputs: Discretization step size $h$, horizon $K=T/h$, discount rate $\beta$, number of episodes $N$, policy net $\mu_{\phi}$, advantage-rate net $\bar{q}_{\psi}$, value net $V_{\theta}$, update frequency $m$, trajectory length $L$, exploration noise $\sigma_{\text{explore}}$, soft update parameter $\tau$, learning rate $\eta$, batch size $B$, terminal value constraint weight $\alpha$
Learning Procedures:
  Initialize $\phi,\psi,\theta$, target $\theta^{tgt}=\theta$, and replay buffer $\mathcal{R}$
for $n=1,\cdots,N$ do
  Observe the initial state $\tilde{x}_{0}$
  for $k=1,\cdots,K$ do
    Perform $a_{kh}\sim\mathcal{N}(\mu_{\phi}(\tilde{x}_{kh}),\sigma_{\text{explore}}^{2})$ and collect $r_{kh},\tilde{x}_{(k+1)h}$
    Store $(\tilde{x}_{kh},a_{kh},r_{kh},\tilde{x}_{(k+1)h})$ in $\mathcal{R}$
   if $k\equiv 0\ \mathrm{mod}\ m$ then
    ▷ train advantage rate function and value function
     Sample a batch of trajectories $\{\tilde{x}_{k_{i}h:(k_{i}+L)h}^{(i)},a_{k_{i}h:(k_{i}+L)h}^{(i)},r_{k_{i}h:(k_{i}+L)h}^{(i)}\}_{i=1}^{B}$ from $\mathcal{R}$
     Define $q_{\psi}(\tilde{x},a):=\bar{q}_{\psi}(\tilde{x},a)-\bar{q}_{\psi}(\tilde{x},\mu_{\phi}(\tilde{x}))$
     Compute the martingale loss
\mathcal{L}^{M}=\frac{1}{B}\sum_{i=1}^{B}\Big(V_{\theta}(\tilde{x}_{k_{i}h}^{(i)})-\sum_{l=0}^{L-1}e^{-\beta lh}[r_{(k_{i}+l)h}^{(i)}-q_{\psi}(\tilde{x}_{(k_{i}+l)h}^{(i)},a_{(k_{i}+l)h}^{(i)})]h-e^{-\beta Lh}V_{\theta^{tgt}}(\tilde{x}_{(k_{i}+L)h}^{(i)})\Big)^{2} (4.1)
     Sample a batch of terminal states $\{\tilde{x}_{Kh}^{(i)},r_{Kh}^{(i)}\}_{i=1}^{B}$ from $\mathcal{R}$
     Compute the terminal value constraint $\mathcal{L}^{C}=\frac{1}{B}\sum_{i=1}^{B}(V_{\theta}(\tilde{x}_{Kh}^{(i)})-r_{Kh}^{(i)})^{2}$
     Update the critic: $\psi\leftarrow\psi-\eta\,\partial_{\psi}(\mathcal{L}^{M}+\alpha\mathcal{L}^{C})$, $\theta\leftarrow\theta-\eta\,\partial_{\theta}(\mathcal{L}^{M}+\alpha\mathcal{L}^{C})$
    ▷ train policy
     Sample a batch of states $\{\tilde{x}_{k_{i}h}^{(i)}\}_{i=1}^{B}$ from $\mathcal{R}$
     Compute the policy loss $\mathcal{L}=-\frac{1}{B}\sum_{i=1}^{B}\bar{q}_{\psi}(\tilde{x}_{k_{i}h}^{(i)},\mu_{\phi}(\tilde{x}_{k_{i}h}^{(i)}))h$
     Update the actor: $\phi\leftarrow\phi-\eta\,\partial_{\phi}\mathcal{L}$
     Update the target: $\theta^{tgt}\leftarrow\tau\theta+(1-\tau)\theta^{tgt}$
   end if
  end for
end for

Bellman constraints.

To enforce Eq. 3.6, we re-parameterize the advantage rate function as

q_{\psi}(t,x,a):=\bar{q}_{\psi}(t,x,a)-\bar{q}_{\psi}(t,x,\mu_{\phi}(t,x)), (4.2)

where $\bar{q}_{\psi}$ is a neural network and $\mu_{\phi}$ denotes the current deterministic policy [Tallec et al., 2019].

In practice, it is often challenging to design a neural network structure that directly enforces the terminal value constraint. To address this, we add a penalty term of the form $\mathbb{E}\big[(V_{\theta}(T,x_{T})-g(x_{T}))^{2}\big]$, where $x_{T}$ and $g(x_{T})$ are sampled from collected trajectories.
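A minimal PyTorch sketch of these two ingredients is given below (our own illustration, not the released code); `q_bar`, `policy`, `V`, and the batch tensors are hypothetical placeholders:

```python
import torch
import torch.nn as nn

class ReparamAdvantageRate(nn.Module):
    """Advantage-rate net of Eq. (4.2) (sketch): q_psi = q_bar(., a) - q_bar(., mu_phi(.)).

    By construction q_psi(t, x, mu_phi(t, x)) = 0, so the Bellman constraint in
    Eq. (3.6) holds exactly.  `q_bar` and `policy` are hypothetical modules
    taking the concatenated time-state input used in Alg. 1.
    """
    def __init__(self, q_bar, policy):
        super().__init__()
        self.q_bar = q_bar
        self.policy = policy

    def forward(self, tx, a):
        return self.q_bar(tx, a) - self.q_bar(tx, self.policy(tx))

def terminal_value_penalty(V, tx_terminal, g_terminal):
    """Soft terminal constraint E[(V_theta(T, x_T) - g(x_T))^2], i.e. the term
    L^C of Alg. 1, added to the critic loss with weight alpha (sketch)."""
    return ((V(tx_terminal) - g_terminal) ** 2).mean()
```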

Implementation with discretization.

Let $h$ denote the discretization step size. We denote by $\tilde{x}_{t}$ the concatenation of time and state $(t,x_{t})$ for compactness. The full procedure of Continuous-Time Deep Deterministic Policy Gradient (CT-DDPG) is summarized in Alg. 1.

We employ several training techniques widely used in modern deep RL algorithms such as DDPG and SAC. In particular, we employ a target value network $V_{\theta^{tgt}}$, defined as the exponentially moving average of the value network weights. This technique has been shown to improve training stability in deep RL. (Here we focus on a single target value network, as our primary goal is to study the efficiency of deterministic policies in continuous-time RL; extensions with multiple target networks [Haarnoja et al., 2018b; Fujimoto et al., 2018] can be readily incorporated.) We further adopt a replay buffer to store transitions in order to improve sample efficiency. For exploration, we add independent Gaussian noise to the deterministic policy $\mu_{\phi}$.

Multi-step TD.

When training the advantage-rate and value networks, we adopt multiple steps $L>1$ to compute the temporal difference error (see Eq. 4.1). This differs from most off-policy algorithms, which typically rely on a single transition step. Notably, when $L=1$, our algorithm reduces to DAU [Tallec et al., 2019, Alg. 2], except that their policy learning rate vanishes as $h\to 0$. We highlight that multi-step TD is essential for the empirical success of CT-DDPG. In the next subsection, we theoretically demonstrate that one-step TD inevitably leads to gradient variance blow-up in the limit of vanishing discretization step, thereby slowing convergence.
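The sketch below (a PyTorch-style illustration, not the released code) computes the multi-step martingale loss $\mathcal{L}^{M}$ of Eq. 4.1 for a batch of length-$L$ sub-trajectories; the batch layout (`traj.tx`, `traj.a`, `traj.r`, `traj.tx_last`) is an assumption about how the replay buffer is organized:

```python
import math
import torch

def martingale_loss(V, V_tgt, q, traj, beta, h):
    """Multi-step martingale/TD loss L^M of Eq. (4.1) (sketch).

    Assumed replay-buffer layout for a batch of length-L sub-trajectories:
        traj.tx : (B, L, 1 + n) time-state pairs,  traj.a : (B, L, d) actions,
        traj.r  : (B, L) rewards,  traj.tx_last : (B, 1 + n) states L steps ahead.
    V, V_tgt, q are hypothetical value, target-value and advantage-rate nets.
    """
    B, L, _ = traj.tx.shape
    disc = torch.exp(-beta * h * torch.arange(L, dtype=torch.float32))  # e^{-beta l h}
    q_vals = q(traj.tx.reshape(B * L, -1), traj.a.reshape(B * L, -1)).view(B, L)
    running = ((traj.r - q_vals) * disc).sum(dim=1) * h     # sum_l e^{-beta l h}[r - q] h
    tail = math.exp(-beta * L * h) * V_tgt(traj.tx_last).view(B).detach()
    resid = V(traj.tx[:, 0]).view(B) - running - tail       # gradients flow to theta and psi
    return (resid ** 2).mean()
```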

4.2 Issues of One-Step TD in Continuous Time: Variance Blow up

When training the value function $V_{\theta}$ and the advantage function $A_{\psi}$ for a given policy (stochastic or deterministic), temporal difference algorithms [Haarnoja et al., 2018a; Tallec et al., 2019; Jia and Zhou, 2023] typically use a one-step semi-gradient:

G_{\theta,h}=\frac{1}{h}\mathbb{E}\left[\partial_{\theta}V_{\theta}(\tilde{x}_{t})\left(V_{\theta}(\tilde{x}_{t})-(r_{t}-A_{\psi}(\tilde{x}_{t},a_{t}))\cdot h-e^{-\beta h}V_{\theta}(\tilde{x}_{t+h})\right)\right], (4.3)
G_{\psi,h}=\frac{1}{h}\mathbb{E}\left[\partial_{\psi}A_{\psi}(\tilde{x}_{t},a_{t})\left(V_{\theta}(\tilde{x}_{t})-(r_{t}-A_{\psi}(\tilde{x}_{t},a_{t}))\cdot h-e^{-\beta h}V_{\theta}(\tilde{x}_{t+h})\right)\right],

where $t\sim\mathrm{TruncExp}(\beta;T)$ and $x_{t}\sim X_{t}^{\pi^{\prime}}$, $a_{t}\sim\pi^{\prime}(\cdot|t,x_{t})$ for an exploration policy $\pi^{\prime}$. In practice, however, one has to use the stochastic gradient:

g_{\theta,h}=\frac{1}{h}\left[\partial_{\theta}V_{\theta}(\tilde{x}_{t})\left(V_{\theta}(\tilde{x}_{t})-(r_{t}-A_{\psi}(\tilde{x}_{t},a_{t}))\cdot h-e^{-\beta h}V_{\theta}(\tilde{x}_{t+h})\right)\right], (4.4)
g_{\psi,h}=\frac{1}{h}\left[\partial_{\psi}A_{\psi}(\tilde{x}_{t},a_{t})\left(V_{\theta}(\tilde{x}_{t})-(r_{t}-A_{\psi}(\tilde{x}_{t},a_{t}))\cdot h-e^{-\beta h}V_{\theta}(\tilde{x}_{t+h})\right)\right].
Proposition 4.1.

Assume $\underline{C}\cdot I\preceq\sigma\sigma^{\top}\preceq\overline{C}\cdot I$ for some $0<\underline{C}\leq\overline{C}$, and that $\partial_{\theta}V_{\theta},\partial_{x}V_{\theta}$ are not identically zero. Then the variance of the stochastic gradient estimators blows up, in the sense that

\lim_{h\to 0}\mathbb{E}[g_{\theta,h}]=\lim_{h\to 0}G_{\theta,h}=\Theta(1),\quad\lim_{h\to 0}\mathbb{E}[g_{\psi,h}]=\lim_{h\to 0}G_{\psi,h}=\Theta(1), (4.5)
\lim_{h\to 0}h\cdot\mathrm{Var}(g_{\theta,h})=\Theta(1),\quad\lim_{h\to 0}h\cdot\mathrm{Var}(g_{\psi,h})=\Theta(1). (4.6)

In contrast, Alg. 1 utilizes an $L$-step TD loss with the (stochastic) semi-gradient (for simplicity of the theoretical analysis, we consider a hard update of the target, i.e., $\tau=1$):

G_{\theta,h,L}=\mathbb{E}\big[\partial_{\theta}V_{\theta}(\tilde{x}_{t})\big(V_{\theta}(\tilde{x}_{t})-\sum_{l=0}^{L-1}e^{-\beta lh}[r_{t+lh}-q_{\psi}(\tilde{x}_{t+lh},a_{t+lh})]h-e^{-\beta Lh}V_{\theta}(\tilde{x}_{t+Lh})\big)\big], (4.7)
g_{\theta,h,L}=\partial_{\theta}V_{\theta}(\tilde{x}_{t})\big(V_{\theta}(\tilde{x}_{t})-\sum_{l=0}^{L-1}e^{-\beta lh}[r_{t+lh}-q_{\psi}(\tilde{x}_{t+lh},a_{t+lh})]h-e^{-\beta Lh}V_{\theta}(\tilde{x}_{t+Lh})\big). (4.8)
Proposition 4.2.

Under the same assumptions as in Prop. 4.1, if $Lh\equiv\delta>0$, then the expected gradient does not vanish, in the sense that

\lim_{h\to 0}\mathbb{E}[g_{\theta,h,\frac{\delta}{h}}]=\lim_{h\to 0}G_{\theta,h,\frac{\delta}{h}}=\Theta(1). (4.9)

In addition, the variance of the stochastic gradient does not blow up:

\overline{\lim_{h\to 0}}\,\mathrm{Var}(g_{\theta,h,\frac{\delta}{h}})=\mathcal{O}(1). (4.10)
Remark 2 (effect of the $1/h$ scaling).

Note that in Eq. 4.7 we omit the $1/h$ factor, in contrast to Eq. 4.3. This modification is crucial for preventing the variance from blowing up. If we were instead to remove the $1/h$ factor in Eq. 4.3, then according to Prop. 4.1 the expected gradient $G_{\theta,h}$ would vanish as $h\to 0$. This theoretical inconsistency reveals a fundamental drawback of one-step TD methods in the continuous-time RL framework, which is also verified in our experiments.
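The following self-contained toy computation (our own illustration, not an experiment from the paper) makes the variance scaling in Eqs. 4.6 and 4.10 visible on the simplest possible example: pure-noise dynamics $\mathrm{d}X=\sigma\,\mathrm{d}W$ with zero reward, $\beta=0$, and a linear critic $V_{\theta}(x)=\theta x$:

```python
import numpy as np

def td_residual_variance(h, delta=0.1, sigma=1.0, theta=1.0, n=100_000, rng=None):
    """Toy check of the variance statements in Props. 4.1-4.2 (a sketch).

    Dynamics dX = sigma dW, zero reward, beta = 0, critic V_theta(x) = theta * x,
    so grad_theta V = x.  We compare the 1/h-scaled one-step semi-gradient with
    the L-step one where L * h = delta (no 1/h factor).
    """
    rng = rng or np.random.default_rng(0)
    x0 = rng.standard_normal(n)
    # One-step: g_{theta,h} = (1/h) * x0 * (V(x0) - V(x_h)) = -(theta/h) * x0 * sigma*sqrt(h)*w
    one_step = -(theta / h) * x0 * sigma * np.sqrt(h) * rng.standard_normal(n)
    # Multi-step with L*h = delta: g_{theta,h,L} = x0 * (V(x0) - V(x_delta))
    multi_step = -theta * x0 * sigma * np.sqrt(delta) * rng.standard_normal(n)
    return one_step.var(), multi_step.var()

for h in [1e-1, 1e-2, 1e-3]:
    v1, vL = td_residual_variance(h)
    print(f"h={h:g}: one-step var ~ {v1:.2f} (grows like 1/h), multi-step var ~ {vL:.3f}")
```

In this toy model the one-step variance is roughly $\theta^{2}\sigma^{2}/h$ while the multi-step variance stays near $\theta^{2}\sigma^{2}\delta$, mirroring Eqs. 4.6 and 4.10.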

Remark 3 (previous analysis of one-step TD).

Jia and Zhou [2022a] discussed the issues of the one-step TD objective

\min_{\theta}\frac{1}{h^{2}}\mathbb{E}_{\tilde{x}}\left(V_{\theta}(\tilde{x}_{t})-r_{t}\cdot h-e^{-\beta h}V_{\theta}(\tilde{x}_{t+h})\right)^{2}, (4.11)

showing that its minimizer does not converge to the true value function as $h\to 0$. However, practical one-step TD methods do not directly optimize Eq. 4.11, but rather employ the semi-gradient update Eq. 4.3. Consequently, the analysis in Jia and Zhou [2022a] does not fully explain the failure of discrete-time RL algorithms under small discretization steps. In contrast, our analysis is consistent with the actual update rule and thus offers theoretical insights that are directly relevant to the design of continuous-time algorithms.

5 Experiments

The goal of our numerical experiments is to evaluate the efficiency of the proposed CT-DDPG algorithm and the continuous-time RL framework in terms of convergence speed, training stability, and robustness to the discretization step and dynamic noise.

Environments.

We evaluate on a suite of challenging continuous-control benchmarks from Gymnasium [Towers et al., 2024]: Pendulum-v1, HalfCheetah-v5, Hopper-v5, and Walker2d-v5, sweeping the discretization step and the dynamic noise level. To model stochastic dynamics, at each simulator step we sample an i.i.d. Gaussian generalized force $\xi\sim\mathcal{N}(0,\sigma^{2}I)$ and write it to MuJoCo's qfrc_applied buffer [Todorov et al., 2012], thereby perturbing the equations of motion. More details can be found in Appendix C.
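The snippet below is a hedged sketch of this perturbation for a Gymnasium MuJoCo environment (the wrapper name and hyperparameters are ours, not the paper's exact code); it assumes the environment exposes the underlying mjData object via `unwrapped.data`:

```python
import gymnasium as gym
import numpy as np

class GeneralizedForceNoise(gym.Wrapper):
    """Writes xi ~ N(0, sigma^2 I) into MuJoCo's qfrc_applied before each step (sketch).

    Assumes a Gymnasium MuJoCo environment whose `unwrapped.data` is an mjData
    object; qfrc_applied has one entry per degree of freedom of the model.
    """
    def __init__(self, env, sigma=0.1, seed=0):
        super().__init__(env)
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)

    def step(self, action):
        data = self.env.unwrapped.data
        data.qfrc_applied[:] = self.sigma * self.rng.standard_normal(data.qfrc_applied.shape)
        return self.env.step(action)

# Example usage (hypothetical noise level):
env = GeneralizedForceNoise(gym.make("HalfCheetah-v5"), sigma=0.1)
```

Injecting noise through the generalized-force buffer perturbs the equations of motion directly, without altering the observation or action spaces of the benchmark.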

Baselines.

We compare against the discrete-time algorithms DDPG [Lillicrap et al., 2015] and SAC [Haarnoja et al., 2018b], as well as a continuous-time algorithm with a stochastic Gaussian policy: q-learning [Jia and Zhou, 2023]. In particular, for q-learning, we adopt two different settings when learning the q-function: the original one-step TD target ($L=1$) in Jia and Zhou [2023], and a multi-step TD extension with $L>1$ as in Alg. 1. This provides a fair comparison between deterministic and stochastic policies in continuous-time RL. We also test DAU [Tallec et al., 2019], i.e., CT-DDPG with $L=1$, to see the effects of multi-step TD. For each algorithm, we report results averaged over at least three independent runs with different random seeds.

Figure 1: Comparison between CT-DDPG and discrete-time RL algorithms.

Results.

Figs. 1 and 2 show the average return against training episodes, where the shaded area indicates the standard deviation across runs. We observe that for most environments, our CT-DDPG has the best performance among all baselines, and the gap widens as the discretization step decreases and/or the noise level increases. Specifically, we make the following observations:

  • As demonstrated in Fig. 1, although the discrete-time algorithms DDPG and SAC perform reasonably well under the standard Gymnasium settings (top row), they degrade substantially when $h$ decreases and $\sigma$ increases (middle & bottom rows). This stems from the fact that one-step TD updates provide only myopic information under small $h$ and noisy dynamics, preventing the Q-function from capturing the long-term structure of the problem.

  • For continuous-time RL with stochastic policies, shown in Fig. 2, q-learning exhibits slow convergence and training instability, due to the difficulty of enforcing the Bellman equation constraint Eq. 3.9. Although q-learning with multi-step TD improves to some extent upon the original q-learning ($L=1$), it still remains unstable across diverse environment settings and underperforms CT-DDPG. This highlights the fundamental limitations of stochastic policies in continuous-time RL.

  • To further investigate the effects of multi-step TD, we also test DAU (i.e., CT-DDPG with $L=1$) in Fig. 2. It turns out that in the small-$h$, large-$\sigma$ regime, DAU converges more slowly. In Fig. 3, we examine the variance-to-squared-norm ratio (noise-to-signal ratio, NSR) of the stochastic gradients during training. As $h\to 0$, the NSR of DAU becomes evidently larger than that of CT-DDPG, consistent with our theory in Sec. 4.2. A large NSR leads to instability when training the q-function and consequently impedes convergence.

Figure 2: Comparison between continuous-time RL algorithms.
Figure 3: Noise-to-signal ratio (NSR) of the stochastic gradient when training the value net.

In summary, CT-DDPG exhibits superior performance in terms of convergence speed and stability across most environment settings, verifying the efficiency and robustness of our method.

6 Conclusion

In this paper, we investigate deterministic policy gradient methods to achieve stability and efficiency for deep RL in continuous-time environments, bridging the gap between discrete- and continuous-time algorithms. We develop a rigorous mathematical framework and provide a martingale characterization for DPG. We further provide, for the first time, a theoretical account of the issues of the standard one-step TD method in the continuous-time regime. All our theoretical results are verified through extensive experiments. We hope this work will motivate future research on continuous-time RL.

Acknowledgments

YZ is grateful for support from the Imperial Global Connect Fund, and the CNRS–Imperial Abraham de Moivre International Research Laboratory.

References

  • Baird [1994] Leemon C Baird. Reinforcement learning in continuous time: Advantage updating. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94), volume 4, pages 2448–2453. IEEE, 1994.
  • Doya [2000] Kenji Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245, 2000.
  • Fujimoto et al. [2018] Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587–1596. PMLR, 2018.
  • Giegrich et al. [2024] Michael Giegrich, Christoph Reisinger, and Yufei Zhang. Convergence of policy gradient methods for finite-horizon exploratory linear-quadratic control problems. SIAM Journal on Control and Optimization, 62(2):1060–1092, 2024.
  • Gobet and Munos [2005] Emmanuel Gobet and Rémi Munos. Sensitivity analysis using Itô–Malliavin calculus and martingales, and application to stochastic optimal control. SIAM Journal on Control and Optimization, 43(5):1676–1713, 2005.
  • Guo et al. [2025] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
  • Haarnoja et al. [2018a] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018a.
  • Haarnoja et al. [2018b] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018b.
  • Jia and Zhou [2022a] Yanwei Jia and Xun Yu Zhou. Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach. Journal of Machine Learning Research, 23(154):1–55, 2022a.
  • Jia and Zhou [2022b] Yanwei Jia and Xun Yu Zhou. Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms. Journal of Machine Learning Research, 23(275):1–50, 2022b.
  • Jia and Zhou [2023] Yanwei Jia and Xun Yu Zhou. q-learning in continuous time. Journal of Machine Learning Research, 24(161):1–61, 2023.
  • Jia et al. [2025] Yanwei Jia, Du Ouyang, and Yufei Zhang. Accuracy of discretely sampled stochastic policies in continuous-time reinforcement learning. arXiv preprint arXiv:2503.09981, 2025.
  • Kurtz and Stockbridge [1998] Thomas G Kurtz and Richard H Stockbridge. Existence of Markov controls and characterization of optimal Markov controls. SIAM Journal on Control and Optimization, 36(2):609–653, 1998.
  • Lillicrap et al. [2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Munos [2006] Rémi Munos. Policy gradient in continuous time. Journal of Machine Learning Research, 7:771–791, 2006.
  • Park et al. [2021] Seohong Park, Jaekyeom Kim, and Gunhee Kim. Time discretization-invariant safe action repetition for policy gradient methods. Advances in Neural Information Processing Systems, 34:267–279, 2021.
  • Robbins and Monro [1951] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
  • Schulman et al. [2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897. PMLR, 2015.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Sethi et al. [2025] Deven Sethi, David Šiška, and Yufei Zhang. Entropy annealing for policy mirror descent in continuous time and space. SIAM Journal on Control and Optimization, 63(4):3006–3041, 2025.
  • Silver et al. [2014] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pages 387–395. PMLR, 2014.
  • Silver et al. [2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • Sutton et al. [1998] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
  • Tallec et al. [2019] Corentin Tallec, Léonard Blier, and Yann Ollivier. Making deep q-learning methods robust to time discretization. In International Conference on Machine Learning, pages 6096–6104. PMLR, 2019.
  • Todorov et al. [2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
  • Towers et al. [2024] Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032, 2024.
  • Wang et al. [2020] Haoran Wang, Thaleia Zariphopoulou, and Xun Yu Zhou. Reinforcement learning in continuous time and space: A stochastic control approach. Journal of Machine Learning Research, 21(198):1–34, 2020.
  • Yildiz et al. [2021] Cagatay Yildiz, Markus Heinonen, and Harri Lähdesmäki. Continuous-time model-based reinforcement learning. In International Conference on Machine Learning, pages 12009–12018. PMLR, 2021.
  • Zhang [2017] Jianfeng Zhang. Backward stochastic differential equations. In Backward Stochastic Differential Equations: From Linear to Fully Nonlinear Theory, pages 79–99. Springer, 2017.
  • Zhao et al. [2023] Hanyang Zhao, Wenpin Tang, and David Yao. Policy optimization for continuous reinforcement learning. Advances in Neural Information Processing Systems, 36:13637–13663, 2023.

Appendix A Related Work

Discretization-Invariant Algorithms.

Discretization has long been recognized as a central challenge in continuous control and RL [Baird, 1994; Doya, 2000; Munos, 2006]. More recently, Tallec et al. [2019] showed that Q-learning–based approaches collapse as the discretization step becomes small and introduced the concept of the advantage rate function. Yildiz et al. [2021] tackled this issue through a model-based approach for deterministic ODE dynamics using the Neural ODE framework. Park et al. [2021] demonstrated that conventional policy gradient methods suffer from variance blow-up and proposed action-repetition strategies as a remedy. While these methods mitigate discretization sensitivity to some extent, they are restricted to deterministic dynamics and fail to handle stochasticity, a key feature of real-world environments.

Continuous-Time RL with Stochastic Policies.

Beyond addressing discretization sensitivity, another line of work directly considers continuous dynamics driven by stochastic differential equations. Jia and Zhou [2022a, b] introduced a martingale characterization for policy evaluation and developed an actor–critic algorithm in continuous time. Jia and Zhou [2023] studied the continuous-time analogue of the discrete-time advantage function, namely the $q$-function, and proposed a $q$-learning algorithm. Giegrich et al. [2024]; Sethi et al. [2025] extend natural policy gradient methods to the continuous-time setting, and Zhao et al. [2023] further generalize PPO [Schulman et al., 2017] and TRPO [Schulman et al., 2015] methods to continuous time. However, all of these approaches adopt stochastic policies, which require enforcing Bellman equation constraints that are not tractable in deep RL frameworks. In contrast, our method leverages deterministic policies and enforces the Bellman equation via a simple reparameterization trick, enabling stable integration with deep RL.

Theoretical Issues of Discrete-Time RL.

Although many works have empirically observed that standard discrete-time algorithms degrade under small discretization, the theoretical foundations remain underexplored. Munos [2006]; Park et al. [2021] showed that the variance of policy gradient estimators can diverge as $h\to 0$. Baird [1994]; Tallec et al. [2019] further demonstrated that the standard Q-function degenerates and collapses to the value function. From the perspective of policy evaluation, Jia and Zhou [2022a] proved that the minimizer of the mean-square TD error does not converge to the true value function. Nevertheless, most discrete-time algorithms rely on semi-gradient updates rather than directly minimizing the mean-square TD error. To the best of our knowledge, there has been no theoretical analysis establishing the failure of standard one-step TD methods in the continuous-time setting.

Appendix B Proofs

Notations.

We denote by $C^{1,2}([0,T]\times{\mathbb{R}}^{n})$ the space of continuous functions $u:[0,T]\times{\mathbb{R}}^{n}\to\mathbb{R}$ that are once continuously differentiable in time and twice continuously differentiable in space, and for which there exists a constant $C\geq 0$ such that for all $(t,x)\in[0,T]\times{\mathbb{R}}^{n}$, $|u(t,x)|+|\partial_{t}u(t,x)|\leq C(1+|x|^{2})$, $|\partial_{x}u(t,x)|\leq C(1+|x|)$, $|\partial^{2}_{xx}u(t,x)|\leq C$. We use $\mathcal{P}(S)$ to denote the collection of all probability distributions over $S$. For compactness of notation, we denote by $\tilde{x}_{t}$ the concatenation of time and state $(t,x_{t})$. Finally, we use the standard $\mathcal{O}(\cdot),\Omega(\cdot),\Theta(\cdot)$ notation to omit constant factors.

B.1 Proof of Thm. 3.1

The following performance difference lemma characterizes the difference between the value functions of two different policies; it will be used in proving the policy gradient formula.

Proposition B.1.

Suppose Asp. 1 holds, and define the Hamiltonian $H[\varphi](t,x,a)\coloneqq\mathcal{L}[\varphi](t,x,a)+r(t,x,a)$. Let $\phi\in{\mathbb{R}}^{k}$ and assume $V^{\phi}\in C^{1,2}([0,T]\times{\mathbb{R}}^{n})$. For all $(t,x)\in[0,T]\times{\mathbb{R}}^{n}$ and $\phi^{\prime}\in{\mathbb{R}}^{k}$,

\begin{split}&V^{\phi^{\prime}}(t,x)-V^{\phi}(t,x)\\
&=\mathbb{E}\bigg[\int_{t}^{T}e^{-\beta(s-t)}\Big(H[V^{\phi}](s,X^{\phi^{\prime}}_{s},\mu_{\phi^{\prime}}(s,X^{\phi^{\prime}}_{s}))-H[V^{\phi}](s,X^{\phi^{\prime}}_{s},\mu_{\phi}(s,X^{\phi^{\prime}}_{s}))\Big)\mathrm{d}s\,\bigg|\,X^{\phi^{\prime}}_{t}=x\bigg].\end{split} (B.1)
Proof of Prop. B.1.

Observe that under Asp. 1, for each $\phi\in{\mathbb{R}}^{k}$ and $(t,x)\in[0,T]\times{\mathbb{R}}^{n}$,

V^{\phi}(t,x)=\mathbb{E}\left[\int_{t}^{T}e^{-\beta(s-t)}r(s,X_{s}^{t,x,\phi},\mu_{\phi}(s,X_{s}^{t,x,\phi}))\mathrm{d}s+e^{-\beta(T-t)}g(X_{T}^{t,x,\phi})\right], (B.2)

where $(X_{s}^{t,x,\phi})_{s\geq t}$ satisfies, for all $s\in[t,T]$,

\mathrm{d}X_{s}=b(s,X_{s},\mu_{\phi}(s,X_{s}))\mathrm{d}s+\sigma(s,X_{s},\mu_{\phi}(s,X_{s}))\mathrm{d}W_{s},\quad X_{t}=x. (B.3)

Fix $\phi^{\prime}\in{\mathbb{R}}^{k}$. Denote by $X^{\phi}=X^{t,x,\phi}$ and $X^{\phi^{\prime}}=X^{t,x,\phi^{\prime}}$ for simplicity. Then

\begin{split}&V^{\phi^{\prime}}(t,x)-V^{\phi}(t,x)\\
&=\mathbb{E}\left[\int_{t}^{T}e^{-\beta(s-t)}r(s,X^{\phi^{\prime}}_{s},\mu_{\phi^{\prime}}(s,X_{s}^{\phi^{\prime}}))\mathrm{d}s\right]+e^{-\beta(T-t)}\mathbb{E}\left[g(X^{\phi^{\prime}}_{T})\right]-V^{\phi}(t,x)\\
&=\mathbb{E}\left[\int_{t}^{T}e^{-\beta(s-t)}r(s,X^{\phi^{\prime}}_{s},\mu_{\phi^{\prime}}(s,X_{s}^{\phi^{\prime}}))\mathrm{d}s\right]+\mathbb{E}\left[e^{-\beta(T-t)}V^{\phi}(T,X^{\phi^{\prime}}_{T})\right]-V^{\phi}(t,X^{\phi^{\prime}}_{t}),\end{split} (B.4)

where the last identity used the fact that $V^{\phi}(T,x)=g(x)$ and $X^{\phi^{\prime}}_{t}=x$. As $V^{\phi}\in C^{1,2}([0,T]\times{\mathbb{R}}^{n})$, applying Itô’s formula to $s\mapsto e^{-\beta(s-t)}V^{\phi}(s,X^{\phi^{\prime}}_{s})$ yields

\begin{split}&\mathbb{E}\left[e^{-\beta(T-t)}V^{\phi}(T,X^{\phi^{\prime}}_{T})\right]-V^{\phi}(t,X^{\phi^{\prime}}_{t})\\
&=\mathbb{E}\left[\int_{t}^{T}e^{-\beta(s-t)}\mathcal{L}[V^{\phi}](s,X^{\phi^{\prime}}_{s},\mu_{\phi^{\prime}}(s,X^{\phi^{\prime}}_{s}))\mathrm{d}s\right]\\
&=\mathbb{E}\left[\int_{t}^{T}e^{-\beta(s-t)}\left(\left(\mathcal{L}[V^{\phi}](s,y,\mu_{\phi^{\prime}}(s,y))-\mathcal{L}[V^{\phi}](s,y,\mu_{\phi}(s,y))\right)\Big|_{y=X^{\phi^{\prime}}_{s}}+\mathcal{L}[V^{\phi}](s,X^{\phi^{\prime}}_{s},\mu_{\phi}(s,X^{\phi^{\prime}}_{s}))\right)\mathrm{d}s\right]\\
&=\mathbb{E}\left[\int_{t}^{T}e^{-\beta(s-t)}\left(\left(\mathcal{L}[V^{\phi}](s,y,\mu_{\phi^{\prime}}(s,y))-\mathcal{L}[V^{\phi}](s,y,\mu_{\phi}(s,y))\right)\Big|_{y=X^{\phi^{\prime}}_{s}}-r(s,X^{\phi^{\prime}}_{s},\mu_{\phi}(s,X^{\phi^{\prime}}_{s}))\right)\mathrm{d}s\right],\end{split}

where the last identity used the PDE Eq. 3.2. This along with Eq. B.4 proves the desired result. ∎

Proof of Thm. 3.1.

Recall that $\partial_{\phi}V^{\phi}(t,x)=(\partial_{\phi_{1}}V^{\phi}(t,x),\ldots,\partial_{\phi_{k}}V^{\phi}(t,x))^{\top}$. Hence it suffices to prove that for all $\phi^{\prime}\in{\mathbb{R}}^{k}$,

\frac{\mathrm{d}}{\mathrm{d}\epsilon}V^{\phi+\epsilon\phi^{\prime}}(t,x)\bigg|_{\epsilon=0}=\mathbb{E}\left[\int_{t}^{T}e^{-\beta(s-t)}\partial_{a}A^{\phi}(s,X^{\phi}_{s},\mu_{\phi}(s,X^{\phi}_{s}))^{\top}\partial_{\phi}\mu_{\phi}(s,X^{\phi}_{s})\mathrm{d}s\,\bigg|\,X^{\phi}_{t}=x\right]\phi^{\prime}.

To this end, for all $\epsilon\in[-1,1]$, let $X^{\epsilon}$ be the solution to the following dynamics:

\mathrm{d}X_{s}=b(s,X_{s},\mu_{\phi+\epsilon\phi^{\prime}}(s,X_{s}))\mathrm{d}s+\sigma(s,X_{s},\mu_{\phi+\epsilon\phi^{\prime}}(s,X_{s}))\mathrm{d}W_{s},\quad X_{t}=x. (B.5)

For all $\epsilon\in[-1,1]$, by Prop. B.1 and the fundamental theorem of calculus,

\frac{V^{\phi+\epsilon\phi^{\prime}}(t,x)-V^{\phi}(t,x)}{\epsilon}=\mathbb{E}\bigg[\int_{t}^{T}e^{-\beta(s-t)}\left(\int_{0}^{1}\mathcal{G}(s,X^{\epsilon}_{s},\phi+r\epsilon\phi^{\prime})\mathrm{d}r\right)\mathrm{d}s\bigg]\phi^{\prime}, (B.6)

where for all $\tilde{\phi}\in{\mathbb{R}}^{k}$,

\mathcal{G}(t,x,\tilde{\phi})\coloneqq\partial_{a}H[V^{\phi}](t,x,\mu_{\tilde{\phi}}(t,x))^{\top}\partial_{\phi}\mu_{{\tilde{\phi}}}(t,x).

To show the limit of Eq. B.6 as ϵ0\displaystyle\epsilon\to 0, observe that by Asp. 2 and standard stability analysis of Eq. B.5 (see e.g., [Zhang, 2017, Theorem 3.2.4]), for all ϵ[1,1]\displaystyle\epsilon\in[-1,1],

𝔼[suptsT|XsϵXs0|2]C𝔼[(0T|b(s,Xs0,μϕ+ϵϕ(s,Xs0))b(s,Xs0,μϕ(s,Xs0))|ds)2]+C𝔼[0T|σ(s,Xs0,μϕ+ϵϕ(s,Xs0))σ(s,Xs0,μϕ(s,Xs0))|2ds],\displaystyle\displaystyle\begin{split}\mathbb{E}\left[\sup_{t\leq s\leq T}|X_{s}^{\epsilon}-X_{s}^{0}|^{2}\right]&\leq C\mathbb{E}\left[\left(\int_{0}^{T}|b(s,X_{s}^{0},\mu_{\phi+\epsilon\phi^{\prime}}(s,X_{s}^{0}))-b(s,X_{s}^{0},\mu_{\phi}(s,X_{s}^{0}))|\mathrm{d}s\right)^{2}\right]\\ &\quad+C\mathbb{E}\left[\int_{0}^{T}|\sigma(s,X_{s}^{0},\mu_{\phi+\epsilon\phi^{\prime}}(s,X_{s}^{0}))-\sigma(s,X_{s}^{0},\mu_{\phi}(s,X_{s}^{0}))|^{2}\mathrm{d}s\right],\end{split}

which, along with the growth condition in Asp. 1, the regularity of $b$, $\sigma$ and $\mu_{\phi}$ in Asp. 2, and the dominated convergence theorem, shows that

\lim_{\epsilon\to 0}\mathbb{E}\left[\sup_{t\leq s\leq T}|X_{s}^{\epsilon}-X_{s}^{0}|^{2}\right]=0. (B.7)

Moreover, there exists $C\geq 0$ such that for all $\epsilon\in[-1,1]$ and $A\in\mathcal{F}\otimes\mathcal{B}([0,T])\otimes\mathcal{B}([0,1])$,

\begin{split}
&\mathbb{E}\bigg[\int_{t}^{T}\int_{0}^{1}\mathbf{1}_{A}e^{-\beta(s-t)}\left|\mathcal{G}(s,X^{\epsilon}_{s},\phi+r\epsilon\phi^{\prime})\right|\mathrm{d}r\mathrm{d}s\bigg]\\
&\leq\mathbb{E}\bigg[\int_{t}^{T}\int_{0}^{1}\mathbf{1}_{A}\mathrm{d}r\mathrm{d}s\bigg]^{\frac{1}{2}}\mathbb{E}\bigg[\int_{t}^{T}\int_{0}^{1}e^{-2\beta(s-t)}\left|\mathcal{G}(s,X^{\epsilon}_{s},\phi+r\epsilon\phi^{\prime})\right|^{2}\mathrm{d}r\mathrm{d}s\bigg]^{\frac{1}{2}}\\
&\leq C\mathbb{E}\bigg[\int_{t}^{T}\int_{0}^{1}\mathbf{1}_{A}\mathrm{d}r\mathrm{d}s\bigg]^{\frac{1}{2}}\left(1+\mathbb{E}\left[\sup_{t\leq s\leq T}|X_{s}^{\epsilon}|^{2}\right]\right)^{\frac{1}{2}},
\end{split}

where the last inequality used the growth conditions on the derivatives of the coefficients $b,\sigma,r$ and $\mu$, and of the value function $V^{\phi}$. Using the moment condition $\sup_{\epsilon\in[-1,1]}\mathbb{E}\left[\sup_{t\leq s\leq T}|X_{s}^{\epsilon}|^{2}\right]<\infty$, the random variables $\{(\omega,s,r)\mapsto e^{-\beta(s-t)}\mathcal{G}(s,X^{\epsilon}_{s},\phi+r\epsilon\phi^{\prime})\mid\epsilon\in[-1,1]\}$ are uniformly integrable. Hence applying Vitali’s convergence theorem and passing $\epsilon\to 0$ in Eq. B.6 yields the desired identity. ∎

B.2 Proof of Thm. 3.2

Proof.

For all $(t,x)\in[0,T]\times\mathbb{R}^{n}$ and $a\in\mathcal{O}_{\mu_{\phi}(t,x)}$, applying Itô’s formula to $u\mapsto e^{-\beta(u-t)}\hat{V}(u,X_{u}^{t,x,a})$ yields, for all $0\leq t<s\leq T$,

\begin{split}
e^{-\beta(s-t)}\hat{V}(s,X_{s}^{t,x,a})-\hat{V}(t,X_{t}^{t,x,a})&=\int_{t}^{s}e^{-\beta(u-t)}\mathcal{L}[\hat{V}](u,X_{u}^{t,x,a},\alpha_{u})\mathrm{d}u\\
&\quad+\int_{t}^{s}e^{-\beta(u-t)}\partial_{x}\hat{V}(u,X_{u}^{t,x,a})^{\top}\sigma(u,X_{u}^{t,x,a},\alpha_{u})\mathrm{d}W_{u}.
\end{split} (B.8)

This along with the martingale condition Eq. 3.7 implies

\left(\int_{t}^{s}e^{-\beta(u-t)}\left(\mathcal{L}[\hat{V}]+r-\hat{q}\right)(u,X_{u}^{t,x,a},\alpha_{u})\mathrm{d}u\right)_{s\in[t,T]}

is a martingale with continuous paths and finite variation. A continuous martingale of finite variation is almost surely constant, and since this one starts from zero, almost surely

\int_{t}^{s}e^{-\beta(u-t)}\left(\mathcal{L}[\hat{V}]+r-\hat{q}\right)(u,X_{u}^{t,x,a},\alpha_{u})\mathrm{d}u=0,\quad\forall s\in[t,T]. (B.9)

We claim $(\mathcal{L}[\hat{V}]+r-\hat{q})(t,x,a)=0$ for all $(t,x)\in[0,T]\times\mathbb{R}^{n}$ and $a\in\mathcal{O}_{\mu_{\phi}(t,x)}$. To see this, define $f(t,x,a)\coloneqq(\mathcal{L}[\hat{V}]+r-\hat{q})(t,x,a)$ for all $(t,x,a)\in[0,T]\times\mathbb{R}^{n}\times\mathcal{A}$. By assumption, $f\in C([0,T]\times{\mathbb{R}}^{n}\times\mathcal{A})$. Suppose there exist $(\bar{t},\bar{x})\in[0,T]\times{\mathbb{R}}^{n}$ and $\bar{a}\in\mathcal{O}_{\mu_{\phi}(\bar{t},\bar{x})}$ such that $f(\bar{t},\bar{x},\bar{a})\not=0$. Due to the continuity of $f$, we may assume without loss of generality that $f(\bar{t},\bar{x},\bar{a})>0$ and $\bar{t}\in[0,T)$. The continuity of $f$ then implies that there exist constants $\epsilon,\delta>0$ such that $f(t,x,a)\geq\epsilon>0$ for all $(t,x,a)\in[0,T]\times{\mathbb{R}}^{n}\times\mathcal{A}$ with $\max\{|t-\bar{t}|,|x-\bar{x}|,|a-\bar{a}|\}\leq\delta$. Now consider the process $X^{\bar{t},\bar{x},\bar{a}}$ defined by Eq. 3.8, and define the stopping time

\tau\coloneqq\inf\left\{t\in[\bar{t},T]\mid\max\{|t-\bar{t}|,|X^{\bar{t},\bar{x},\bar{a}}_{t}-\bar{x}|,|\alpha_{t}-\bar{a}|\}>\delta\right\}.

Note that $\tau>\bar{t}$ almost surely, due to the sample path continuity of $t\mapsto X^{\bar{t},\bar{x},\bar{a}}_{t}$ and the condition $\lim_{s\searrow\bar{t}}\alpha_{s}=\bar{a}$. This along with Eq. B.9 implies that there exists a measure-zero set $\mathcal{N}$ such that for all $\omega\in\Omega\setminus\mathcal{N}$, $\tau(\omega)>\bar{t}$ and

\int_{\bar{t}}^{\tau(\omega)}e^{-\beta(u-\bar{t})}f\left(u,X_{u}^{\bar{t},\bar{x},\bar{a}}(\omega),\alpha_{u}(\omega)\right)\mathrm{d}u=0.

However, by the definition of $\tau$, for all $t\in(\bar{t},\tau(\omega))$, $\max\{|t-\bar{t}|,|X^{\bar{t},\bar{x},\bar{a}}_{t}-\bar{x}|,|\alpha_{t}-\bar{a}|\}\leq\delta$, which along with the choice of $\delta$ implies $f(t,X_{t}^{\bar{t},\bar{x},\bar{a}}(\omega),\alpha_{t}(\omega))\geq\epsilon>0$ and hence

\int_{\bar{t}}^{\tau(\omega)}e^{-\beta(u-\bar{t})}f\left(u,X_{u}^{\bar{t},\bar{x},\bar{a}}(\omega),\alpha_{u}(\omega)\right)\mathrm{d}u>0.

This yields a contradiction, and proves $(\mathcal{L}[\hat{V}]+r-\hat{q})(t,x,a)=0$ for all $(t,x)\in[0,T]\times\mathbb{R}^{n}$ and $a\in\mathcal{O}_{\mu_{\phi}(t,x)}$.

Now by Eq. 3.6, for all $(t,x)\in[0,T]\times\mathbb{R}^{n}$,

(\mathcal{L}[\hat{V}]+r)(t,x,\mu_{\phi}(t,x))=0,\quad\hat{V}(T,x)=g(x).
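For clarity, we record the Feynman-Kac representation invoked in the next step (a standard statement, written here for reference under the stated regularity, with $\mathcal{L}$ as defined in the main text, which, as in Eq. B.10, already contains the discount term $-\beta$): if $W\in C^{1,2}([0,T]\times\mathbb{R}^{n})$ solves $(\mathcal{L}[W]+r)(t,x,\mu_{\phi}(t,x))=0$ with $W(T,x)=g(x)$ and satisfies suitable growth conditions, then

W(t,x)=\mathbb{E}\left[\int_{t}^{T}e^{-\beta(s-t)}r\big(s,X^{\phi}_{s},\mu_{\phi}(s,X^{\phi}_{s})\big)\mathrm{d}s+e^{-\beta(T-t)}g(X^{\phi}_{T})\,\bigg|\,X^{\phi}_{t}=x\right],

so any two such classical solutions coincide.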

Since $V^{\phi}\in C^{1,2}([0,T]\times{\mathbb{R}}^{n})$ satisfies the same PDE, the Feynman-Kac formula shows that $\hat{V}(t,x)=V^{\phi}(t,x)$ for all $(t,x)$. This subsequently implies $(\mathcal{L}[V^{\phi}]+r-\hat{q})(t,x,a)=0$ for all $(t,x)\in[0,T]\times\mathbb{R}^{n}$ and $a\in\mathcal{O}_{\mu_{\phi}(t,x)}$. ∎

B.3 Proof of Prop. 4.1

Proof.

By Itô’s formula,

\begin{split}
e^{-\beta h}V_{\theta}(\tilde{x}_{t+h})-V_{\theta}(\tilde{x}_{t})&=\underbrace{\int_{t}^{t+h}e^{-\beta(s-t)}\left[\partial_{t}V_{\theta}(\tilde{x}_{s})+\partial_{x}V_{\theta}(\tilde{x}_{s})^{\top}b(\tilde{x}_{s},a_{s})+\frac{1}{2}\operatorname{Tr}(\partial_{xx}^{2}V_{\theta}(\tilde{x}_{s})\sigma\sigma^{\top}(\tilde{x}_{s},a_{s}))-\beta V_{\theta}(\tilde{x}_{s})\right]\mathrm{d}s}_{\text{\ding{172}}}\\
&\quad+\underbrace{\int_{t}^{t+h}e^{-\beta(s-t)}\partial_{x}V_{\theta}(\tilde{x}_{s})^{\top}\sigma(\tilde{x}_{s},a_{s})\mathrm{d}W_{s}}_{\text{\ding{173}}}.
\end{split} (B.10)

Note that the stochastic integral \text{\ding{173}} is a martingale and thus vanishes after taking expectations; since $\partial_{\theta}V_{\theta}(\tilde{x}_{t})$ is $\mathcal{F}_{t}$-measurable, the semi-gradient can therefore be rewritten as

G_{\theta,h}=\mathbb{E}\left[\partial_{\theta}V_{\theta}(\tilde{x}_{t})\left(-\text{\ding{172}}\cdot\frac{1}{h}+(A_{\psi}(\tilde{x}_{t},a_{t})-r_{t})\right)\right]. (B.11)

As the discretization step $h$ goes to zero, the integral \text{\ding{172}} admits a first-order expansion (made explicit below Eq. B.12), which leads to

\lim_{h\to 0}G_{\theta,h}=\mathbb{E}\left[\partial_{\theta}V_{\theta}(\tilde{x}_{t})\left(A_{\psi}(\tilde{x}_{t},a_{t})-\partial_{t}V_{\theta}(\tilde{x}_{t})-H(\tilde{x}_{t},a_{t},\partial_{x}V_{\theta}(\tilde{x}_{t}),\partial_{xx}^{2}V_{\theta}(\tilde{x}_{t}))+\beta V_{\theta}(\tilde{x}_{t})\right)\right]. (B.12)
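Explicitly, assuming (as in the discretized scheme) that the action is held constant over each step, i.e. $a_{s}=a_{t}$ for $s\in[t,t+h)$, and using the continuity of the integrand in Eq. B.10, a minimal worked form of the expansion is

\lim_{h\to 0}\frac{\text{\ding{172}}}{h}=\partial_{t}V_{\theta}(\tilde{x}_{t})+\partial_{x}V_{\theta}(\tilde{x}_{t})^{\top}b(\tilde{x}_{t},a_{t})+\frac{1}{2}\operatorname{Tr}\big(\partial_{xx}^{2}V_{\theta}(\tilde{x}_{t})\sigma\sigma^{\top}(\tilde{x}_{t},a_{t})\big)-\beta V_{\theta}(\tilde{x}_{t})\quad\text{almost surely};

exchanging this limit with the expectation (justified by the boundedness assumptions) and substituting into Eq. B.11 yields Eq. B.12; Eq. B.13 follows in the same way with $\partial_{\psi}A_{\psi}$ in place of $\partial_{\theta}V_{\theta}$.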

Similarly we have

\lim_{h\to 0}G_{\psi,h}=\mathbb{E}\left[\partial_{\psi}A_{\psi}(\tilde{x}_{t},a_{t})\left(A_{\psi}(\tilde{x}_{t},a_{t})-\partial_{t}V_{\theta}(\tilde{x}_{t})-H(\tilde{x}_{t},a_{t},\partial_{x}V_{\theta}(\tilde{x}_{t}),\partial_{xx}^{2}V_{\theta}(\tilde{x}_{t}))+\beta V_{\theta}(\tilde{x}_{t})\right)\right]. (B.13)

On the other hand, consider the conditional variance of the stochastic gradient:

\mathrm{Var}(g_{\theta,h}\mid\mathcal{F}_{t})=\frac{1}{h^{2}}\partial_{\theta}V_{\theta}(\tilde{x}_{t})\partial_{\theta}V_{\theta}(\tilde{x}_{t})^{\top}\mathrm{Var}(e^{-\beta h}V_{\theta}(\tilde{x}_{t+h})-V_{\theta}(\tilde{x}_{t})\mid\mathcal{F}_{t}). (B.14)

Note that

\mathbb{E}[(e^{-\beta h}V_{\theta}(\tilde{x}_{t+h})-V_{\theta}(\tilde{x}_{t}))^{2}\mid\mathcal{F}_{t}]=\mathbb{E}[\text{\ding{172}}^{2}+2\cdot\text{\ding{172}}\cdot\text{\ding{173}}+\text{\ding{173}}^{2}\mid\mathcal{F}_{t}], (B.15)

and $\mathbb{E}[e^{-\beta h}V_{\theta}(\tilde{x}_{t+h})-V_{\theta}(\tilde{x}_{t})\mid\mathcal{F}_{t}]=\mathbb{E}[\text{\ding{172}}\mid\mathcal{F}_{t}]$. This yields

\mathrm{Var}(e^{-\beta h}V_{\theta}(\tilde{x}_{t+h})-V_{\theta}(\tilde{x}_{t})\mid\mathcal{F}_{t})=\mathrm{Var}(\text{\ding{172}}\mid\mathcal{F}_{t})+\mathbb{E}\left[\text{\ding{173}}^{2}+2\cdot\text{\ding{172}}\cdot\text{\ding{173}}\mid\mathcal{F}_{t}\right]\geq\mathbb{E}\left[\text{\ding{173}}^{2}+2\cdot\text{\ding{172}}\cdot\text{\ding{173}}\mid\mathcal{F}_{t}\right]. (B.16)

By the Cauchy-Schwarz inequality (for \text{\ding{172}}) and the Itô isometry (for \text{\ding{173}}),

\mathbb{E}[\text{\ding{172}}^{2}\mid\mathcal{F}_{t}]=\mathcal{O}(h^{2}), (B.17)
\mathbb{E}[\text{\ding{173}}^{2}\mid\mathcal{F}_{t}]=\mathbb{E}\left[\int_{t}^{t+h}e^{-2\beta(s-t)}\|\partial_{x}V_{\theta}(\tilde{x}_{s})^{\top}\sigma(\tilde{x}_{s},a_{s})\|^{2}\mathrm{d}s\,\bigg|\,\tilde{x}_{t}\right]=\mathcal{O}(h), (B.18)

and the cross term can be controlled by Cauchy-Schwarz:

\mathbb{E}[|\text{\ding{172}}\cdot\text{\ding{173}}|\mid\tilde{x}_{t}]\leq(\mathbb{E}[\text{\ding{172}}^{2}\mid\tilde{x}_{t}])^{\frac{1}{2}}\cdot(\mathbb{E}[\text{\ding{173}}^{2}\mid\tilde{x}_{t}])^{\frac{1}{2}}=\mathcal{O}(h^{\frac{3}{2}}). (B.19)
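The only additional ingredient needed for the last line of Eq. B.20 below is the limit of the leading term: assuming again that the action is held constant over the step and that the integrand in Eq. B.18 is continuous in $s$, a minimal worked step is

\lim_{h\to 0}\frac{1}{h}\mathbb{E}\big[\text{\ding{173}}^{2}\mid\mathcal{F}_{t}\big]=\|\partial_{x}V_{\theta}(\tilde{x}_{t})^{\top}\sigma(\tilde{x}_{t},a_{t})\|^{2},\qquad\lim_{h\to 0}\frac{1}{h}\mathbb{E}\big[|\text{\ding{172}}\cdot\text{\ding{173}}|\mid\mathcal{F}_{t}\big]=0,

which follow from Eq. B.18 and Eq. B.19, respectively.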

These estimates show that, as $h\to 0$, the leading contribution to the variance comes from the stochastic integral term \text{\ding{173}}. As a result, combining Fatou’s lemma with Eq. B.14, we conclude that

\begin{split}
\lim_{h\to 0}h\cdot\mathrm{Var}(g_{\theta,h})&\geq\lim_{h\to 0}h\cdot\mathbb{E}[\mathrm{Var}(g_{\theta,h}\mid\mathcal{F}_{t})]\\
&\geq\mathbb{E}\left[\lim_{h\to 0}\big[h\cdot\mathrm{Var}(g_{\theta,h}\mid\mathcal{F}_{t})\big]\right]\\
&\geq\mathbb{E}\left[\partial_{\theta}V_{\theta}(\tilde{x}_{t})\partial_{\theta}V_{\theta}(\tilde{x}_{t})^{\top}\|\partial_{x}V_{\theta}(\tilde{x}_{t})^{\top}\sigma(\tilde{x}_{t},a_{t})\|^{2}\right].
\end{split} (B.20) ∎

B.4 Proof of Prop. 4.2

Proof.

We begin by recalling that, for any horizon $Lh$, Itô’s formula yields

\begin{split}
e^{-\beta Lh}V_{\theta}(\tilde{x}_{t+Lh})-V_{\theta}(\tilde{x}_{t})&=\underbrace{\int_{t}^{t+Lh}e^{-\beta(s-t)}\left[\partial_{t}V_{\theta}(\tilde{x}_{s})+\partial_{x}V_{\theta}(\tilde{x}_{s})^{\top}b(\tilde{x}_{s},a_{s})+\frac{1}{2}\operatorname{Tr}(\partial_{xx}^{2}V_{\theta}(\tilde{x}_{s})\sigma\sigma^{\top}(\tilde{x}_{s},a_{s}))-\beta V_{\theta}(\tilde{x}_{s})\right]\mathrm{d}s}_{\text{\ding{174}}}\\
&\quad+\underbrace{\int_{t}^{t+Lh}e^{-\beta(s-t)}\partial_{x}V_{\theta}(\tilde{x}_{s})^{\top}\sigma(\tilde{x}_{s},a_{s})\mathrm{d}W_{s}}_{\text{\ding{175}}}.
\end{split} (B.21)

Now consider the case where $Lh\equiv\delta>0$ is fixed while $h\to 0$. In this regime, the estimator $G_{\theta,h,\delta/h}$ can be expressed as

\begin{split}
\lim_{h\to 0}G_{\theta,h,\frac{\delta}{h}}&=\mathbb{E}\left[\partial_{\theta}V_{\theta}(\tilde{x}_{t})\left(V_{\theta}(\tilde{x}_{t})-\int_{t}^{t+\delta}e^{-\beta(s-t)}[r_{s}-q_{\psi}(\tilde{x}_{s},a_{s})]\mathrm{d}s-e^{-\beta\delta}V_{\theta}(\tilde{x}_{t+\delta})\right)\right]\\
&=\mathbb{E}\left[\partial_{\theta}V_{\theta}(\tilde{x}_{t})\left(\int_{t}^{t+\delta}e^{-\beta(s-t)}\left[q_{\psi}(\tilde{x}_{s},a_{s})-\partial_{t}V_{\theta}(\tilde{x}_{s})-H(\tilde{x}_{s},a_{s},\partial_{x}V_{\theta}(\tilde{x}_{s}),\partial_{xx}^{2}V_{\theta}(\tilde{x}_{s}))+\beta V_{\theta}(\tilde{x}_{s})\right]\mathrm{d}s\right)\right]\\
&=\Theta(1).
\end{split} (B.22)

Since the integral is taken over a fixed interval of length $\delta$, this expression is of order one: it remains bounded but does not vanish as $h\to 0$.

We next turn to the variance. Expanding the definition of $g_{\theta,h,L}$ and using Jensen’s inequality, we obtain

\begin{split}
\mathrm{Var}(g_{\theta,h,L})&\leq 2\mathbb{E}\left[\partial_{\theta}V_{\theta}(\tilde{x}_{t})\partial_{\theta}V_{\theta}(\tilde{x}_{t})^{\top}\left((e^{-\beta Lh}V_{\theta}(\tilde{x}_{t+Lh})-V_{\theta}(\tilde{x}_{t}))^{2}+\Big(\sum_{l=0}^{L-1}e^{-\beta lh}[r_{t+lh}-q_{\psi}(\tilde{x}_{t+lh},a_{t+lh})]h\Big)^{2}\right)\right]\\
&=\mathcal{O}(1).
\end{split} (B.23)

This is because, for fixed $\delta=Lh$, all terms inside the expectation are bounded uniformly in $h$. ∎

Appendix C Experiment Details

Model architecture.

Across all experiments, the policy, Q-network, and value network are implemented as three-layer fully connected MLPs with ReLU activations. The hidden dimension is set to 400, except for Pendulum, where we use 64. To incorporate time information, we augment the environment observations with a sinusoidal embedding, yielding $\tilde{x}_{t}=(x_{t},\cos(\tfrac{2\pi t}{T}),\sin(\tfrac{2\pi t}{T}))$, where $T$ denotes the maximum horizon. For stochastic policies, we employ Gaussian policies with mean and variance parameterized by neural networks, and fix the entropy coefficient to $\gamma=0.1$.
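For concreteness, a minimal sketch of this time augmentation (illustrative only; the function and variable names below are ours and not taken from the released code):

\begin{verbatim}
import numpy as np

def augment_with_time(x_t, t, T):
    """Append the sinusoidal time embedding (cos(2*pi*t/T), sin(2*pi*t/T))
    to a raw observation x_t, for current time t and maximum horizon T."""
    phase = 2.0 * np.pi * t / T
    embedding = np.array([np.cos(phase), np.sin(phase)], dtype=np.float32)
    return np.concatenate([np.asarray(x_t, dtype=np.float32), embedding])
\end{verbatim}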

Environment setup.

To accelerate training, we run 8 environments in parallel, collecting 8 trajectories per episode. The discount rate is set to $\beta=0.8$ and applied in the form $e^{-\beta h}$. For MuJoCo environments, we set \texttt{terminate\_when\_unhealthy=False}.
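As a small illustration of how the per-step discount $e^{-\beta h}$ enters a discretized return (a sketch under our reading of the setup; treating the observed reward as a rate, so that each step contributes $r\cdot h$, is our assumption):

\begin{verbatim}
import numpy as np

def discounted_return(rewards, h, beta=0.8):
    """Discretized discounted return with per-step factor exp(-beta * h).
    Illustrative sketch only: rewards[l] is the reward rate on the l-th
    step of size h, so each step contributes rewards[l] * h."""
    gamma_h = np.exp(-beta * h)  # per-step discount, consistent across step sizes h
    ret, disc = 0.0, 1.0
    for r in rewards:
        ret += disc * r * h
        disc *= gamma_h
    return ret
\end{verbatim}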

Training hyperparameters.

We use the Adam optimizer with a learning rate of $3\times 10^{-4}$ for all networks ($3\times 10^{-3}$ for Pendulum), and a batch size of $B=256$. The update frequency is $m=1$ in the original environment and $m=5$ for smaller step sizes $h$. The soft target update parameter is $\tau=0.005$. The weight for the terminal value constraint is $\alpha=0.002$. For CT-DDPG, the trajectory length $L$ is sampled uniformly from $[2,10]$, and we use exploration noise with standard deviation $\sigma_{\text{explore}}=0.1$. For q-learning, for each state $\tilde{x}$ in the minibatch, we sample $n=20$ actions from $\pi(\cdot\mid\tilde{x})$ and compute the penalty term $\big(\frac{1}{n}\sum_{i=1}^{n}\big[q_{\psi}(\tilde{x},a_{i})-\gamma\log\pi(a_{i}\mid\tilde{x})\big]\big)^{2}$.
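For concreteness, a minimal sketch of this penalty computation (illustrative only: \texttt{q\_net}, \texttt{policy}, and their interfaces below are placeholders for the actual networks, and the diagonal-Gaussian form of $\pi$ is our assumption):

\begin{verbatim}
import torch

def q_constraint_penalty(q_net, policy, x_tilde, n=20, gamma_coef=0.1):
    """Monte-Carlo penalty ((1/n) sum_i [q(x, a_i) - gamma * log pi(a_i|x)])^2,
    averaged over the minibatch. Assumed interfaces: policy(x) returns a
    torch.distributions.Normal over actions (diagonal Gaussian) and
    q_net(x, a) returns per-sample q-values."""
    dist = policy(x_tilde)                                   # pi(. | x), batch of size B
    actions = dist.rsample((n,))                             # [n, B, action_dim]
    log_probs = dist.log_prob(actions).sum(-1)               # [n, B]; sum over action dims
    x_rep = x_tilde.unsqueeze(0).expand(n, *x_tilde.shape)   # broadcast states to match actions
    q_vals = q_net(x_rep, actions).squeeze(-1)               # [n, B]
    residual = (q_vals - gamma_coef * log_probs).mean(0)     # empirical mean over n sampled actions
    return (residual ** 2).mean()                            # square, then average over the batch
\end{verbatim}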

Figure 4: Comparison between all algorithms.