Bridging Discrete and Continuous RL: Stable Deterministic Policy Gradient with Martingale Characterization
Abstract
The theory of discrete-time reinforcement learning (RL) has advanced rapidly over the past decades. Although this theory is primarily designed for discrete-time environments, many real-world RL applications are inherently continuous and complex. A major challenge in extending discrete-time algorithms to continuous-time settings is their sensitivity to time discretization, which often leads to poor stability and slow convergence. In this paper, we investigate deterministic policy gradient methods for continuous-time RL. We derive a continuous-time policy gradient formula based on an analogue of the advantage function and establish its martingale characterization. This theoretical foundation leads to our proposed algorithm, CT-DDPG, which enables stable learning with deterministic policies in continuous-time environments. Numerical experiments show that the proposed CT-DDPG algorithm offers improved stability and faster convergence compared to existing discrete-time and continuous-time methods, across a wide range of control tasks with varying time discretizations and noise levels.
1 Introduction
Deep reinforcement learning (RL) has achieved remarkable success over the past decade, powered by theoretical advances and the success of algorithms in discrete-time systems such as Atari, Go, and large language models [Mnih et al., 2013; Silver et al., 2016; Guo et al., 2025]. However, many real-world problems, such as robotic control, autonomous driving, and financial trading, are inherently continuous in time. In these domains, agents need to interact with the environment at an ultra-high frequency, underscoring the need for continuous-time RL approaches [Wang et al., 2020].
One major challenge in applying discrete-time RL to continuous-time environments is the sensitivity to the discretization step size. As the step size decreases, standard algorithms often degrade, resulting in exploding variance, poor stability, and slow convergence. While several works have attempted to resolve this issue with discretization-invariant algorithms [Tallec et al., 2019; Park et al., 2021], their underlying design principles are rooted in discrete-time RL. As a result, these methods are not robust when applied to complex, stochastic, and continuous real-world environments.
Recently, there has been a fast-growing body of research on continuous-time RL [Yildiz et al., 2021; Jia and Zhou, 2022a, b, 2023; Zhao et al., 2023; Giegrich et al., 2024], including rigorous mathematical formulations and various algorithmic designs. However, most existing methods either rely on model-based assumptions, or consider stochastic policies, which are difficult to sample in continuous time, state, and action spaces [Jia et al., 2025] and impose Bellman equation constraints that are not feasible to implement within deep RL frameworks. These challenges hinder the application of the continuous-time RL framework in practice, leading to an important research question:
Can we develop a theoretically grounded algorithm that achieves stability and efficiency for deep RL in continuous-time environments?
In this paper, we address this question by investigating deterministic policy gradient (DPG) methods. We consider general continuous-time dynamics driven by a stochastic differential equation over a finite horizon. Our main contributions are summarized as follows:
• In Sec. 3, we develop a rigorous mathematical framework for model-free DPG methods in continuous-time RL. Specifically, Thm. 3.1 derives the DPG formula based on the advantage rate function. Thm. 3.2 further utilizes a martingale criterion to characterize the advantage rate function, laying the foundation for the subsequent algorithm design. We also provide detailed comparisons against existing continuous-time RL algorithms with stochastic policies and discuss their major flaws and impracticality in deep RL frameworks.
• In Sec. 4, we propose CT-DDPG, a novel and practical actor-critic algorithm with provable stability and efficiency in continuous-time environments. Notably, we utilize a multi-step TD objective and prove its robustness to time discretization and stochastic noise in Sec. 4.2. For the first time, we provide theoretical insights into the failure of standard discrete-time deep RL algorithms in continuous and stochastic settings.
• Through extensive experiments in Sec. 5, we verify that existing discrete- and continuous-time algorithms lack robustness to time discretization and dynamic noise, while our method exhibits consistently stable performance.
2 Problem Formulation
This section formulates the continuous RL problem, where the agent learns an optimal parametrized policy to control an unknown continuous-time stochastic system to maximize a reward functional over a finite time horizon.
Let the state space be and the action space be an open set . For each non-anticipative -valued control (action) process , consider the associated state process governed by the following dynamics:
(2.1) |
where is the initial distribution, is an -dimensional Brownian motion on a filtered probability space , and , are continuous functions. The reward functional of is given by
(2.2) |
where is a discount factor, and and are continuous functions, representing the running and terminal rewards, respectively.
It is well-known that under mild regularity conditions, it suffices to optimize (2.2) over control processes generated by Markov policies [Kurtz and Stockbridge, 1998]. Given a Markov policy , the associated state process evolves according to the dynamics:
(2.3) |
The agent aims to maximize the following reward
(2.4) |
over all admissible policies . Importantly, the agent does not have access to the coefficients , , and . Instead, the agent directly interacts with the system in Eq. 2.3 using different actions, and refines her strategy based on observed state and reward trajectories. We emphasize that in this paper, we directly optimize (2.4) over deterministic policies, which map the state space directly to the action space, rather than over stochastic policies as studied in Jia and Zhou [2022b, 2023]; Zhao et al. [2023], which map the state space to probability measures over the action space (see Sec. 3.3).
To solve Eq. 2.4, a practical approach is to restrict the optimization to a sufficiently rich class of parameterized policies. More precisely, given a class of policies parameterized by , we consider the following maximization problem:
(2.5) |
where denotes the state process controlled by . Throughout this paper, we assume the initial state distribution has a finite second moment, and impose the following regularity conditions on the policy class and model coefficients.
Assumption 1.
There exists such that for all , and and ,
and there exists a locally bounded function such that for all , , and , and .
Asp. 1 holds for all policies parameterized by feedforward neural networks with Lipschitz activations. It ensures that the state dynamics and the objective function are well defined for any .
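For concreteness, the following sketch illustrates how the controlled dynamics Eq. 2.3 can be simulated under a parameterized deterministic policy via an Euler–Maruyama scheme. It is an illustrative sketch only: the drift `b`, diffusion `sigma`, network architecture, and Tanh-bounded actions are placeholder assumptions rather than the implementation used in this paper.

```python
import numpy as np
import torch
import torch.nn as nn

class DeterministicPolicy(nn.Module):
    """pi_theta: (t, x) -> action; a Lipschitz feedforward network (cf. Asp. 1)."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, t, x):
        return self.net(torch.cat([t, x], dim=-1))

def rollout(policy, b, sigma, x0, T=1.0, dt=1e-2):
    """Euler-Maruyama simulation of dX = b(t, X, a) dt + sigma(t, X, a) dW
    with a = pi_theta(t, X); b and sigma are user-supplied callables."""
    x = np.asarray(x0, dtype=np.float64)
    t, path = 0.0, [x]
    for _ in range(int(round(T / dt))):
        with torch.no_grad():
            a = policy(torch.tensor([[t]], dtype=torch.float32),
                       torch.tensor(x[None], dtype=torch.float32)).numpy()[0]
        dw = np.sqrt(dt) * np.random.randn(*x.shape)
        x = x + b(t, x, a) * dt + sigma(t, x, a) * dw  # diagonal-noise convention
        t += dt
        path.append(x)
    return np.stack(path)
```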
3 Main Theoretical Results
We will first characterize the gradient of the objective functional Eq. 2.5 with respect to the policy parameter , using a continuous-time analogue of the discrete-time advantage function. We will then derive a martingale characterization of this continuous-time advantage function and value function, which serves as the foundation of our algorithm design under deterministic policies. All detailed proofs can be found in Appendix B.
3.1 Deterministic policy gradient (DPG) formula
We first introduce a dynamic version of the objective function . For each , define the value function
(3.1) |
Note that . We additionally impose the following differentiability condition on the model parameters and policies with respect to the parameter.
Assumption 2.
For all , and are continuously differentiable. There exists a locally bounded function such that for all and ,
Moreover, for all .
Under Asp. 1, by Itô’s formula, for any given , satisfies the following linear Bellman equation: for all ,
(3.2) |
where is the generator of (2.3) such that for all ,
(3.3) |
with . The following theorem presents the DPG formula for the continuous RL problem.
The proof of Thm. 3.1 follows by quantifying the difference between the value functions corresponding to two policies, and then applying Vitali’s convergence theorem. A similar formula was established in Gobet and Munos [2005] under stronger conditions, namely that the running reward is zero, the diffusion coefficient is uniformly elliptic, and the coefficients are four times continuously differentiable.
Remark 1.
Thm. 3.1 is analogous to the DPG formula for discrete-time Markov decision processes [Silver et al., 2014]. The function plays the role of the advantage function used in discrete-time DPG, and has been referred to as the advantage rate function in Zhao et al. [2023]. To see this, assume , and for any given , consider the discrete-time version of Eq. 2.5:
(3.4) |
where , , and satisfies the following time-discretization of Eq. 2.3:
and are independent standard normal random variables. By the deterministic policy gradient formula [Silver et al., 2014],
(3.5) |
where is the advantage function for Eq. 3.4 normalized by the time step size. As , converges to , as shown in Jia and Zhou [2023]. Sending in Eq. 3.5 yields the continuous-time DPG formula in Thm. 3.1.
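To connect the formula with practice, the snippet below sketches the resulting discretized actor update: ascend the learned advantage(-rate) at the policy's own action, letting automatic differentiation compose the two gradients in Eq. 3.5. This is a hedged illustration with assumed interfaces (`policy`, `advantage_net`), not the paper's exact training code; the factor `dt` only rescales the effective learning rate.

```python
import torch

def dpg_actor_step(policy, advantage_net, t, x, dt, actor_optimizer):
    """One discretized DPG step: ascend A(t, x, pi_theta(t, x)) summed over sampled (t, x)."""
    a = policy(t, x)                                    # differentiable in theta
    actor_loss = -(advantage_net(t, x, a) * dt).mean()  # dt is the Riemann-sum weight
    actor_optimizer.zero_grad()
    actor_loss.backward()       # autograd applies the chain rule grad_a A * grad_theta pi
    actor_optimizer.step()      # only the policy parameters are updated here
    return actor_loss.item()
```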
3.2 Martingale characterization of continuous-time advantage rate function
By Thm. 3.1, implementing the DPG requires computing the advantage rate function in a neighborhood of the policy . The following theorem characterizes the advantage rate function through a martingale criterion.
Theorem 3.2.
Thm. 3.2 establishes sufficient conditions ensuring that the functions and coincide with the value function and the advantage rate function of a given policy , respectively. Eq. 3.6 requires that agrees with the terminal condition at time , and the function satisfies the linear Bellman equation Eq. 3.2 as the true advantage rate . The martingale constraint Eq. 3.7 ensures is the advantage rate function associated with , for all actions in a neighborhood of the policy .
To ensure exploration of the action space, Thm. 3.2 requires that the martingale condition Eq. 3.7 holds for state processes initialized with any action . In practice, one can use an exploration policy to generate these exploratory actions, which are then employed to learn the gradient of the target deterministic policy. This parallels the central role of off-policy algorithms in discrete-time DPG methods [Lillicrap et al., 2015; Haarnoja et al., 2018a].
3.3 Improved efficiency and stability of deterministic policies over stochastic policies
Thm. 3.2 implies that DPG can be estimated both more efficiently and more stably than stochastic policy gradients, since it avoids costly integrations over the action space.
Recall that Jia and Zhou [2022b, 2023]; Zhao et al. [2023] study continuous-time RL with stochastic policies and establish an analogous policy gradient formula based on the corresponding advantage rate function. By incorporating an additional entropy term into the objective, Jia and Zhou [2023] characterizes the advantage rate function analogously to Thm. 3.2, replacing the Bellman condition Eq. 3.6 with
(3.9) |
where is the entropy regularization coefficient, and requiring the martingale constraint Eq. 3.7 to hold for all state dynamics starting at state at time , with actions sampled randomly from at any time partition of . Implementing the criterion Eq. 3.9 requires sampling random actions from the policy to compute the expectation over the action space. This makes policy evaluation substantially more challenging in deep RL, particularly with high-dimensional action spaces or non-Gaussian policies, often resulting in training instability and slow convergence, as observed in our experiments in Sec. 5. In contrast, the Bellman condition Eq. 3.6 for DPG can be straightforwardly implemented using a simple re-parameterization (see Eq. 4.2).
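To make the computational gap concrete, the sketch below shows the Monte Carlo estimator one would need for the action-space expectation appearing in Eq. 3.9 under a stochastic (e.g., Gaussian) policy; the network and distribution interfaces are our own assumptions, not the baseline's code. In contrast, under the re-parameterization of Eq. 4.2 the deterministic condition requires no action sampling at all.

```python
import torch

def eq39_action_expectation(q_net, policy_dist, z, temp, n_samples=16):
    """Monte Carlo estimate of E_{a ~ pi(.|x)}[ q(z, a) - temp * log pi(a|x) ]:
    every state in a minibatch costs n_samples extra critic evaluations."""
    vals = []
    for _ in range(n_samples):
        a = policy_dist.rsample()                     # (batch, act_dim), re-parameterized sample
        logp = policy_dist.log_prob(a).unsqueeze(-1)  # assumes an Independent (diagonal) Gaussian
        vals.append(q_net(z, a) - temp * logp)
    return torch.stack(vals, dim=0).mean(dim=0)
```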
4 Algorithm and Analysis
4.1 Algorithm design
Given the martingale characterization (Thm. 3.2), we now discuss the implementation details in a continuous-time RL framework via deep neural networks. We use to denote the neural networks for value, advantage rate function and policy, respectively.
Martingale loss.
To ensure the martingale condition Eq. 3.7, let . We adopt the following martingale orthogonality conditions (also known as the generalized method of moments): , where is any test function. These conditions are both necessary and sufficient for the martingale property when imposed over all -adapted and square-integrable test processes [Jia and Zhou, 2022a].
In theory, one should consider all possible test functions, which leads to infinitely many equations. For practical implementation, however, it suffices to select a finite number of test functions with special structures. A natural choice is to set or , in which case the martingale orthogonality condition becomes a vector-valued condition. The classic stochastic approximation method [Robbins and Monro, 1951] can be applied to solve the resulting equation:
(4.1)
where is the integration interval and the trajectory is sampled from collected data. Note that the update formula above is also referred to as the semi-gradient TD method in RL [Sutton et al., 1998].
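A schematic implementation of this update is given below; the discounting convention, the form of the multi-step increment, and the network interfaces are simplifying assumptions rather than a verbatim transcription of Eq. 4.1.

```python
import math
import torch

def semi_gradient_td_step(value_net, adv_net, v_opt, a_opt, traj, beta, dt):
    """traj: list of (t_k, z_k, a_k, r_k) along one sampled sub-interval, with t_k a
    float time, z_k the concatenated (time, state) tensor, a_k the action, r_k a float reward."""
    t0, z0, a0, _ = traj[0]
    tm, zm, _, _ = traj[-1]
    # Discretized martingale increment: discounted change of V plus accumulated
    # (reward - advantage rate); note there is no 1/dt scaling (cf. Sec. 4.2).
    delta = math.exp(-beta * (tm - t0)) * value_net(zm) - value_net(z0)
    for t_k, z_k, a_k, r_k in traj[:-1]:
        delta = delta + math.exp(-beta * (t_k - t0)) * (r_k - adv_net(z_k, a_k)) * dt
    delta = delta.detach()                         # frozen target -> semi-gradient
    v_opt.zero_grad(); a_opt.zero_grad()
    # Stochastic approximation with test functions grad_psi V and grad_phi A:
    # psi <- psi + lr * delta * grad_psi V(z0),  phi <- phi + lr * delta * grad_phi A(z0, a0).
    (-(delta * value_net(z0)).mean()).backward()
    (-(delta * adv_net(z0, a0)).mean()).backward()
    v_opt.step(); a_opt.step()
```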
Bellman constraints.
To enforce Eq. 3.6, we re-parameterize the advantage rate function as
(4.2) |
where is a neural network and denotes the current deterministic policy [Tallec et al., 2019].
In practice, it is often challenging to design a neural network structure that directly enforces the terminal value constraint. To address this, we add a penalty term of the form: , where are sampled from collected trajectories.
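A minimal sketch of one way to realize both constraints is given below, following the re-parameterization idea of Tallec et al. [2019]; the exact forms of Eq. 4.2 and of the penalty are not reproduced verbatim from the text, so `q_net`, the quadratic penalty, and its weight should be read as assumptions.

```python
import torch
import torch.nn as nn

class AdvantageRate(nn.Module):
    """A(z, a) = q(z, a) - q(z, pi(z)): the advantage rate vanishes at a = pi(z)
    by construction, so the Bellman-type constraint is built into the parameterization."""
    def __init__(self, q_net, policy):
        super().__init__()
        self.q_net, self.policy = q_net, policy

    def forward(self, z, a):
        return self.q_net(z, a) - self.q_net(z, self.policy(z))

def terminal_penalty(value_net, z_terminal, g_terminal, weight=1.0):
    """Soft penalty |V(T, x_T) - g(x_T)|^2 on terminal states sampled from collected trajectories."""
    return weight * (value_net(z_terminal) - g_terminal).pow(2).mean()
```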
Implementation with discretization.
Let denote the discretization step size. We denote by the concatenation of time and state for compactness. The full procedure of Continuous Time Deep Deterministic Policy Gradient (CT-DDPG) is summarized in Alg. 1.
We employ several training techniques widely used in modern deep RL algorithms such as DDPG and SAC. In particular, we employ a target value network , defined as the exponentially moving average of the value network weights. This technique has been shown to improve training stability in deep RL. (Here we focus on a single target value network, as our primary goal is to study the efficiency of deterministic policies in continuous-time RL; extensions with multiple target networks [Haarnoja et al., 2018b; Fujimoto et al., 2018] can be readily incorporated.) We further adopt a replay buffer to store transitions in order to improve sample efficiency. For exploration, we add independent Gaussian noise to the deterministic policy .
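Schematic versions of two of these components are shown below; the soft-update rate, noise scale, and clipping range are placeholder values rather than the settings used in our experiments.

```python
import torch

def soft_update(target_net, net, tau=0.005):
    """Exponential moving average of weights: w_target <- (1 - tau) * w_target + tau * w."""
    with torch.no_grad():
        for p_t, p in zip(target_net.parameters(), net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)

def exploratory_action(policy, z, noise_std=0.1, low=-1.0, high=1.0):
    """Deterministic action perturbed by independent Gaussian noise, then clipped to the action box."""
    with torch.no_grad():
        a = policy(z)
        return (a + noise_std * torch.randn_like(a)).clamp(low, high)
```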
Multi-step TD.
When training the advantage-rate and value networks, we adopt multiple steps to compute the temporal-difference error (see Eq. 4.1). This differs from most off-policy algorithms, which typically rely on a single transition step. Notably, when , our algorithm reduces to DAU [Tallec et al., 2019, Alg. 2], except that their policy learning rate vanishes as . We highlight that multi-step TD is essential for the empirical success of CT-DDPG; in the next subsection, we theoretically demonstrate that one-step TD inevitably leads to gradient-variance blow-up in the limit of vanishing discretization step, thereby slowing convergence.
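One simple way to choose the number of TD steps, consistent with the analysis in Sec. 4.2 but stated here only as an illustrative rule rather than a quote from Alg. 1, is to keep the physical look-ahead window fixed as the discretization is refined:

```python
def num_td_steps(horizon, dt, m_min=1, m_max=200):
    """Fix the physical look-ahead window `horizon` so that the number of TD steps
    grows roughly as horizon / dt when dt shrinks; m = 1 recovers one-step TD (DAU)."""
    return max(m_min, min(m_max, round(horizon / dt)))
```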
4.2 Issues of One-Step TD in Continuous Time: Variance Blow-up
When training the value function and the advantage function for a given policy (stochastic or deterministic), Temporal Difference algorithms [Haarnoja et al., 2018a; Tallec et al., 2019; Jia and Zhou, 2023] typically use a one-step semi-gradient:
(4.3)
where and with an exploration policy . In practice, however, one has to use stochastic gradient:
(4.4)
Proposition 4.1.
Assume for some , and are not identically zero. Then the variance of stochastic gradient estimator blows up in the sense that:
(4.5) |
(4.6) |
In contrast, Alg. 1 utilizes -step TD loss with (stochastic) semi-gradient (for simplicity of the theoretical analysis, we consider hard update of target, i.e., ):
(4.7) |
(4.8) |
Proposition 4.2.
Under the same assumptions in Prop. 4.1, if , then the expected gradient does not vanish in the sense that
(4.9) |
In addition, the variance of stochastic gradient does not blow up:
(4.10) |
Remark 2 (effect of scaling).
Note that in Eq. 4.7, we omit the factor in contrast to Eq. 4.3. This modification is crucial for preventing the variance from blowing up. If we were to remove the factor in Eq. 4.3, then according to Prop. 4.1 the expected gradient would vanish as . This theoretical inconsistency reveals a fundamental drawback of one-step TD methods in the continuous-time RL framework, which is also verified in our experiments.
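The dichotomy can be seen on a toy scalar model, which we include purely as an illustration of the scaling (not the exact quantities in Prop. 4.1): a one-step TD residual behaves like a drift term of order dt plus a diffusion term of order sqrt(dt), so dividing by dt keeps the mean but inflates the variance like 1/dt, while omitting the division makes the mean itself vanish.

```python
import numpy as np

def one_step_estimator_stats(f=1.0, sigma=0.5, dt=1e-3, n=200_000, scale_by_dt=True):
    """Residual ~ f*dt + sigma*sqrt(dt)*xi with xi ~ N(0, 1).
    Scaled by 1/dt: mean -> f, variance ~ sigma^2 / dt. Unscaled: mean ~ f*dt -> 0."""
    xi = np.random.randn(n)
    residual = f * dt + sigma * np.sqrt(dt) * xi
    est = residual / dt if scale_by_dt else residual
    return est.mean(), est.var()

for dt in (1e-1, 1e-2, 1e-3):
    mean, var = one_step_estimator_stats(dt=dt)
    print(f"dt={dt:.0e}: mean ~ {mean:.3f}, var ~ {var:.1f} (sigma^2/dt = {0.25 / dt:.1f})")
```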
Remark 3 (previous analysis of one-step TD).
Jia and Zhou [2022a] discussed the issues of one-step TD objective
(4.11) |
showing that its minimizer does not converge to the true value function as . However, practical one-step TD methods do not directly optimize Eq. 4.11, but rather employ the semi-gradient update Eq. 4.3. Consequently, the analysis in Jia and Zhou [2022a] does not fully explain the failure of discrete-time RL algorithms under small discretization steps. In contrast, our analysis is consistent with the actual update rule and thus offers theoretical insights that are directly relevant to the design of continuous-time algorithms.
5 Experiments
The goal of our numerical experiments is to evaluate the efficiency of the proposed CT-DDPG algorithm and continuous-time RL framework in terms of convergence speed, training stability and robustness to the discretization step and dynamic noises.
Environments.
We evaluate on a suite of challenging continuous-control benchmarks from Gymnasium [Towers et al., 2024]: Pendulum-v1, HalfCheetah-v5, Hopper-v5, and Walker2d-v5, sweeping the discretization step and dynamic noise levels. To model stochastic dynamics, at each simulator step we sample an i.i.d. Gaussian generalized force and write it to MuJoCo’s qfrc_applied buffer [Todorov et al., 2012], thereby perturbing the equations of motion. More details can be found in Appendix C.
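A sketch of the noise injection is given below; it assumes the Gymnasium wrapper interface and resamples the force once per environment step (the force then persists over that step's physics substeps), while the exact noise scale and scheduling used in our experiments are given in Appendix C.

```python
import numpy as np
import gymnasium as gym

class GeneralizedForceNoise(gym.Wrapper):
    """Writes an i.i.d. Gaussian generalized force to MuJoCo's qfrc_applied buffer
    before every environment step, perturbing the equations of motion."""
    def __init__(self, env, noise_scale=1.0):
        super().__init__(env)
        self.noise_scale = noise_scale

    def step(self, action):
        data = self.env.unwrapped.data  # mjData of the underlying MuJoCo environment
        data.qfrc_applied[:] = self.noise_scale * np.random.randn(data.qfrc_applied.shape[0])
        return self.env.step(action)

# Example usage: env = GeneralizedForceNoise(gym.make("HalfCheetah-v5"), noise_scale=1.0)
```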
Baselines.
We compare against the discrete-time algorithms DDPG [Lillicrap et al., 2015] and SAC [Haarnoja et al., 2018b], as well as a continuous-time algorithm with a stochastic Gaussian policy: q-learning [Jia and Zhou, 2023]. In particular, for q-learning, we adopt two different settings when learning the q-function: the original one-step TD target () in Jia and Zhou [2023], and a multi-step TD extension with as in Alg. 1. This provides a fair comparison between deterministic and stochastic policies in continuous-time RL. We also test DAU [Tallec et al., 2019], i.e., CT-DDPG with , to see the effects of multi-step TD. For each algorithm, we report results averaged over at least three independent runs with different random seeds.

Results.
Figs. 1 and 2 show the average return against the number of training episodes, where the shaded area indicates the standard deviation across different runs. We observe that for most environments, our CT-DDPG attains the best performance among all baselines, and the gap widens as the discretization step decreases and/or the noise level increases. Specifically, we have the following observations:
• As demonstrated in Fig. 1, although the discrete-time algorithms DDPG and SAC perform reasonably well under the standard Gymnasium settings (top row), they degrade substantially when decreases and increases (middle & bottom rows). This stems from the fact that one-step TD updates provide only myopic information under small and noisy dynamics, preventing the Q-function from capturing the long-term structure of the problem.
• For continuous-time RL with a stochastic policy, shown in Fig. 2, q-learning exhibits slow convergence and training instability, due to the difficulty of enforcing the Bellman equation constraint Eq. 3.9. Although q-learning with multi-step TD improves to some extent upon the original q-learning (), it remains unstable across diverse environment settings and underperforms CT-DDPG. This highlights the fundamental limitations of stochastic policies in continuous-time RL.
• To further investigate the effects of multi-step TD, we also test DAU (i.e., CT-DDPG with ) in Fig. 2. It turns out that in the small- and large- regime, DAU converges more slowly. In Fig. 3, we examine the variance-to-squared-norm ratio (NSR) of the stochastic gradients during training. As , the NSR of DAU becomes markedly larger than that of CT-DDPG, consistent with our theory in Sec. 4.2. A large NSR destabilizes the training of the q-function and consequently impedes convergence.


In summary, CT-DDPG exhibits superior performance in terms of convergence speed and stability across most environment settings, verifying the efficiency and robustness of our method.
6 Conclusion
In this paper, we investigate deterministic policy gradient methods to achieve stability and efficiency for deep RL in continuous-time environments, bridging the gap between discrete- and continuous-time algorithms. We develop a rigorous mathematical framework and provide a martingale characterization for DPG. We further demonstrate, for the first time, the theoretical issues of the standard one-step TD method in the continuous-time regime. All our theoretical results are verified through extensive experiments. We hope this work will motivate future research on continuous-time RL.
Acknowledgments
YZ is grateful for support from the Imperial Global Connect Fund, and the CNRS–Imperial Abraham de Moivre International Research Laboratory.
References
- Baird [1994] Leemon C Baird. Reinforcement learning in continuous time: Advantage updating. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94), volume 4, pages 2448–2453. IEEE, 1994.
- Doya [2000] Kenji Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245, 2000.
- Fujimoto et al. [2018] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587–1596. PMLR, 2018.
- Giegrich et al. [2024] Michael Giegrich, Christoph Reisinger, and Yufei Zhang. Convergence of policy gradient methods for finite-horizon exploratory linear-quadratic control problems. SIAM Journal on Control and Optimization, 62(2):1060–1092, 2024.
- Gobet and Munos [2005] Emmanuel Gobet and Rémi Munos. Sensitivity analysis using Itô–Malliavin calculus and martingales, and application to stochastic optimal control. SIAM Journal on Control and Optimization, 43(5):1676–1713, 2005.
- Guo et al. [2025] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- Haarnoja et al. [2018a] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018a.
- Haarnoja et al. [2018b] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018b.
- Jia and Zhou [2022a] Yanwei Jia and Xun Yu Zhou. Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach. Journal of Machine Learning Research, 23(154):1–55, 2022a.
- Jia and Zhou [2022b] Yanwei Jia and Xun Yu Zhou. Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms. Journal of Machine Learning Research, 23(275):1–50, 2022b.
- Jia and Zhou [2023] Yanwei Jia and Xun Yu Zhou. q-learning in continuous time. Journal of Machine Learning Research, 24(161):1–61, 2023.
- Jia et al. [2025] Yanwei Jia, Du Ouyang, and Yufei Zhang. Accuracy of discretely sampled stochastic policies in continuous-time reinforcement learning. arXiv preprint arXiv:2503.09981, 2025.
- Kurtz and Stockbridge [1998] Thomas G Kurtz and Richard H Stockbridge. Existence of Markov controls and characterization of optimal Markov controls. SIAM Journal on Control and Optimization, 36(2):609–653, 1998.
- Lillicrap et al. [2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- Munos [2006] Rémi Munos. Policy gradient in continuous time. Journal of Machine Learning Research, 7:771–791, 2006.
- Park et al. [2021] Seohong Park, Jaekyeom Kim, and Gunhee Kim. Time discretization-invariant safe action repetition for policy gradient methods. Advances in Neural Information Processing Systems, 34:267–279, 2021.
- Robbins and Monro [1951] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
- Schulman et al. [2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897. PMLR, 2015.
- Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Sethi et al. [2025] Deven Sethi, David Šiška, and Yufei Zhang. Entropy annealing for policy mirror descent in continuous time and space. SIAM Journal on Control and Optimization, 63(4):3006–3041, 2025.
- Silver et al. [2014] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pages 387–395. PMLR, 2014.
- Silver et al. [2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- Sutton et al. [1998] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998.
- Tallec et al. [2019] Corentin Tallec, Léonard Blier, and Yann Ollivier. Making deep q-learning methods robust to time discretization. In International Conference on Machine Learning, pages 6096–6104. PMLR, 2019.
- Todorov et al. [2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
- Towers et al. [2024] Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032, 2024.
- Wang et al. [2020] Haoran Wang, Thaleia Zariphopoulou, and Xun Yu Zhou. Reinforcement learning in continuous time and space: A stochastic control approach. Journal of Machine Learning Research, 21(198):1–34, 2020.
- Yildiz et al. [2021] Cagatay Yildiz, Markus Heinonen, and Harri Lähdesmäki. Continuous-time model-based reinforcement learning. In International Conference on Machine Learning, pages 12009–12018. PMLR, 2021.
- Zhang [2017] Jianfeng Zhang. Backward stochastic differential equations. In Backward Stochastic Differential Equations: From Linear to Fully Nonlinear Theory, pages 79–99. Springer, 2017.
- Zhao et al. [2023] Hanyang Zhao, Wenpin Tang, and David Yao. Policy optimization for continuous reinforcement learning. Advances in Neural Information Processing Systems, 36:13637–13663, 2023.
Appendix A Related Work
Discretization-Invariant Algorithms.
Discretization has long been recognized as a central challenge in continuous control and RL [Baird, 1994; Doya, 2000; Munos, 2006]. More recently, Tallec et al. [2019] showed that Q-learning–based approaches collapse as the discretization step becomes small and introduced the concept of the advantage rate function. Yildiz et al. [2021] tackled this issue through a model-based approach for deterministic ODE dynamics using the Neural ODE framework. Park et al. [2021] demonstrated that conventional policy gradient methods suffer from variance blow-up and proposed action-repetition strategies as a remedy. While these methods mitigate discretization sensitivity to some extent, they are restricted to deterministic dynamics and fail to handle stochasticity, a key feature of real-world environments.
Continuous-Time RL with Stochastic Policies.
Beyond addressing discretization sensitivity, another line of work directly considers continuous dynamics driven by stochastic differential equations. Jia and Zhou [2022a, b] introduced a martingale characterization for policy evaluation and developed an actor–critic algorithm in continuous time. Jia and Zhou [2023] studied the continuous-time analogue of the discrete-time advantage function, namely the -function, and proposed a -learning algorithm. Giegrich et al. [2024]; Sethi et al. [2025] extend natural policy gradient methods to the continuous-time setting, and Zhao et al. [2023] further generalize PPO [Schulman et al., 2017] and TRPO [Schulman et al., 2015] methods to continuous time. However, all of these approaches adopt stochastic policies, which require enforcing Bellman equation constraints that are not tractable in deep RL frameworks. In contrast, our method leverages deterministic policies and enforces the Bellman equation via a simple reparameterization trick, enabling stable integration with deep RL.
Theoretical Issues of Discrete-Time RL.
Although many works have empirically observed that standard discrete-time algorithms degrade under small discretization, the theoretical foundations remain underexplored. Munos [2006]; Park et al. [2021] showed that the variance of policy gradient estimators can diverge as . Baird [1994]; Tallec et al. [2019] further demonstrated that the standard Q-function degenerates and collapses to the value function. From the perspective of policy evaluation, Jia and Zhou [2022a] proved that the minimizer of the mean-square TD error does not converge to the true value function. Nevertheless, most discrete-time algorithms rely on semi-gradient updates rather than directly minimizing the mean-square TD error. To the best of our knowledge, there has been no theoretical analysis establishing the failure of standard one-step TD methods in the continuous-time setting.
Appendix B Proofs
Notations.
We denote by the space of continuous functions that are once continuously differentiable in time and twice continuously differentiable in space, and there exists a constant such that for all , . We use to denote the collection of all probability distributions over . For compactness of notation, we denote by the concatenation of time and state . Finally, we use standard to omit constant factors.
B.1 Proof of Thm. 3.1
The following performance difference lemma characterizes the difference between the value functions associated with two different policies, and will be used to prove the policy gradient formula.
Proposition B.1.
Suppose Asp. 1 holds. Let and assume . For all and ,
(B.1) |
Proof of Prop. B.1.
Observe that under Asp. 1, for each , and ,
(B.2) |
where satisfies for all ,
(B.3) |
Fix . Denote by and for simplicity. Then
(B.4) |
where the last identity used the fact that and . As , applying Itô’s formula to yields
where the last identity used the PDE Eq. 3.2. This along with Eq. B.4 proves the desired result. ∎
Proof of Thm. 3.1.
Recall that . Hence it suffices to prove for all ,
To this end, for all , let be the solution to the following dynamics:
(B.5) |
For all , by Prop. B.1 and the fundamental theorem of calculus,
(B.6) |
where for all ,
To show the limit of Eq. B.6 as , observe that by Asp. 2 and standard stability analysis of Eq. B.5 (see e.g., [Zhang, 2017, Theorem 3.2.4]), for all ,
which along with the growth condition in Asp. 1 and the regularity of , and in Asp. 2, and the dominated convergence theorem shows that
(B.7) |
Moreover, there exists such that for all , and ,
where the last inequality used the growth conditions on the derivatives of the coefficients and , and of the value function . Using the moment condition , the random variables are uniformly integrable. Hence using Vitali’s convergence theorem and passing in Eq. B.6 yield the desired identity. ∎
B.2 Proof of Thm. 3.2
Proof.
For all and , applying Itô’s formula to yields for all ,
(B.8)
This along with the martingale condition Eq. 3.7 implies
is a martingale, which has continuous paths and finite variation. Hence almost surely
(B.9) |
We claim for all and . To see this, define for all . By assumptions, . Suppose there exist and such that . Due to the continuity of , we can assume without loss of generality that and . The continuity of implies that there exist constants such that for all with . Now consider the process defined by (3.8), and define the stopping time
Note that almost surely, due to the sample path continuity of and the condition . This along with (B.9) implies that there exists a measure zero set such that for all , , and
However, by the definition of , for all , , which along with the choice of implies and hence
This yields a contradiction, and proves for all and .
Now by Eq. 3.6, for all ,
Since satisfies the same PDE, the Feynman-Kac formula shows that for all . This subsequently implies for all and . ∎
B.3 Proof of Prop. 4.1
Proof.
By Itô’s formula,
(B.10)
Note that the last term is a martingale and thus vanishes after taking expectation. Therefore the semi-gradient can be rewritten as
(B.11) |
When the discretization step goes to zero, the integral ① admits a first-order expansion, which leads to
(B.12) |
Similarly we have
(B.13) |
On the other hand, consider the conditional variance of stochastic gradient:
(B.14) |
Note that
(B.15) |
and . This yields
(B.16) |
According to Itô isometry,
(B.17) |
(B.18) |
and the cross term can be controlled by Cauchy-Schwarz:
(B.19) |
These estimates show that, as , the leading contribution to the variance comes from the stochastic integral term ②. As a result, by combining Fatou’s Lemma and Eq. B.14, we conclude that
(B.20)
∎
B.4 Proof of Prop. 4.2
Proof.
We begin by recalling that, for any horizon , Itô’s formula yields,
(B.21)
Now consider the case where is fixed while . In this regime, the estimator can be expressed as
(B.22)
The integral is taken over a fixed interval of length , and thus this expression is bounded and will not vanish.
We next turn to the variance. Expanding the definition of and using Jensen’s inequality, we obtain
(B.23)
This is because all terms are bounded. ∎
Appendix C Experiment Details
Model architecture.
Across all experiments, the policy, Q-network, and value network are implemented as three-layer fully connected MLPs with ReLU activations. The hidden dimension is set to 400, except for Pendulum, where we use 64. To incorporate time information, we augment the environment observations with a sinusoidal embedding, yielding , where denotes the maximum horizon. For stochastic policies, we employ Gaussian policies with mean and variance parameterized by neural networks, and fix the entropy coefficient to .
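A minimal sketch of the time augmentation is shown below; the specific sinusoidal frequencies are an assumption, and `t_max` denotes the maximum horizon.

```python
import numpy as np

def time_features(t, t_max):
    """Sinusoidal embedding of normalized time in [0, t_max]."""
    phase = 2.0 * np.pi * t / t_max
    return np.array([np.sin(phase), np.cos(phase)], dtype=np.float32)

def augment_observation(obs, t, t_max):
    """Concatenate the time embedding to the raw environment observation."""
    return np.concatenate([np.asarray(obs, dtype=np.float32), time_features(t, t_max)])
```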
Environment setup.
To accelerate training, we run 8 environments in parallel, collecting 8 trajectories per episode. The discount rate is set to , applied in the form . For MuJoCo environments, we set terminate_when_unhealthy=False.
Training hyperparameters.
We use the Adam optimizer with a learning rate of for all networks ( for Pendulum), and a batch size of . The update frequency is in the original environment and for smaller step sizes . The soft target update parameter is . The weight for the terminal value constraint is . For CT-DDPG, the trajectory length is sampled uniformly from , and we use exploration noise with standard deviation . For q-learning, for each state in the minibatch, we sample actions from and compute the penalty term .
