Optimism as Risk-Seeking in Multi-Agent Reinforcement Learning

Runyu Zhang$^{1}$, Na Li$^{2}$, Asuman Ozdaglar$^{1}$, Jeff Shamma$^{3}$, Gioele Zardini$^{1}$

$^{1}$Runyu Zhang, Asuman Ozdaglar, and Gioele Zardini are with the Laboratory for Information & Decision Systems, Massachusetts Institute of Technology, {runyuzha, asuman, gzardini}@mit.edu. $^{2}$Na Li is with Harvard University, nali@seas.harvard.edu. $^{3}$Jeff Shamma is with the University of Illinois at Urbana-Champaign, jshamma@illinois.edu.
Abstract

Risk sensitivity has become a central theme in reinforcement learning (RL), where convex risk measures and robust formulations provide principled ways to model preferences beyond expected return. Recent extensions to multi-agent RL (MARL) have largely emphasized the risk-averse setting, prioritizing robustness to uncertainty. In cooperative MARL, however, such conservatism often leads to suboptimal equilibria, and a parallel line of work has shown that optimism can promote cooperation. Existing optimistic methods, though effective in practice, are typically heuristic and lack theoretical grounding. Building on the dual representation for convex risk measures, we propose a principled framework that interprets risk-seeking objectives as optimism. We introduce optimistic value functions, which formalize optimism as divergence-penalized risk-seeking evaluations. From this foundation, we derive a policy-gradient theorem for optimistic value functions, including explicit formulas for the entropic risk/KL-penalty setting, and develop decentralized optimistic actor-critic algorithms that implement these updates. Empirical results on cooperative benchmarks demonstrate that risk-seeking optimism consistently improves coordination over both risk-neutral baselines and heuristic optimistic methods. Our framework thus unifies risk-sensitive learning and optimism, offering a theoretically grounded and practically effective approach to cooperation in MARL.

I Introduction

Decision-making under uncertainty is central to reinforcement learning (RL). Standard RL optimizes expected return and is therefore risk-neutral, an assumption that is often too restrictive in risk-critical domains. To overcome this, a rich body of work has studied risk-sensitive and robust RL. Risk-sensitive RL leverages convex risk measures [1] such as entropic risk and CVaR to shape Bellman updates and encode preferences over full return distributions [2, 3, 4, 5], while robust RL models distributional uncertainty via ambiguity sets or divergence-regularized formulations [6, 7, 8, 9, 10]. These two perspectives are closely linked: risk-sensitive and robust Markov Decision Processes (MDPs) can be shown equivalent under dual formulations [2, 11, 12, 13], yielding both theoretical guarantees and practical policy-gradient algorithms.

In multi-agent RL (MARL), risk-sensitive and robust formulations have also been explored, often addressing not only environment uncertainty but also unpredictable behavior of other agents [14, 15, 16, 17]. Most of these works, however, adopt a risk-averse stance, prioritizing safety at the expense of cooperation. In cooperative tasks, this conservatism can prevent agents from reaching optimal joint solutions (cf. [18, 19, 20, 21]). This limitation is exemplified by the problem of relative overgeneralization (RO) [18], where agents converge to suboptimal equilibria by adapting too cautiously to noisy or misaligned partner behavior. To overcome RO, a parallel line of research has explored optimism or risk-seeking as a driver of cooperation. Early approaches such as distributed $Q$-learning [22], hysteretic $Q$-learning [19], and lenient learning [23, 24] softened negative updates to preserve cooperative actions. Later methods such as FMQ learning [25] and optimistic exploration [26, 27] reinforced this idea. More recently, optimistic policy-gradient algorithms (e.g., optimistic PPO variants) have been proposed [21], biasing updates toward higher-value estimates to avoid being overly pessimistic about joint actions early on, thereby mitigating relative overgeneralization while preserving optimality at convergence.

Despite their empirical success, most of these optimistic MARL methods are heuristic. Optimism is typically introduced through asymmetric learning rates [19], ad hoc return reshaping [25], or tailored initialization strategies. What is missing is a principled framework that explains optimism mathematically and connects it to established concepts in risk-sensitive learning. This motivates our central question: Can risk-seeking objectives provide a rigorous mechanism for optimism in MARL, thereby promoting cooperation in a theoretically grounded way?

Statement of contribution

This work establishes a principled foundation for optimism in MARL by introducing optimistic value functions derived from convex risk measures. We show that divergence-penalized dual formulations naturally instantiate optimism as a controlled form of risk-seeking, unifying heuristic optimistic updates with risk-sensitive RL theory. Building on this framework, we develop a policy-gradient theorem for optimistic value functions, including explicit sample-based formulas in the entropic risk/KL-penalty setting. We then design decentralized optimistic actor-critic algorithms that realize these updates in practice and validate them empirically on cooperative benchmarks, where they achieve more reliable coordination than both risk-neutral baselines and existing heuristic optimistic methods.

II Preliminaries

This section sets up notation for MDPs and their multi-agent extensions, introduces the optimistic value functions that are central to our work, and recalls the basic properties of convex risk measures.

II-A Markov Decision Processes (MDPs)

A finite MDP is a tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},P,r,\gamma,\rho)$, consisting of a finite state set $\mathcal{S}$, a finite action set $\mathcal{A}$, a transition probability function $P$ such that $P(s^{\prime}|s,a)$ is the probability of transitioning from state $s$ to state $s^{\prime}$ under action $a$, a reward function $r:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$, a discount factor $\gamma\in(0,1)$, and an initial state distribution $\rho$. A stochastic policy $\pi:\mathcal{S}\to\Delta^{|\mathcal{A}|}$ specifies a strategy where the agent chooses its action based on the current state in a stochastic fashion; specifically, the probability of choosing action $a$ at state $s$ is given by $\Pr(a|s)=\pi(a|s)$. For notational simplicity, we use $\pi_{s}(\cdot)$ as shorthand for $\pi(\cdot|s)$. For a given stationary policy $\pi$, we denote the discounted state visitation distribution by

d^{\pi}(s):=(1-\gamma)\sum_{t=0}^{+\infty}\gamma^{t}\Pr\left(s_{t}=s\;\middle|\;s_{0}\sim\rho,\ a_{\tau}\sim\pi_{s_{\tau}}\right).
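For concreteness, $d^{\pi}$ can be computed exactly in small tabular problems by solving a linear system; the sketch below is our own illustration (not part of the original development), assuming the MDP is stored as numpy arrays P of shape (S, A, S), a policy table pi of shape (S, A), and an initial distribution rho of shape (S,).

```python
import numpy as np

def discounted_visitation(P, pi, rho, gamma):
    """Compute d^pi(s) = (1 - gamma) * rho^T (I - gamma * P_pi)^{-1} for a tabular MDP.

    P:   (S, A, S) array with P[s, a, s'] = P(s'|s, a)
    pi:  (S, A) array with pi[s, a] = pi(a|s)
    rho: (S,) initial state distribution
    """
    S = P.shape[0]
    # State-to-state kernel under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
    P_pi = np.einsum("sa,sap->sp", pi, P)
    # d solves d^T (I - gamma * P_pi) = (1 - gamma) * rho^T
    return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
```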

II-B Multi-agent MDPs

In multi-agent settings the action space factors as $\mathcal{A}=\mathcal{A}_{1}\times\mathcal{A}_{2}\times\cdots\times\mathcal{A}_{n}$, so a joint action $a=(a_{1},a_{2},\dots,a_{n})$ is composed of individual agent actions $a_{i}$. We focus on decentralized product policies

\pi^{\theta}(a|s)=\prod_{i=1}^{n}\pi_{i}^{\theta}(a_{i}|s), (1)

where parameters are grouped as $\theta=(\theta_{1},\theta_{2},\dots,\theta_{n})$. A direct parametrization means that each $\theta_{i}\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}_{i}}$ encodes action probabilities explicitly, i.e.,

\theta_{s,a_{i}}=\pi_{i}(a_{i}|s), (2)

where, for notational simplicity, we abbreviate $\theta_{i}(s,a_{i})$ as $\theta_{s,a_{i}}$.

II-C Optimistic Value Functions

To encourage cooperation in decentralized learning, we evaluate a baseline policy $\pi$ via optimistic value functions and $Q$-functions that allow deviations penalized by a divergence. For auxiliary policies $\{\hat{\pi}_{t}\}_{t\geq 0}$,

V^{\pi}(s)=\max_{\{\widehat{\pi}_{t}\}_{t\geq 0}}\ \mathbb{E}_{s_{t},a_{t}\sim\widehat{\pi}}\left[\sum_{t=0}^{+\infty}\gamma^{t}\left(r(s_{t},a_{t})-D(\widehat{\pi}_{t,s_{t}}||\pi_{s_{t}})\right)\,\middle|\,s_{0}=s\right],
Q^{\pi}(s,a)=r(s,a)+\max_{\{\widehat{\pi}_{t}\}_{t\geq 1}}\ \mathbb{E}_{s_{t},a_{t}\sim\widehat{\pi}}\left[\sum_{t=1}^{+\infty}\gamma^{t}\left(r(s_{t},a_{t})-D(\widehat{\pi}_{t,s_{t}}||\pi_{s_{t}})\right)\,\middle|\,s_{0}=s,\ a_{0}=a\right]. (3)

Here, $D(\hat{\pi}||\pi)$ quantifies the deviation of the auxiliary policy $\hat{\pi}$ from the baseline policy $\pi$, for instance through the KL divergence. The term optimistic reflects the maximization over the auxiliary policy sequence $\{\hat{\pi}_{t}\}$ in the definitions above. Rather than directly evaluating the return of $\pi$, the value functions $V^{\pi}$ and $Q^{\pi}$ incorporate deviations $\hat{\pi}$ that maximize the expected cumulative reward, while paying a penalty proportional to how far $\hat{\pi}$ strays from $\pi$. This construction encourages the agent to envision the best-case performance of $\pi$ under locally improved decisions, yielding an evaluation that is deliberately biased toward potentially higher-performing behaviors. The advantage function is defined as $A^{\pi}(s,a)=Q^{\pi}(s,a)-V^{\pi}(s)$.
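For intuition, consider a limiting case (an observation we add for illustration): if the penalty is the indicator of the baseline, i.e., $D(\widehat{\pi}||\pi)=0$ when $\widehat{\pi}=\pi$ and $+\infty$ otherwise, the maximization in (3) collapses to $\widehat{\pi}_{t}=\pi$ and the optimistic value function reduces to the standard risk-neutral one,

V^{\pi}(s)=\mathbb{E}_{a_{t}\sim\pi}\left[\sum_{t=0}^{+\infty}\gamma^{t}r(s_{t},a_{t})\,\middle|\,s_{0}=s\right],

whereas a finite penalty (such as the KL-based one of Example 1 below) allows a controlled amount of risk-seeking tilt away from $\pi$.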

II-D Convex Risk Measures

Let $\mathcal{F}$ be the set of real-valued functions on a finite action set $\mathcal{A}$. A map $\sigma:\mathcal{F}\to\mathbb{R}$ is a convex risk measure if it satisfies:

  1. Monotonicity: for $f^{\prime},f\in\mathcal{F}$, if $f^{\prime}\leq f$, then $\sigma(f)\leq\sigma(f^{\prime})$.

  2. Translation invariance: for any $f\in\mathcal{F}$ and $m\in\mathbb{R}$, $\sigma(f+m)=\sigma(f)-m$.

  3. Convexity: for any $f^{\prime},f\in\mathcal{F}$ and $\lambda\in[0,1]$, $\sigma(\lambda f+(1-\lambda)f^{\prime})\leq\lambda\sigma(f)+(1-\lambda)\sigma(f^{\prime})$.

By standard duality theory, convex risk measures admit the following representation [1].

Theorem 1 (Dual Representation Theorem [1]).

The function $\sigma:\mathcal{F}\to\mathbb{R}$ is a convex risk measure if and only if there exists a “penalty function” $D(\cdot):\Delta^{|\mathcal{A}|}\to\mathbb{R}$ such that

\sigma(f)=\sup_{\widehat{\pi}\in\Delta^{|\mathcal{A}|}}\left(-\mathbb{E}_{\widehat{\pi}}f-D(\widehat{\pi})\right). (4)

Specifically, $D$ can be written in the following form:

D(\widehat{\pi})=\sup_{f\in\mathcal{F}}\left(-\sigma(f)-\mathbb{E}_{a\sim\widehat{\pi}}f(a)\right). (5)

Note that $\sigma$ and $D$ are Fenchel conjugates of each other. In most cases, a convex risk measure $\sigma(f)$ can be understood as quantifying the risk of a random variable taking values $f(a)$, where $a$ is sampled from some distribution $a\sim\pi$. Thus, commonly used risk measures are naturally tied to an underlying probability distribution $\pi\in\Delta^{|\mathcal{A}|}$ (see, e.g., Example 1). In this work, we focus on such distribution-dependent risk measures, and therefore write $\sigma(\pi,\cdot)$ to emphasize the dependence on $\pi$. Correspondingly, in the dual representation theorem we denote the penalty term $D(\hat{\pi})$ of $\sigma(\pi,\cdot)$ by $D(\hat{\pi}||\pi)$.$^{1}$

$^{1}$The notation $D$ is intentionally overloaded: it denotes both the regularization term in (3) and the penalty function for a risk measure in (4). The connection between these two uses will become clear in subsequent sections.

Here we provide an example of a convex risk measure and its dual form.

Example 1 (Entropic risk measure [1]).

For a given $\beta>0$, the entropic risk measure takes the form:

\sigma(\pi,f)=\beta^{-1}\log\mathbb{E}_{a\sim\pi}e^{-\beta f(a)}.

Its corresponding penalty function $D$ in the dual representation theorem is the KL divergence scaled by $\beta^{-1}$:

D(\widehat{\pi}||\pi)=\beta^{-1}\textup{KL}(\widehat{\pi}||\pi)=\beta^{-1}\sum_{a\in\mathcal{A}}\widehat{\pi}(a)\log\left(\widehat{\pi}(a)/\pi(a)\right).
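As a quick numerical sanity check (our own sketch, not part of the paper), the dual representation (4) can be verified for this pair: the maximizing $\widehat{\pi}$ is the exponentially tilted distribution $\widehat{\pi}(a)\propto\pi(a)e^{-\beta f(a)}$, and plugging it into the penalized objective recovers $\sigma(\pi,f)$.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, n = 2.0, 5
pi = rng.dirichlet(np.ones(n))        # baseline distribution over actions
f = rng.normal(size=n)                # an arbitrary function f(a)

# Primal form: sigma(pi, f) = beta^{-1} log E_{a~pi} exp(-beta f(a))
sigma = np.log(pi @ np.exp(-beta * f)) / beta

# Dual form: sup over pi_hat of  -E_{pi_hat} f - beta^{-1} KL(pi_hat || pi),
# attained at the exponentially tilted distribution pi_hat ∝ pi * exp(-beta f).
pi_hat = pi * np.exp(-beta * f)
pi_hat /= pi_hat.sum()
dual = -(pi_hat @ f) - (pi_hat @ np.log(pi_hat / pi)) / beta

print(sigma, dual)  # the two values agree up to floating-point error
```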

III Bellman Equation and Policy Gradient Theorem for Optimistic Value Functions

This section develops the theoretical foundation for optimizing the optimistic value functions introduced earlier in Eq. 3. We first establish a Bellman equation that characterizes these functions, and then derive a policy-gradient theorem that enables their optimization.

Lemma 1 (Bellman Equation).

The optimistic value functions and $Q$-functions in Eq. 3 satisfy the following Bellman recursion:

V^{\pi}(s)=\max_{\widehat{\pi}}\ \sum_{a}\widehat{\pi}(a|s)Q^{\pi}(s,a)-D(\widehat{\pi}_{s}||\pi_{s})=\sigma(\pi_{s},-Q^{\pi}(s,\cdot)),
Q^{\pi}(s,a)=r(s,a)+\gamma\sum_{s^{\prime}}P(s^{\prime}|s,a)V^{\pi}(s^{\prime}), (6)

where $\sigma(\pi_{s},\cdot)$ is the convex risk measure associated with the penalty $D$ via Eq. 4. Moreover, the auxiliary optimistic distribution $\widehat{\pi}_{t}$ in Eq. 3 can be taken as a stationary policy:

\widehat{\pi}(\cdot|s)=\operatorname*{argmax}_{\widehat{\pi}}\ \sum_{a}\widehat{\pi}(a|s)Q^{\pi}(s,a)-D(\widehat{\pi}_{s}||\pi_{s}). (7)
Proof.

From the definition of $V^{\pi}$ in Eq. 3 and the dual representation theorem, we get

V^{\pi}(s)=\max_{\widehat{\pi}_{0}}\mathbb{E}_{a_{0}\sim\widehat{\pi}_{0}}\bigg[r(s_{0},a_{0})-D(\widehat{\pi}_{0,s_{0}}||\pi_{s_{0}})+\max_{\{\widehat{\pi}_{t}\}_{t\geq 1}}\mathbb{E}_{s_{t},a_{t}\sim\widehat{\pi}}\Big[\sum_{t=1}^{+\infty}\gamma^{t}\big(r(s_{t},a_{t})-D(\widehat{\pi}_{t,s_{t}}||\pi_{s_{t}})\big)\,\Big|\,s_{0}=s\Big]\bigg]
=\max_{\widehat{\pi}_{0}}\mathbb{E}_{a_{0}\sim\widehat{\pi}_{0}}\left[Q^{\pi}(s_{0},a_{0})-D(\widehat{\pi}_{0,s_{0}}||\pi_{s_{0}})\,\Big|\,s_{0}=s\right]
=\max_{\widehat{\pi}}\ \sum_{a}\widehat{\pi}(a|s)Q^{\pi}(s,a)-D(\widehat{\pi}_{s}||\pi_{s})
=\sigma(\pi_{s},-Q^{\pi}(s,\cdot)).

Further, we know that the optimistic $\widehat{\pi}$ is given by

\widehat{\pi}(\cdot|s)=\operatorname*{argmax}_{\widehat{\pi}}\ \sum_{a}\widehat{\pi}(a|s)Q^{\pi}(s,a)-D(\widehat{\pi}_{s}||\pi_{s}).

Similarly, from the definition of $Q^{\pi}$ in (3) we get that

Q^{\pi}(s,a)=r(s,a)+\max_{\{\widehat{\pi}_{t}\}_{t\geq 1}}\mathbb{E}_{s_{t},a_{t}\sim\widehat{\pi}}\Big[\sum_{t=1}^{+\infty}\gamma^{t}\big(r(s_{t},a_{t})-D(\widehat{\pi}_{t,s_{t}}||\pi_{s_{t}})\big)\,\Big|\,s_{0}=s,\ a_{0}=a\Big]
=r(s,a)+\mathbb{E}_{s^{\prime}\sim P(\cdot|s,a)}\max_{\{\widehat{\pi}_{t}\}_{t\geq 1}}\mathbb{E}_{s_{t},a_{t}\sim\widehat{\pi}}\Big[\sum_{t=1}^{+\infty}\gamma^{t}\big(r(s_{t},a_{t})-D(\widehat{\pi}_{t,s_{t}}||\pi_{s_{t}})\big)\,\Big|\,s_{1}=s^{\prime}\Big]
=r(s,a)+\gamma\sum_{s^{\prime}}P(s^{\prime}|s,a)V^{\pi}(s^{\prime}),

which completes the proof. ∎
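To make the recursion concrete, the sketch below (our illustration, assuming a tabular MDP stored as numpy arrays) evaluates the optimistic $V^{\pi}$ and $Q^{\pi}$ for the entropic-risk/KL case by iterating Eq. (6), and also returns the stationary tilted policy of Eq. (7).

```python
import numpy as np

def optimistic_evaluation(P, r, pi, gamma, beta, iters=500):
    """Iterate the Bellman recursion (6) for the entropic-risk / KL-penalty case.

    P: (S, A, S) transitions, r: (S, A) rewards, pi: (S, A) baseline policy.
    """
    S, A = r.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = r + gamma * (P @ V)                    # Q(s,a) = r(s,a) + gamma * E_{s'} V(s')
        # V(s) = sigma(pi_s, -Q(s,.)) = beta^{-1} log E_{a~pi_s} exp(beta * Q(s,a))
        V = np.log(np.einsum("sa,sa->s", pi, np.exp(beta * Q))) / beta
    # Stationary optimistic policy from Eq. (7): pi_hat(a|s) ∝ pi(a|s) exp(beta * Q(s,a))
    pi_hat = pi * np.exp(beta * (Q - V[:, None]))
    pi_hat /= pi_hat.sum(axis=1, keepdims=True)
    return V, Q, pi_hat
```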

We are now ready to state the policy gradient theorem for the optimistic value function as follows:

Theorem 2 (Policy gradient theorem for optimistic value function).

For a parametrized policy $\pi^{\theta}$:

\nabla_{\theta}V^{\pi^{\theta}}(s_{0})=\frac{1}{1-\gamma}\sum_{s}d^{\widehat{\pi}^{\theta}}_{s_{0}}(s)\sum_{a}\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial\pi^{\theta}(a|s)}\,\pi^{\theta}(a|s)\,\nabla_{\theta}\log\pi^{\theta}(a|s),

where $d^{\widehat{\pi}^{\theta}}_{s_{0}}$ is the discounted visitation distribution under the optimistic policy $\hat{\pi}^{\theta}$ defined in (7). In the special case of the entropic risk measure

\sigma(\pi_{s},-Q(s,\cdot))=\beta^{-1}\log\mathbb{E}_{a\sim\pi_{s}}e^{\beta Q(s,a)},

the policy gradient simplifies to

\nabla_{\theta}V^{\pi^{\theta}}(s_{0})=\frac{1}{\beta(1-\gamma)}\sum_{s,a}d^{\widehat{\pi}^{\theta}}_{s_{0}}(s)\,e^{\beta A^{\pi^{\theta}}(s,a)}\,\pi^{\theta}(a|s)\,\nabla_{\theta}\log\pi^{\theta}(a|s),

with $\widehat{\pi}^{\theta}(a|s)\propto\pi^{\theta}(a|s)e^{\beta Q^{\pi^{\theta}}(s,a)}$.

The proof of Theorem 2 (deferred to Appendix A) proceeds by taking the gradient with respect to $\theta$ on both sides of the Bellman equation (6).

Remark 1 (Interpretation).

The Bellman equation above mirrors the risk-neutral case but replaces the expectation under the baseline $\pi$ with a risk-sensitive evaluation under an auxiliary distribution $\hat{\pi}$. This makes the value of $\pi$ optimistic: it is tilted toward actions and trajectories that could yield higher returns, while large deviations from $\pi$ are still penalized. The policy-gradient theorem retains the same structural form as the classical result, but introduces three important differences. First, the visitation distribution is taken under $\hat{\pi}$ rather than $\pi$, emphasizing more favorable trajectories. Second, action weights are exponentially scaled by $e^{\beta A^{\pi}(s,a)}$, amplifying high-advantage actions and inducing a principled risk-seeking bias. Third, as $\beta\to 0$, we recover the standard policy-gradient theorem: $\hat{\pi}\to\pi$ and the exponential factor reduces to the advantage $A^{\pi}(s,a)$. Together, these results establish a rigorous connection between optimism and risk-seeking convex risk measures, providing the foundation for optimistic policy-gradient algorithms.
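To complement the theorem, the sketch below (our illustration; the paper itself moves to the direct parametrization in Section IV) computes the entropic-case gradient exactly for a tabular softmax policy, combining the Bellman recursion (6), the tilted policy $\widehat{\pi}^{\theta}$, and its visitation distribution. The result is the gradient of $\mathbb{E}_{s_{0}\sim\rho}V^{\pi^{\theta}}(s_{0})$ and could be cross-checked against finite differences.

```python
import numpy as np

def entropic_optimistic_policy_gradient(P, r, theta, rho, gamma, beta, iters=500):
    """Exact gradient from Theorem 2 (entropic case) for softmax logits theta of shape (S, A)."""
    S, A = r.shape
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)

    # Optimistic evaluation via the Bellman recursion (6).
    V = np.zeros(S)
    for _ in range(iters):
        Q = r + gamma * (P @ V)
        V = np.log(np.einsum("sa,sa->s", pi, np.exp(beta * Q))) / beta
    Adv = Q - V[:, None]

    # Visitation distribution under the tilted policy pi_hat ∝ pi * exp(beta * Q).
    pi_hat = pi * np.exp(beta * Adv)
    pi_hat /= pi_hat.sum(axis=1, keepdims=True)
    P_hat = np.einsum("sa,sap->sp", pi_hat, P)
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_hat.T, rho)

    # Per-(s,a) weight from Theorem 2: d(s) e^{beta A(s,a)} pi(a|s) / (beta (1 - gamma)).
    w = d[:, None] * np.exp(beta * Adv) * pi / (beta * (1 - gamma))
    # Softmax score: d/dtheta[s,b] log pi(a|s) = 1{a=b} - pi(b|s).
    return w - w.sum(axis=1, keepdims=True) * pi
```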

IV Optimistic Update for Multi-Agent Learning

In this section, we leverage the policy gradient theorem (Theorem 2) to design optimistic RL algorithms for decentralized multi-agent settings. Throughout this section, we assume the direct parametrization (2) of the agents’ policies, and we focus on the case where the penalty function is the KL divergence, i.e., $D(\widehat{\pi}||\pi)=\beta^{-1}\textup{KL}(\widehat{\pi}||\pi)$, so that the associated risk measure $\sigma$ is the entropic risk. Before presenting the update rules, we introduce two key quantities. For each agent $i$, define the averaged optimistic $Q$-function and the averaged optimistic advantage function as

\overline{Q}_{i}^{\pi^{\theta}}(s,a_{i})=\beta^{-1}\mathbb{E}_{a_{-i}\sim\pi_{-i}^{\theta}(\cdot|s)}\,e^{\beta Q^{\pi^{\theta}}(s,a_{i},a_{-i})}, (8)
\overline{A}_{i}^{\pi^{\theta}}(s,a_{i})=\beta^{-1}\mathbb{E}_{a_{-i}\sim\pi_{-i}^{\theta}(\cdot|s)}\,e^{\beta A^{\pi^{\theta}}(s,a_{i},a_{-i})}. (9)
Lemma 2 (Multi-agent optimistic policy gradient).

Consider the decentralized policy (1) with direct parametrization (2). We have that

\frac{\partial V^{\pi^{\theta}}(s_{0})}{\partial\theta_{s,a_{i}}}=\frac{1}{1-\gamma}\,d^{\widehat{\pi}^{\theta}}_{s_{0}}(s)\,\overline{A}_{i}^{\pi^{\theta}}(s,a_{i}). (10)
Proof.

This is a direct corollary of the general Theorem 2, obtained by taking $\pi^{\theta}$ to be the decentralized policy (1) with direct parametrization (2). ∎

Remark 2 (Interpretation).

The quantities $\overline{Q}_{i}^{\pi^{\theta}}$ and $\overline{A}_{i}^{\pi^{\theta}}$ summarize agent $i$’s optimistic evaluation of its own actions, averaged over the behavior of the other agents under the baseline policy $\pi_{-i}^{\theta}$. Intuitively, $\overline{Q}_{i}^{\pi^{\theta}}(s,a_{i})$ measures the risk-seeking value of choosing $a_{i}$, while $\overline{A}_{i}^{\pi^{\theta}}(s,a_{i})$ captures the deviation of that choice from the state value. Both tilt the evaluation toward joint actions that promise higher returns, thus embodying the cooperative bias induced by optimism.
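For a two-agent problem, the averaged quantities (8)-(9) can be computed directly from a joint optimistic $Q$-tensor; the sketch below is our illustration, assuming $Q^{\pi}$ and $V^{\pi}$ have already been evaluated (e.g., with the recursion sketched after Lemma 1).

```python
import numpy as np

def averaged_optimistic_quantities(Q_joint, V, pi_other, beta):
    """Eqs. (8)-(9) for agent i in a two-agent problem.

    Q_joint:  (S, A_i, A_other) joint optimistic Q-values Q(s, a_i, a_{-i})
    V:        (S,) optimistic state values
    pi_other: (S, A_other) baseline policy of the other agent
    """
    # Qbar_i(s, a_i) = beta^{-1} E_{a_{-i} ~ pi_{-i}} exp(beta * Q(s, a_i, a_{-i}))
    Qbar = np.einsum("sij,sj->si", np.exp(beta * Q_joint), pi_other) / beta
    # Abar_i(s, a_i) = beta^{-1} E_{a_{-i} ~ pi_{-i}} exp(beta * A(s, a_i, a_{-i})), with A = Q - V
    Abar = np.einsum("sij,sj->si", np.exp(beta * (Q_joint - V[:, None, None])), pi_other) / beta
    return Qbar, Abar
```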

Remark 3 (Practical approximation).

In practice, computing $d^{\hat{\pi}}$ and $\overline{A}_{i}^{\pi}$ exactly is intractable. A common surrogate is to replace $d^{\widehat{\pi}}(s)\overline{A}_{i}^{\pi}(s,a_{i})$ with $\overline{Q}_{i}^{\pi}(s,a_{i})$ in the update, yielding a gradient-like rule that is easier to estimate from data. Specifically, we consider the following updates for agent $i$. First, the policy-gradient style update:

\theta_{i,s}^{(t+1)}=\textup{Proj}_{\Delta^{\mathcal{A}_{i}}}\Big(\theta_{i,s}^{(t)}+\frac{\eta}{1-\gamma}\overline{Q}_{i}^{\pi^{\theta}}(s,\cdot)\Big). (11)

Second, the policy-iteration/Q-learning style update:

\pi_{i}(s)=\operatorname*{argmax}_{a_{i}}\overline{Q}_{i}^{\pi^{\theta}}(s,a_{i}). (12)

Both formulations are consistent with the optimistic policy-gradient principle and share the same stationary points, since $d^{\hat{\pi}}(s)\overline{A}_{i}^{\pi}(s,a_{i})=\tfrac{d^{\hat{\pi}}(s)}{e^{\beta V^{\pi}(s)}}\overline{Q}_{i}^{\pi}(s,a_{i})$. Thus, the resulting updates can be interpreted as a form of scaled gradient ascent, where the exact policy-gradient direction is preserved up to a positive state-dependent factor, making them more practical to implement in decentralized settings.
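The projection $\textup{Proj}_{\Delta^{\mathcal{A}_{i}}}$ in (11) is not specified further in the text; a natural choice is the Euclidean projection onto the probability simplex, and the sketch below (our illustration under that assumption) implements one update step of (11).

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a real vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / k > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + tau, 0.0)

def optimistic_pg_step(theta_i, Qbar_i, eta, gamma):
    """One step of Eq. (11): theta_{i,s} <- Proj(theta_{i,s} + eta/(1-gamma) * Qbar_i(s, .))."""
    stepped = theta_i + eta / (1.0 - gamma) * Qbar_i   # (S, A_i) direct-parametrization table
    return np.stack([project_to_simplex(row) for row in stepped])
```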

IV-A Decentralized Learning Algorithm

Building on Lemma 2, we now design a sample-based decentralized optimistic RL algorithm. From the update rules in (11)-(12), the main challenge lies in accurately estimating the averaged optimistic $Q$-functions $\overline{Q}_{i}^{\pi}(s,a_{i})$. The following lemma introduces an auxiliary variable that enables a tractable Bellman-style recursion.

Lemma 3 (Decentralized Bellman equation).

Define the auxiliary variable $Z^{\pi}(s):=e^{\beta V^{\pi}(s)}$. If transitions are deterministic, i.e., $P(s^{\prime}|s,a)=\mathbf{1}\{s^{\prime}=f(s,a)\}$, then the averaged optimistic quantities satisfy:

\overline{Q}_{i}^{\pi}(s,a_{i})=\beta^{-1}\mathbb{E}_{a_{-i}\sim\pi_{-i}}\,e^{\beta r(s,a_{i},a_{-i})}\left(Z^{\pi}(f(s,a_{i},a_{-i}))\right)^{\gamma},
Z^{\pi}(s)=\beta\,\mathbb{E}_{a_{i}\sim\pi_{i}}\,\overline{Q}_{i}^{\pi}(s,a_{i}). (13)
Proof.

From the definition of $\overline{Q}_{i}^{\pi}(s,a_{i})$ we have:

\overline{Q}_{i}^{\pi}(s,a_{i})=\beta^{-1}\mathbb{E}_{a_{-i}\sim\pi_{-i}(\cdot|s)}\,e^{\beta Q^{\pi}(s,a_{i},a_{-i})}
=\beta^{-1}\mathbb{E}_{a_{-i}\sim\pi_{-i}(\cdot|s)}\,e^{\beta\left(r(s,a)+\gamma V^{\pi}(f(s,a))\right)}
=\beta^{-1}\mathbb{E}_{a_{-i}\sim\pi_{-i}(\cdot|s)}\,e^{\beta\left(r(s,a)+\gamma\beta^{-1}\log Z^{\pi}(f(s,a))\right)}
=\beta^{-1}\mathbb{E}_{a_{-i}\sim\pi_{-i}}\,e^{\beta r(s,a_{i},a_{-i})}\left(Z^{\pi}(f(s,a_{i},a_{-i}))\right)^{\gamma}.

Further, from the definition of $Z^{\pi}(s)$ we have:

Z^{\pi}(s)=e^{\beta V^{\pi}(s)}=e^{\beta\cdot\beta^{-1}\log\left(\mathbb{E}_{a\sim\pi}e^{\beta Q^{\pi}(s,a)}\right)}
=\mathbb{E}_{a_{i}\sim\pi_{i}}\,\mathbb{E}_{a_{-i}\sim\pi_{-i}}\,e^{\beta Q^{\pi}(s,a_{i},a_{-i})}
=\beta\,\mathbb{E}_{a_{i}\sim\pi_{i}}\,\overline{Q}_{i}^{\pi}(s,a_{i}),

which completes the proof. ∎

Remark 4 (Interpretation).

Lemma 3 shows that the auxiliary variable $Z^{\pi}(s)$ acts as an exponential proxy for the state value. This avoids computing expectations over the full joint action space directly, since each agent can evaluate $\overline{Q}_{i}^{\pi}(s,a_{i})$ by averaging only over the actions of the other agents. In practice, the expectations are replaced with empirical averages from sampled trajectories, leading to a practical decentralized policy-evaluation procedure. This is summarized in Algorithm 1, which iteratively updates both $\overline{Q}_{i}^{\pi}(s,a_{i})$ and $Z^{\pi}(s)$. Once we have reliable estimates of $\overline{Q}_{i}^{\pi}(s,a_{i})$ from this evaluation step, we can improve the agent’s policy using either policy gradient or policy iteration. Algorithm 2 formalizes this process, where the computed $\overline{Q}_{i}^{\pi}(s,a_{i})$ serves as a guide for updating each agent’s policy in a decentralized manner, enabling optimistic, coordinated learning across agents.
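Before turning to samples, note that for deterministic transitions the coupled recursion (13) can also be iterated exactly; the sketch below is our illustration for two agents, assuming the dynamics are given as an array of next-state indices.

```python
import numpy as np

def exact_optimistic_evaluation_det(f, r, pi_i, pi_other, gamma, beta, iters=500):
    """Iterate the coupled recursion (13) for agent i under deterministic transitions.

    f:  (S, A_i, A_other) integer array of next states, f[s, a_i, a_j] = s'
    r:  (S, A_i, A_other) rewards
    pi_i, pi_other: (S, A_i) and (S, A_other) baseline policies
    """
    S = r.shape[0]
    Z = np.ones(S)                       # Z(s) = exp(beta * V(s)), initialized at 1
    for _ in range(iters):
        # Qbar_i(s,a_i) = beta^{-1} E_{a_{-i}} exp(beta * r(s,a)) * Z(f(s,a))^gamma
        Qbar = np.einsum("sij,sj->si", np.exp(beta * r) * Z[f] ** gamma, pi_other) / beta
        # Z(s) = beta * E_{a_i ~ pi_i} Qbar_i(s, a_i)
        Z = beta * np.einsum("si,si->s", pi_i, Qbar)
    return Qbar, Z
```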

Algorithm 1
Agent $i$’s Decentralized Optimistic Policy Evaluation
1: Discount $\gamma\in(0,1)$, stepsize sequence $\{\alpha_{t}\}$, optimistic parameter $\beta>0$, horizon $T_{Q}$, and agent $i$’s policy $\pi_{i}$
2: Initialize $Z^{\pi}(s),\overline{Q}_{i}^{\pi}(s,a_{i})$ for all $s,a_{i}$,
3:   e.g., $Z^{\pi}(s)\leftarrow 1,\ \overline{Q}_{i}^{\pi}(s,a_{i})\leftarrow 0$
4: Sample initial state $s_{0}$
5: for $t=0$ to $T_{Q}-1$ do
6:   Agent $i$ samples action $a_{i,t}\sim\pi_{i}(\cdot\mid s_{t})$, executes it, and observes reward $r_{t}$ and next state $s_{t+1}$
7:   Q-update:
$\overline{Q}_{i}^{\pi}(s_{t},a_{i,t})\leftarrow(1-\alpha_{t})\,\overline{Q}_{i}^{\pi}(s_{t},a_{i,t})+\frac{\alpha_{t}}{\beta}\,e^{\beta r_{t}}\left(Z^{\pi}(s_{t+1})\right)^{\gamma}$
8:   Z-update:
$Z^{\pi}(s_{t})\leftarrow(1-\alpha_{t})\,Z^{\pi}(s_{t})+\alpha_{t}\beta\,\overline{Q}_{i}^{\pi}(s_{t},a_{i,t})$
9: end for
Algorithm 2
Agent $i$’s Decentralized Optimistic Policy Update
1: Initial policy $\pi=\prod_{i=1}^{n}\pi_{i}$
2: for $t=0$ to $T-1$ and each agent $i$ do
3:   Estimate $\overline{Q}_{i}^{\pi}(s,a_{i})$ using optimistic policy evaluation (Algorithm 1)
4:   Policy Gradient:
$\theta_{i,s}^{(t+1)}=\textup{Proj}_{\Delta^{\mathcal{A}_{i}}}\left(\theta_{i,s}^{(t)}+\frac{\eta}{1-\gamma}\overline{Q}_{i}^{\pi}(s,\cdot)\right),\qquad\pi\leftarrow\pi^{\theta^{(t+1)}}$
   or Policy Iteration / Q-learning:
$\pi_{i}(s)\leftarrow\operatorname*{argmax}_{a_{i}}\overline{Q}_{i}^{\pi}(s,a_{i})$
5: end for
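As a complement to the pseudocode, the sketch below gives a minimal tabular rendering of Algorithm 1 together with the policy-iteration update (12) of Algorithm 2 for a single agent. It is our illustration only: the environment interface (reset() returning a state index and step(a_i) returning the next state and reward after all agents act) is an assumption, not an API from the paper.

```python
import numpy as np

def optimistic_policy_evaluation(env, pi_i, S, A_i, gamma, beta, T_Q, alpha=0.1, seed=0):
    """Sample-based Algorithm 1 for agent i (tabular sketch)."""
    rng = np.random.default_rng(seed)
    Z = np.ones(S)                       # proxy for exp(beta * V(s))
    Qbar = np.zeros((S, A_i))            # averaged optimistic Q-function estimate
    s = env.reset()
    for _ in range(T_Q):
        a_i = rng.choice(A_i, p=pi_i[s])
        s_next, r = env.step(a_i)        # other agents act inside env with their own policies
        # Q-update: Qbar(s,a_i) <- (1-alpha) Qbar(s,a_i) + (alpha/beta) e^{beta r} Z(s')^gamma
        Qbar[s, a_i] = (1 - alpha) * Qbar[s, a_i] + (alpha / beta) * np.exp(beta * r) * Z[s_next] ** gamma
        # Z-update: Z(s) <- (1-alpha) Z(s) + alpha * beta * Qbar(s, a_i)
        Z[s] = (1 - alpha) * Z[s] + alpha * beta * Qbar[s, a_i]
        s = s_next
    return Qbar

def greedy_improvement(Qbar):
    """Policy-iteration style update (12): make pi_i greedy with respect to Qbar_i."""
    pi_i = np.zeros_like(Qbar)
    pi_i[np.arange(Qbar.shape[0]), Qbar.argmax(axis=1)] = 1.0
    return pi_i
```

In a decentralized run, every agent executes the same two routines in parallel with its own policy table, mirroring the loop of Algorithm 2.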

V Numerical Results

We now provide empirical evidence for the benefits of optimistic updates. Our experiments include a simple gridworld testbed, which illustrates the effect of optimism in escaping suboptimal equilibria, and a cooperative continuous-control benchmark, which evaluates performance in a more realistic multi-agent setting.

V-A Full-Information Setting

We first evaluate the modified optimistic gradient update (11) in the full-information setting, where $\overline{Q}_{i}^{\pi}$ is computed exactly, on a simple two-agent gridworld. Each agent chooses an action $a_{x},a_{y}\in\{-1,0,1\}$, and the joint state $s=(x,y)$ evolves as $x^{\prime}=\mathrm{mod}(x+a_{x},4)$, $y^{\prime}=\mathrm{mod}(y+a_{y},4)$. At each position, the agents receive the reward $R(x,y)$ shown in Fig. 1. The globally optimal policy is for the agents to move to $(1,1)$ as quickly as possible. Staying at location $(4,4)$, however, is a suboptimal Nash equilibrium: if either agent unilaterally moves toward the high-reward cell while the other stays, the joint state lands in a zero-reward cell, so neither agent benefits from deviating alone. We compare the risk-neutral policy-gradient update with our optimistic variant (11). Figs. 2 and 3 show the normalized state-visitation distributions $d^{\pi}$ of the learned policies (where $\pi$ is the policy found by each algorithm). The risk-neutral algorithm tends to concentrate mass on the suboptimal equilibrium $(4,4)$, achieving a total reward of 120.66. By contrast, the optimistic update assigns most probability to the true optimum $(1,1)$, reaching the globally optimal return of 181.87.

R=\begin{bmatrix}-10&-10&-10&0\\ -10&10&-10&0\\ -10&-10&0&0\\ 0&0&0&5\end{bmatrix}

Figure 1: Reward table $R$
Figure 2: Risk-neutral
Figure 3: Optimistic
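The gridworld itself is simple enough to reconstruct; the sketch below is our reading of the setup (positions assumed 0-indexed, reward taken at the new joint position), with $R$ the table of Fig. 1.

```python
import numpy as np

# Reward table R from Fig. 1 (rows indexed by x, columns by y).
R = np.array([[-10, -10, -10,  0],
              [-10,  10, -10,  0],
              [-10, -10,   0,  0],
              [  0,   0,   0,  5]])
MOVES = (-1, 0, 1)      # each agent's action set

def step(x, y, a_x, a_y):
    """Agent 1 controls x, agent 2 controls y; torus dynamics x' = (x + a_x) mod 4."""
    x_next, y_next = (x + a_x) % 4, (y + a_y) % 4
    return x_next, y_next, R[x_next, y_next]
```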

V-B Sample-based Setting

Figure 4: Cooperative Ball Balancing [19]

We next evaluate the decentralized sample-based algorithm on the cooperative ball-balancing task (Fig. 4), where two agents must keep a ball centered on a flat table they jointly control. This benchmark is known to require coordination and has been widely used to test cooperative MARL methods [19, 28]. Table I compares final returns across 10 random seeds. The optimistic algorithms consistently outperform both decentralized $Q$-learning and hysteretic $Q$-learning. The best-performing setting ($\beta=0.04$) achieves the highest mean return (169.2) with the lowest variance. Performance is robust to moderate changes in $\beta$, showing that optimism need not be finely tuned to be effective. Hysteretic $Q$-learning performs competitively but slightly worse, while decentralized $Q$-learning exhibits higher variance and lower mean performance.

These experiments highlight the advantage of optimistic updates: they not only steer agents away from suboptimal equilibria (gridworld), but also improve both efficiency and stability in challenging cooperative tasks (ball balancing).

TABLE I: Final performance comparison (Mean $\pm$ Std).

Method                        Mean $\pm$ Std
Optimistic $\beta=0.04$       $169.218\pm 1.636$
Optimistic $\beta=0.01$       $168.579\pm 2.013$
Optimistic $\beta=0.003$      $168.079\pm 2.460$
Hysteretic Q-Learning [19]    $167.921\pm 1.994$
Decentralized Q-learning      $167.123\pm 4.131$
Figure 5: Learning curves for different algorithms.

VI Conclusion

In this work, we revisited the role of optimism in MARL through a risk-sensitive lens. By introducing optimistic value functions, we provided a rigorous connection between risk measures and optimism, framing optimism as a principled form of controlled risk-seeking. This perspective led to a policy-gradient theorem for optimistic objectives and to decentralized actor-critic algorithms that implement these ideas in practice.

Our experiments demonstrate that risk-seeking optimistic updates consistently improve coordination compared to risk-neutral and heuristic baselines, highlighting that principled optimism yields both theoretical clarity and practical gains.

Looking ahead, several promising directions remain open. One is to extend our algorithm design to broader families of risk measures and to establish rigorous sample-complexity guarantees for the proposed optimistic algorithms. Another is to adapt our approach for integration with established MARL methods such as MAPPO and QMIX. Finally, applying our framework to high-stakes domains—including autonomous driving, energy management, and distributed robotics—would provide a compelling test of its ability to enhance cooperation in real-world multi-agent systems.

References

  • [1] H. Föllmer and A. Schied, “Convex measures of risk and trading constraints,” Finance and stochastics, vol. 6, pp. 429–447, 2002.
  • [2] A. Ruszczyński, “Risk-averse dynamic programming for Markov decision processes,” Mathematical programming, vol. 125, pp. 235–261, 2010.
  • [3] Y. Chow, A. Tamar, S. Mannor, and M. Pavone, “Risk-sensitive and robust decision-making: a CVaR optimization approach,” Advances in Neural Information Processing Systems, vol. 28, 2015.
  • [4] Y. Fei, Z. Yang, Y. Chen, and Z. Wang, “Exponential Bellman equation and improved regret bounds for risk-sensitive reinforcement learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 20436–20446, 2021.
  • [5] A. Huang, L. Leqi, Z. C. Lipton, and K. Azizzadenesheli, “On the convergence and optimality of policy gradient for Markov coherent risk,” arXiv preprint arXiv:2103.02827, 2021.
  • [6] G. Iyengar, “Robust dynamic programming,” Mathematics of Operations Research, vol. 30, no. 2, pp. 257–280, 2005.
  • [7] A. Nilim and L. El Ghaoui, “Robust control of Markov decision processes with uncertain transition matrices,” Operations Research, vol. 53, no. 5, pp. 780–798, 2005.
  • [8] K. Panaganti and D. Kalathil, “Sample complexity of robust reinforcement learning with a generative model,” in International Conference on Artificial Intelligence and Statistics, pp. 9582–9602, PMLR, 2022.
  • [9] L. Shi and Y. Chi, “Distributionally robust model-based offline reinforcement learning with near-optimal sample complexity,” arXiv preprint arXiv:2208.05767, 2022.
  • [10] L. Shi, G. Li, Y. Wei, Y. Chen, M. Geist, and Y. Chi, “The curious price of distributional robustness in reinforcement learning with a generative model,” Advances in Neural Information Processing Systems, 2023.
  • [11] N. Bäuerle and A. Glauner, “Distributionally robust Markov decision processes and their connection to risk measures,” Mathematics of Operations Research, vol. 47, no. 3, pp. 1757–1780, 2022.
  • [12] T. Osogami, “Robustness and risk-sensitivity in Markov decision processes,” Advances in Neural Information Processing Systems, vol. 25, 2012.
  • [13] R. Zhang, Y. Hu, and N. Li, “Soft robust MDPs and risk-sensitive MDPs: Equivalence, policy gradient, and sample complexity,” in The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024, OpenReview.net, 2024.
  • [14] K. Zhang, T. SUN, Y. Tao, S. Genc, S. Mallya, and T. Basar, “Robust multi-agent reinforcement learning with model uncertainty,” in Advances in Neural Information Processing Systems (H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, eds.), vol. 33, pp. 10571–10583, Curran Associates, Inc., 2020.
  • [15] L. Shi, J. Gai, E. Mazumdar, Y. Chi, and A. Wierman, “Breaking the curse of multiagency in robust multi-agent reinforcement learning,” arXiv preprint arXiv:2409.20067, 2024.
  • [16] N. Lanzetti, S. Fricker, S. Bolognani, F. Dörfler, and D. Paccagnan, “Strategically robust game theory via optimal transport,” 2025.
  • [17] E. Mazumdar, K. Panaganti, and L. Shi, “Tractable multi-agent reinforcement learning through behavioral economics,” in The Thirteenth International Conference on Learning Representations, 2025.
  • [18] R. P. Wiegand, An analysis of cooperative coevolutionary algorithms. George Mason University, 2004.
  • [19] L. Matignon, G. J. Laurent, and N. Le Fort-Piat, “Hysteretic Q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2007.
  • [20] M. G. Bellemare, W. Dabney, and M. Rowland, Distributional reinforcement learning. MIT Press, 2023.
  • [21] W. Zhao, Y. Zhao, Z. Li, J. Kannala, and J. Pajarinen, “Optimistic multi-agent policy gradient,” arXiv preprint arXiv:2311.01953, 2023.
  • [22] M. Lauer and M. A. Riedmiller, “An algorithm for distributed reinforcement learning in cooperative multi-agent systems,” in Proceedings of the Seventeenth International Conference on Machine Learning, ICML ’00, (San Francisco, CA, USA), p. 535–542, Morgan Kaufmann Publishers Inc., 2000.
  • [23] L. Panait, K. Sullivan, and S. Luke, “Lenient learners in cooperative multiagent systems,” in Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS ’06, (New York, NY, USA), p. 801–803, Association for Computing Machinery, 2006.
  • [24] E. Wei and S. Luke, “Lenient learning in independent-learner stochastic cooperative games,” J. Mach. Learn. Res., vol. 17, p. 2914–2955, Jan. 2016.
  • [25] S. Kapetanakis and D. Kudenko, “Reinforcement learning of coordination in cooperative multi-agent systems,” in AAAI/IAAI, 2002.
  • [26] R. Zhang, S. Wang, W. Chen, Y. Zhou, Z. Zhao, Z. Zhang, and R. Zhang, “Optimistic $\epsilon$-greedy exploration for cooperative multi-agent reinforcement learning,” arXiv preprint arXiv:2502.03506, 2025.
  • [27] X. Zhao, Y. Pan, C. Xiao, S. Chandar, and J. Rajendran, “Conditionally optimistic exploration for cooperative deep multi-agent reinforcement learning,” in Uncertainty in Artificial Intelligence, pp. 2529–2540, PMLR, 2023.
  • [28] T. Taniguchi and T. Sawaragi, “Adaptive organization of generalized behavioral concepts for autonomous robots: schema-based modular reinforcement learning,” in 2005 International Symposium on Computational Intelligence in Robotics and Automation, pp. 601–606, 2005.

Appendix A Proof of Theorem 2

Proof.

Taking the gradient with respect to $\theta$ on both sides of (6), we get

\nabla_{\theta}V^{\pi^{\theta}}(s)=\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial\theta}
=\sum_{a}\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial\pi^{\theta}(a|s)}\nabla_{\theta}\pi^{\theta}(a|s)+\sum_{a}\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial Q^{\pi^{\theta}}(s,a)}\nabla_{\theta}Q^{\pi^{\theta}}(s,a)
=\sum_{a}\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial\pi^{\theta}(a|s)}\pi^{\theta}(a|s)\nabla_{\theta}\log\pi^{\theta}(a|s)+\sum_{a}\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial Q^{\pi^{\theta}}(s,a)}\nabla_{\theta}Q^{\pi^{\theta}}(s,a).

Further, from Lemma 6 in [13], we have that $\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial Q^{\pi^{\theta}}(s,a)}=\widehat{\pi}^{\theta}(a|s)$, where $\widehat{\pi}^{\theta}$ is defined as in (7). Substituting this equation into the expressions above, we get

\nabla_{\theta}V^{\pi^{\theta}}(s)=\sum_{a}\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial\pi^{\theta}(a|s)}\pi^{\theta}(a|s)\nabla_{\theta}\log\pi^{\theta}(a|s)+\sum_{a}\widehat{\pi}^{\theta}(a|s)\nabla_{\theta}Q^{\pi^{\theta}}(s,a).

Since \nabla_{\theta}Q^{\pi^{\theta}}(s,a)=\nabla_{\theta}\left(r(s,a)+\gamma\sum_{s^{\prime}}P(s^{\prime}|s,a)V^{\pi^{\theta}}(s^{\prime})\right)=\gamma\sum_{s^{\prime}}P(s^{\prime}|s,a)\nabla_{\theta}V^{\pi^{\theta}}(s^{\prime}),

we have

\nabla_{\theta}V^{\pi^{\theta}}(s)=\sum_{a}\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial\pi^{\theta}(a|s)}\pi^{\theta}(a|s)\nabla_{\theta}\log\pi^{\theta}(a|s)+\gamma\sum_{s^{\prime},a}\widehat{\pi}^{\theta}(a|s)P(s^{\prime}|s,a)\nabla_{\theta}V^{\pi^{\theta}}(s^{\prime}),

i.e.,

\nabla_{\theta}V^{\pi^{\theta}}(s_{0})=\sum_{a}\frac{\partial\sigma(\pi_{s_{0}}^{\theta},-Q^{\pi^{\theta}}(s_{0},\cdot))}{\partial\pi^{\theta}(a|s_{0})}\pi^{\theta}(a|s_{0})\nabla_{\theta}\log\pi^{\theta}(a|s_{0})+\gamma\,\mathbb{E}_{a_{0}\sim\widehat{\pi}^{\theta}(\cdot|s_{0}),\,s_{1}\sim P(\cdot|s_{0},a_{0})}\nabla_{\theta}V^{\pi^{\theta}}(s_{1}).

Applying this equation iteratively we get

\nabla_{\theta}V^{\pi^{\theta}}(s_{0})=\sum_{t=0}^{+\infty}\gamma^{t}\,\mathbb{E}_{s_{t},a_{t}\sim\widehat{\pi}^{\theta},P}\sum_{a}\frac{\partial\sigma(\pi_{s_{t}}^{\theta},-Q^{\pi^{\theta}}(s_{t},\cdot))}{\partial\pi^{\theta}(a|s_{t})}\pi^{\theta}(a|s_{t})\nabla_{\theta}\log\pi^{\theta}(a|s_{t})
=\frac{1}{1-\gamma}\sum_{s}d^{\widehat{\pi}^{\theta}}_{s_{0}}(s)\sum_{a}\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial\pi^{\theta}(a|s)}\pi^{\theta}(a|s)\nabla_{\theta}\log\pi^{\theta}(a|s). (14)

Specifically, when $\sigma(\pi^{\theta}_{s},-Q^{\pi^{\theta}}(s,\cdot))=\beta^{-1}\log\mathbb{E}_{a\sim\pi^{\theta}_{s}}e^{\beta Q^{\pi^{\theta}}(s,a)}$, we have $\widehat{\pi}^{\theta}(a|s)\propto\pi^{\theta}(a|s)e^{\beta Q^{\pi^{\theta}}(s,a)}$, and that

\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial\pi^{\theta}(a|s)}=\beta^{-1}\frac{e^{\beta Q^{\pi^{\theta}}(s,a)}}{\mathbb{E}_{a^{\prime}\sim\pi^{\theta}_{s}}e^{\beta Q^{\pi^{\theta}}(s,a^{\prime})}}.

Additionally, since

V^{\pi^{\theta}}(s)=\sigma(\pi^{\theta}_{s},-Q^{\pi^{\theta}}(s,\cdot))=\beta^{-1}\log\mathbb{E}_{a\sim\pi^{\theta}_{s}}e^{\beta Q^{\pi^{\theta}}(s,a)}\ \Longrightarrow\ \mathbb{E}_{a\sim\pi^{\theta}_{s}}e^{\beta Q^{\pi^{\theta}}(s,a)}=e^{\beta V^{\pi^{\theta}}(s)},

it follows that

\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial\pi^{\theta}(a|s)}=\beta^{-1}\frac{e^{\beta Q^{\pi^{\theta}}(s,a)}}{e^{\beta V^{\pi^{\theta}}(s)}}=\beta^{-1}e^{\beta A^{\pi^{\theta}}(s,a)}.

Substituting this into (14), we get

\nabla_{\theta}V^{\pi^{\theta}}(s_{0})=\frac{1}{\beta(1-\gamma)}\sum_{s,a}d^{\widehat{\pi}^{\theta}}_{s_{0}}(s)\,e^{\beta A^{\pi^{\theta}}(s,a)}\,\pi^{\theta}(a|s)\,\nabla_{\theta}\log\pi^{\theta}(a|s),

which completes the proof. ∎