Optimism as Risk-Seeking in Multi-Agent Reinforcement Learning

Runyu Zhang$^{1}$, Na Li$^{2}$, Asuman Ozdaglar$^{1}$, Jeff Shamma$^{3}$, Gioele Zardini$^{1}$

$^{1}$Runyu Zhang, Asuman Ozdaglar, and Gioele Zardini are with the Laboratory for Information & Decision Systems, Massachusetts Institute of Technology, {runyuzha, asuman, gzardini}@mit.edu. $^{2}$Na Li is with Harvard University, nali@seas.harvard.edu. $^{3}$Jeff Shamma is with the University of Illinois at Urbana-Champaign, jshamma@illinois.edu.
Abstract

Risk sensitivity has become a central theme in reinforcement learning (RL), where convex risk measures and robust formulations provide principled ways to model preferences beyond expected return. Recent extensions to multi-agent RL (MARL) have largely emphasized the risk-averse setting, prioritizing robustness to uncertainty. In cooperative MARL, however, such conservatism often leads to suboptimal equilibria, and a parallel line of work has shown that optimism can promote cooperation. Existing optimistic methods, though effective in practice, are typically heuristic and lack theoretical grounding. Building on the dual representation for convex risk measures, we propose a principled framework that interprets risk-seeking objectives as optimism. We introduce optimistic value functions, which formalize optimism as divergence-penalized risk-seeking evaluations. From this foundation, we derive a policy-gradient theorem for optimistic value functions, including explicit formulas for the entropic risk/KL-penalty setting, and develop decentralized optimistic actor-critic algorithms that implement these updates. Empirical results on cooperative benchmarks demonstrate that risk-seeking optimism consistently improves coordination over both risk-neutral baselines and heuristic optimistic methods. Our framework thus unifies risk-sensitive learning and optimism, offering a theoretically grounded and practically effective approach to cooperation in MARL.

I Introduction

Decision-making under uncertainty is central to reinforcement learning (RL). Standard RL optimizes expected return and is therefore risk-neutral, an assumption that is often too restrictive in risk-critical domains. To overcome this, a rich body of work has studied risk-sensitive and robust RL. Risk-sensitive RL leverages convex risk measures [1] such as entropic risk and CVaR to shape Bellman updates and encode preferences over full return distributions [2, 3, 4, 5], while robust RL models distributional uncertainty via ambiguity sets or divergence-regularized formulations [6, 7, 8, 9, 10]. These two perspectives are closely linked: risk-sensitive and robust Markov Decision Processes (MDPs) can be shown equivalent under dual formulations [2, 11, 12, 13], yielding both theoretical guarantees and practical policy-gradient algorithms.

In multi-agent RL (MARL), risk-sensitive and robust formulations have also been explored, often addressing not only environment uncertainty but also unpredictable behavior of other agents [14, 15, 16, 17]. Most of these works, however, adopt a risk-averse stance, prioritizing safety at the expense of cooperation. In cooperative tasks, this conservatism can prevent agents from reaching optimal joint solutions (cf. [18, 19, 20, 21]). This limitation is exemplified by the problem of relative overgeneralization (RO) [18], where agents converge to suboptimal equilibria by adapting too cautiously to noisy or misaligned partner behavior. To overcome RO, a parallel line of research has explored optimism or risk-seeking as a driver of cooperation. Early approaches such as distributed $Q$-learning [22], hysteretic $Q$-learning [19], and lenient learning [23, 24] softened negative updates to preserve cooperative actions. Later methods such as FMQ learning [25] and optimistic exploration [26, 27] reinforced this idea. More recently, optimistic policy-gradient algorithms (e.g., optimistic PPO variants) have been proposed [21], biasing updates toward higher-value estimates to avoid being overly pessimistic about joint actions early on, thereby mitigating relative overgeneralization while preserving optimality at convergence.

Despite their empirical success, most of these optimistic MARL methods are heuristic. Optimism is typically introduced through asymmetric learning rates [19], ad hoc return reshaping [25], or tailored initialization strategies. What is missing is a principled framework that explains optimism mathematically and connects it to established concepts in risk-sensitive learning. This motivates our central question: Can risk-seeking objectives provide a rigorous mechanism for optimism in MARL, thereby promoting cooperation in a theoretically grounded way?

Statement of contribution

This work establishes a principled foundation for optimism in MARL by introducing optimistic value functions derived from convex risk measures. We show that divergence-penalized dual formulations naturally instantiate optimism as a controlled form of risk-seeking, unifying heuristic optimistic updates with risk-sensitive RL theory. Building on this framework, we develop a policy-gradient theorem for optimistic value functions, including explicit sample-based formulas in the entropic risk/KL-penalty setting. We then design decentralized optimistic actor-critic algorithms that realize these updates in practice and validate them empirically on cooperative benchmarks, where they achieve more reliable coordination than both risk-neutral baselines and existing heuristic optimistic methods.

II Preliminaries

This section sets up notation for MDPs and their multi-agent extensions, introduces the optimistic value functions that are central to our work, and recalls the basic properties of convex risk measures.

II-A Markov Decision Processes (MDPs)

A finite MDP is a tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},P,r,\gamma,\rho)$, consisting of a finite state set $\mathcal{S}$, a finite action set $\mathcal{A}$, a transition probability function $P$ such that $P(s^{\prime}|s,a)$ is the probability of transitioning from state $s$ to state $s^{\prime}$ under action $a$, a reward function $r:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$, a discount factor $\gamma\in(0,1)$, and an initial state distribution $\rho$. A stochastic policy $\pi:\mathcal{S}\to\Delta^{|\mathcal{A}|}$ specifies a strategy where the agent chooses its action based on the current state in a stochastic fashion; specifically, the probability of choosing action $a$ at state $s$ is given by $\Pr(a|s)=\pi(a|s)$. For notational simplicity, we use $\pi_{s}(\cdot)$ as shorthand for $\pi(\cdot|s)$. For a given stationary policy $\pi$, we denote the discounted state visitation distribution by

d^{\pi}(s):=(1-\gamma)\sum_{t=0}^{+\infty}\gamma^{t}\Pr\left(s_{t}=s\;\middle|\;s_{0}\sim\rho,\ a_{\tau}\sim\pi_{s_{\tau}}\right).
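For concreteness, $d^{\pi}$ can be computed exactly in small tabular problems by solving a linear system; the sketch below is our own illustration (not part of the original development), assuming the MDP is stored as numpy arrays P of shape (S, A, S), a policy table pi of shape (S, A), and an initial distribution rho of shape (S,).

```python
import numpy as np

def discounted_visitation(P, pi, rho, gamma):
    """Compute d^pi(s) = (1 - gamma) * rho^T (I - gamma * P_pi)^{-1} for a tabular MDP.

    P:   (S, A, S) array with P[s, a, s'] = P(s'|s, a)
    pi:  (S, A) array with pi[s, a] = pi(a|s)
    rho: (S,) initial state distribution
    """
    S = P.shape[0]
    # State-to-state kernel under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
    P_pi = np.einsum("sa,sap->sp", pi, P)
    # d solves d^T (I - gamma * P_pi) = (1 - gamma) * rho^T
    return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
```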

II-B Multi-agent MDPs

In multi-agent settings the action space factors as $\mathcal{A}=\mathcal{A}_{1}\times\mathcal{A}_{2}\times\cdots\times\mathcal{A}_{n}$, so a joint action $a=(a_{1},a_{2},\dots,a_{n})$ is composed of individual agent actions $a_{i}$. We focus on decentralized product policies

\pi^{\theta}(a|s)=\prod_{i=1}^{n}\pi_{i}^{\theta}(a_{i}|s), (1)

where parameters are grouped as $\theta=(\theta_{1},\theta_{2},\dots,\theta_{n})$. A direct parametrization means that each $\theta_{i}\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}_{i}}$ encodes action probabilities explicitly, i.e.,

\theta_{s,a_{i}}=\pi_{i}(a_{i}|s), (2)

where, for notational simplicity, we abbreviate $\theta_{i}(s,a_{i})$ as $\theta_{s,a_{i}}$.

II-C Optimistic Value Functions

To encourage cooperation in decentralized learning, we evaluate a baseline policy $\pi$ via optimistic value functions and $Q$-functions that allow deviations penalized by a divergence. For auxiliary policies $\{\hat{\pi}_{t}\}_{t\geq 0}$,

V^{\pi}(s)=\max_{\{\widehat{\pi}_{t}\}_{t\geq 0}}\ \mathbb{E}_{s_{t},a_{t}\sim\widehat{\pi}}\left[\sum_{t=0}^{+\infty}\gamma^{t}\left(r(s_{t},a_{t})-D(\widehat{\pi}_{t,s_{t}}||\pi_{s_{t}})\right)\,\middle|\,s_{0}=s\right],
Q^{\pi}(s,a)=r(s,a)+\max_{\{\widehat{\pi}_{t}\}_{t\geq 1}}\ \mathbb{E}_{s_{t},a_{t}\sim\widehat{\pi}}\left[\sum_{t=1}^{+\infty}\gamma^{t}\left(r(s_{t},a_{t})-D(\widehat{\pi}_{t,s_{t}}||\pi_{s_{t}})\right)\,\middle|\,s_{0}=s,\ a_{0}=a\right]. (3)

Here, $D(\hat{\pi}||\pi)$ quantifies the deviation of the auxiliary policy $\hat{\pi}$ from the baseline policy $\pi$, for instance through the KL divergence. The term optimistic reflects the maximization over the auxiliary policy sequence $\{\hat{\pi}_{t}\}$ in the definitions above. Rather than directly evaluating the return of $\pi$, the value functions $V^{\pi}$ and $Q^{\pi}$ incorporate deviations $\hat{\pi}$ that maximize the expected cumulative reward, while paying a penalty proportional to how far $\hat{\pi}$ strays from $\pi$. This construction encourages the agent to envision the best-case performance of $\pi$ under locally improved decisions, yielding an evaluation that is deliberately biased toward potentially higher-performing behaviors. The advantage function is defined as $A^{\pi}(s,a)=Q^{\pi}(s,a)-V^{\pi}(s)$.
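For intuition, consider a limiting case (an observation we add for illustration): if the penalty is the indicator of the baseline, i.e., $D(\widehat{\pi}||\pi)=0$ when $\widehat{\pi}=\pi$ and $+\infty$ otherwise, the maximization in (3) collapses to $\widehat{\pi}_{t}=\pi$ and the optimistic value function reduces to the standard risk-neutral one,

V^{\pi}(s)=\mathbb{E}_{a_{t}\sim\pi}\left[\sum_{t=0}^{+\infty}\gamma^{t}r(s_{t},a_{t})\,\middle|\,s_{0}=s\right],

whereas a finite penalty (such as the KL-based one of Example 1 below) allows a controlled amount of risk-seeking tilt away from $\pi$.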

II-D Convex Risk Measures

Let $\mathcal{F}$ be the set of real-valued functions on a finite action set $\mathcal{A}$. A map $\sigma:\mathcal{F}\to\mathbb{R}$ is a convex risk measure if it satisfies:

  1. Monotonicity: for $f^{\prime},f\in\mathcal{F}$, if $f^{\prime}\leq f$, then $\sigma(f)\leq\sigma(f^{\prime})$.

  2. Translation invariance: for any $f\in\mathcal{F}$ and $m\in\mathbb{R}$, $\sigma(f+m)=\sigma(f)-m$.

  3. Convexity: for any $f^{\prime},f\in\mathcal{F}$ and $\lambda\in[0,1]$, $\sigma(\lambda f+(1-\lambda)f^{\prime})\leq\lambda\sigma(f)+(1-\lambda)\sigma(f^{\prime})$.

By standard duality theory, convex risk measures admit the following representation [1].

Theorem 1 (Dual Representation Theorem [1]).

The function $\sigma:\mathcal{F}\to\mathbb{R}$ is a convex risk measure if and only if there exists a “penalty function” $D(\cdot):\Delta^{|\mathcal{A}|}\to\mathbb{R}$ such that

\sigma(f)=\sup_{\widehat{\pi}\in\Delta^{|\mathcal{A}|}}\left(-\mathbb{E}_{\widehat{\pi}}f-D(\widehat{\pi})\right). (4)

Specifically, $D$ can be written in the following form:

D(\widehat{\pi})=\sup_{f\in\mathcal{F}}\left(-\sigma(f)-\mathbb{E}_{a\sim\widehat{\pi}}f(a)\right). (5)

Note that $\sigma$ and $D$ are Fenchel conjugates of each other. In most cases, a convex risk measure $\sigma(f)$ can be understood as quantifying the risk of a random variable taking values $f(a)$, where $a$ is sampled from some distribution $a\sim\pi$. Thus, commonly used risk measures are naturally tied to an underlying probability distribution $\pi\in\Delta^{|\mathcal{A}|}$ (see, e.g., Example 1). In this work, we focus on such distribution-dependent risk measures, and therefore write $\sigma(\pi,\cdot)$ to emphasize the dependence on $\pi$. Correspondingly, in the dual representation theorem we denote the penalty term $D(\hat{\pi})$ of $\sigma(\pi,\cdot)$ by $D(\hat{\pi}||\pi)$.$^{1}$

$^{1}$The notation $D$ is intentionally overloaded: it denotes both the regularization term in (3) and the penalty function for a risk measure in (4). The connection between these two uses will become clear in subsequent sections.

Here we provide an example of a convex risk measure and its dual form.

Example 1 (Entropic risk measure [1]).

For a given $\beta>0$, the entropic risk measure takes the form:

\sigma(\pi,f)=\beta^{-1}\log\mathbb{E}_{a\sim\pi}e^{-\beta f(a)}.

Its corresponding penalty function $D$ in the dual representation theorem is the KL divergence scaled by $\beta^{-1}$:

D(\widehat{\pi}||\pi)=\beta^{-1}\textup{KL}(\widehat{\pi}||\pi)=\beta^{-1}\sum_{a\in\mathcal{A}}\widehat{\pi}(a)\log\left(\widehat{\pi}(a)/\pi(a)\right).
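As a quick numerical sanity check (our own sketch, not part of the paper), the dual representation (4) can be verified for this pair: the maximizing $\widehat{\pi}$ is the exponentially tilted distribution $\widehat{\pi}(a)\propto\pi(a)e^{-\beta f(a)}$, and plugging it into the penalized objective recovers $\sigma(\pi,f)$.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, n = 2.0, 5
pi = rng.dirichlet(np.ones(n))        # baseline distribution over actions
f = rng.normal(size=n)                # an arbitrary function f(a)

# Primal form: sigma(pi, f) = beta^{-1} log E_{a~pi} exp(-beta f(a))
sigma = np.log(pi @ np.exp(-beta * f)) / beta

# Dual form: sup over pi_hat of  -E_{pi_hat} f - beta^{-1} KL(pi_hat || pi),
# attained at the exponentially tilted distribution pi_hat ∝ pi * exp(-beta f).
pi_hat = pi * np.exp(-beta * f)
pi_hat /= pi_hat.sum()
dual = -(pi_hat @ f) - (pi_hat @ np.log(pi_hat / pi)) / beta

print(sigma, dual)  # the two values agree up to floating-point error
```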

III Bellman Equation and Policy Gradient Theorem for Optimistic Value Functions

This section develops the theoretical foundation for optimizing the optimistic value functions introduced earlier in Eq. 3. We first establish a Bellman equation that characterizes these functions, and then derive a policy-gradient theorem that enables their optimization.

Lemma 1 (Bellman Equation).

The optimistic value functions and $Q$-functions in Eq. 3 satisfy the following Bellman recursion:

V^{\pi}(s)=\max_{\widehat{\pi}}\ \sum_{a}\widehat{\pi}(a|s)Q^{\pi}(s,a)-D(\widehat{\pi}_{s}||\pi_{s})=\sigma(\pi_{s},-Q^{\pi}(s,\cdot)),
Q^{\pi}(s,a)=r(s,a)+\gamma\sum_{s^{\prime}}P(s^{\prime}|s,a)V^{\pi}(s^{\prime}), (6)

where $\sigma(\pi_{s},\cdot)$ is the convex risk measure associated with the penalty $D$ via Eq. 4. Moreover, the auxiliary optimistic distribution $\widehat{\pi}_{t}$ in Eq. 3 can be taken as a stationary policy:

\widehat{\pi}(\cdot|s)=\operatorname*{argmax}_{\widehat{\pi}}\ \sum_{a}\widehat{\pi}(a|s)Q^{\pi}(s,a)-D(\widehat{\pi}_{s}||\pi_{s}). (7)
Proof.

From the definition of $V^{\pi}$ in Eq. 3 and the dual representation theorem, we get

V^{\pi}(s)=\max_{\widehat{\pi}_{0}}\mathbb{E}_{a_{0}\sim\widehat{\pi}_{0}}\bigg[r(s_{0},a_{0})-D(\widehat{\pi}_{0,s_{0}}||\pi_{s_{0}})+\max_{\{\widehat{\pi}_{t}\}_{t\geq 1}}\mathbb{E}_{s_{t},a_{t}\sim\widehat{\pi}}\Big[\sum_{t=1}^{+\infty}\gamma^{t}\big(r(s_{t},a_{t})-D(\widehat{\pi}_{t,s_{t}}||\pi_{s_{t}})\big)\,\Big|\,s_{0}=s\Big]\bigg]
=\max_{\widehat{\pi}_{0}}\mathbb{E}_{a_{0}\sim\widehat{\pi}_{0}}\left[Q^{\pi}(s_{0},a_{0})-D(\widehat{\pi}_{0,s_{0}}||\pi_{s_{0}})\,\Big|\,s_{0}=s\right]
=\max_{\widehat{\pi}}\ \sum_{a}\widehat{\pi}(a|s)Q^{\pi}(s,a)-D(\widehat{\pi}_{s}||\pi_{s})
=\sigma(\pi_{s},-Q^{\pi}(s,\cdot)).

Further, we know that the optimistic $\widehat{\pi}$ is given by

\widehat{\pi}(\cdot|s)=\operatorname*{argmax}_{\widehat{\pi}}\ \sum_{a}\widehat{\pi}(a|s)Q^{\pi}(s,a)-D(\widehat{\pi}_{s}||\pi_{s}).

Similarly, from the definition of $Q^{\pi}$ in (3) we get that

Q^{\pi}(s,a)=r(s,a)+\max_{\{\widehat{\pi}_{t}\}_{t\geq 1}}\mathbb{E}_{s_{t},a_{t}\sim\widehat{\pi}}\Big[\sum_{t=1}^{+\infty}\gamma^{t}\big(r(s_{t},a_{t})-D(\widehat{\pi}_{t,s_{t}}||\pi_{s_{t}})\big)\,\Big|\,s_{0}=s,\ a_{0}=a\Big]
=r(s,a)+\mathbb{E}_{s^{\prime}\sim P(\cdot|s,a)}\max_{\{\widehat{\pi}_{t}\}_{t\geq 1}}\mathbb{E}_{s_{t},a_{t}\sim\widehat{\pi}}\Big[\sum_{t=1}^{+\infty}\gamma^{t}\big(r(s_{t},a_{t})-D(\widehat{\pi}_{t,s_{t}}||\pi_{s_{t}})\big)\,\Big|\,s_{1}=s^{\prime}\Big]
=r(s,a)+\gamma\sum_{s^{\prime}}P(s^{\prime}|s,a)V^{\pi}(s^{\prime}),

which completes the proof. ∎
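To make the recursion concrete, the sketch below (our illustration, assuming a tabular MDP stored as numpy arrays) evaluates the optimistic $V^{\pi}$ and $Q^{\pi}$ for the entropic-risk/KL case by iterating Eq. (6), and also returns the stationary tilted policy of Eq. (7).

```python
import numpy as np

def optimistic_evaluation(P, r, pi, gamma, beta, iters=500):
    """Iterate the Bellman recursion (6) for the entropic-risk / KL-penalty case.

    P: (S, A, S) transitions, r: (S, A) rewards, pi: (S, A) baseline policy.
    """
    S, A = r.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = r + gamma * (P @ V)                    # Q(s,a) = r(s,a) + gamma * E_{s'} V(s')
        # V(s) = sigma(pi_s, -Q(s,.)) = beta^{-1} log E_{a~pi_s} exp(beta * Q(s,a))
        V = np.log(np.einsum("sa,sa->s", pi, np.exp(beta * Q))) / beta
    # Stationary optimistic policy from Eq. (7): pi_hat(a|s) ∝ pi(a|s) exp(beta * Q(s,a))
    pi_hat = pi * np.exp(beta * (Q - V[:, None]))
    pi_hat /= pi_hat.sum(axis=1, keepdims=True)
    return V, Q, pi_hat
```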

We are now ready to state the policy gradient theorem for the optimistic value function as follows:

Theorem 2 (Policy gradient theorem for optimistic value function).

For a parametrized policy $\pi^{\theta}$:

\nabla_{\theta}V^{\pi^{\theta}}(s_{0})=\frac{1}{1-\gamma}\sum_{s}d^{\widehat{\pi}^{\theta}}_{s_{0}}(s)\sum_{a}\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial\pi^{\theta}(a|s)}\,\pi^{\theta}(a|s)\,\nabla_{\theta}\log\pi^{\theta}(a|s),

where $d^{\widehat{\pi}^{\theta}}_{s_{0}}$ is the discounted visitation distribution under the optimistic policy $\hat{\pi}^{\theta}$ defined in (7). In the special case of the entropic risk measure

\sigma(\pi_{s},-Q(s,\cdot))=\beta^{-1}\log\mathbb{E}_{a\sim\pi_{s}}e^{\beta Q(s,a)},

the policy gradient simplifies to

\nabla_{\theta}V^{\pi^{\theta}}(s_{0})=\frac{1}{\beta(1-\gamma)}\sum_{s,a}d^{\widehat{\pi}^{\theta}}_{s_{0}}(s)\,e^{\beta A^{\pi^{\theta}}(s,a)}\,\pi^{\theta}(a|s)\,\nabla_{\theta}\log\pi^{\theta}(a|s),

with $\widehat{\pi}^{\theta}(a|s)\propto\pi^{\theta}(a|s)e^{\beta Q^{\pi^{\theta}}(s,a)}$.

The proof of Theorem 2 (deferred to Appendix A) proceeds by taking the gradient with respect to $\theta$ on both sides of the Bellman equation (6).

Remark 1 (Interpretation).

The Bellman equation above mirrors the risk-neutral case but replaces the expectation under the baseline $\pi$ with a risk-sensitive evaluation under an auxiliary distribution $\hat{\pi}$. This makes the value of $\pi$ optimistic: it is tilted toward actions and trajectories that could yield higher returns, while large deviations from $\pi$ are still penalized. The policy-gradient theorem retains the same structural form as the classical result, but introduces three important differences. First, the visitation distribution is taken under $\hat{\pi}$ rather than $\pi$, emphasizing more favorable trajectories. Second, action weights are exponentially scaled by $e^{\beta A^{\pi}(s,a)}$, amplifying high-advantage actions and inducing a principled risk-seeking bias. Third, as $\beta\to 0$, we recover the standard policy-gradient theorem: $\hat{\pi}\to\pi$ and the exponential factor reduces to the advantage $A^{\pi}(s,a)$. Together, these results establish a rigorous connection between optimism and risk-seeking convex risk measures, providing the foundation for optimistic policy-gradient algorithms.
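To complement the theorem, the sketch below (our illustration; the paper itself moves to the direct parametrization in Section IV) computes the entropic-case gradient exactly for a tabular softmax policy, combining the Bellman recursion (6), the tilted policy $\widehat{\pi}^{\theta}$, and its visitation distribution. The result is the gradient of $\mathbb{E}_{s_{0}\sim\rho}V^{\pi^{\theta}}(s_{0})$ and could be cross-checked against finite differences.

```python
import numpy as np

def entropic_optimistic_policy_gradient(P, r, theta, rho, gamma, beta, iters=500):
    """Exact gradient from Theorem 2 (entropic case) for softmax logits theta of shape (S, A)."""
    S, A = r.shape
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)

    # Optimistic evaluation via the Bellman recursion (6).
    V = np.zeros(S)
    for _ in range(iters):
        Q = r + gamma * (P @ V)
        V = np.log(np.einsum("sa,sa->s", pi, np.exp(beta * Q))) / beta
    Adv = Q - V[:, None]

    # Visitation distribution under the tilted policy pi_hat ∝ pi * exp(beta * Q).
    pi_hat = pi * np.exp(beta * Adv)
    pi_hat /= pi_hat.sum(axis=1, keepdims=True)
    P_hat = np.einsum("sa,sap->sp", pi_hat, P)
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_hat.T, rho)

    # Per-(s,a) weight from Theorem 2: d(s) e^{beta A(s,a)} pi(a|s) / (beta (1 - gamma)).
    w = d[:, None] * np.exp(beta * Adv) * pi / (beta * (1 - gamma))
    # Softmax score: d/dtheta[s,b] log pi(a|s) = 1{a=b} - pi(b|s).
    return w - w.sum(axis=1, keepdims=True) * pi
```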

IV Optimistic Update for Multi-Agent Learning

In this section, we leverage the policy gradient theorem (Theorem 2) to design optimistic RL algorithms for decentralized multi-agent settings. Throughout this section, we assume the direct parametrization (2) of the agents’ policies, and we focus on the case where the penalty function is the KL divergence, i.e., $D(\widehat{\pi}||\pi)=\beta^{-1}\textup{KL}(\widehat{\pi}||\pi)$, so that the associated risk measure $\sigma$ is the entropic risk. Before presenting the update rules, we introduce two key quantities. For each agent $i$, define the averaged optimistic $Q$-function and the averaged optimistic advantage function as

\overline{Q}_{i}^{\pi^{\theta}}(s,a_{i})=\beta^{-1}\mathbb{E}_{a_{-i}\sim\pi_{-i}^{\theta}(\cdot|s)}\,e^{\beta Q^{\pi^{\theta}}(s,a_{i},a_{-i})}, (8)
\overline{A}_{i}^{\pi^{\theta}}(s,a_{i})=\beta^{-1}\mathbb{E}_{a_{-i}\sim\pi_{-i}^{\theta}(\cdot|s)}\,e^{\beta A^{\pi^{\theta}}(s,a_{i},a_{-i})}. (9)
Lemma 2 (Multi-agent optimistic policy gradient).

Consider the decentralized policy (1) with direct parametrization (2). We have that

\frac{\partial V^{\pi^{\theta}}(s_{0})}{\partial\theta_{s,a_{i}}}=\frac{1}{1-\gamma}\,d^{\widehat{\pi}^{\theta}}_{s_{0}}(s)\,\overline{A}_{i}^{\pi^{\theta}}(s,a_{i}). (10)
Proof.

This is a direct corollary of the general Theorem 2, obtained by taking $\pi^{\theta}$ to be the decentralized policy (1) with direct parametrization (2). ∎

Remark 2 (Interpretation).

The quantities $\overline{Q}_{i}^{\pi^{\theta}}$ and $\overline{A}_{i}^{\pi^{\theta}}$ summarize agent $i$’s optimistic evaluation of its own actions, averaged over the behavior of the other agents under the baseline policy $\pi_{-i}^{\theta}$. Intuitively, $\overline{Q}_{i}^{\pi^{\theta}}(s,a_{i})$ measures the risk-seeking value of choosing $a_{i}$, while $\overline{A}_{i}^{\pi^{\theta}}(s,a_{i})$ captures the deviation of that choice from the state value. Both tilt the evaluation toward joint actions that promise higher returns, thus embodying the cooperative bias induced by optimism.
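For a two-agent problem, the averaged quantities (8)-(9) can be computed directly from a joint optimistic $Q$-tensor; the sketch below is our illustration, assuming $Q^{\pi}$ and $V^{\pi}$ have already been evaluated (e.g., with the recursion sketched after Lemma 1).

```python
import numpy as np

def averaged_optimistic_quantities(Q_joint, V, pi_other, beta):
    """Eqs. (8)-(9) for agent i in a two-agent problem.

    Q_joint:  (S, A_i, A_other) joint optimistic Q-values Q(s, a_i, a_{-i})
    V:        (S,) optimistic state values
    pi_other: (S, A_other) baseline policy of the other agent
    """
    # Qbar_i(s, a_i) = beta^{-1} E_{a_{-i} ~ pi_{-i}} exp(beta * Q(s, a_i, a_{-i}))
    Qbar = np.einsum("sij,sj->si", np.exp(beta * Q_joint), pi_other) / beta
    # Abar_i(s, a_i) = beta^{-1} E_{a_{-i} ~ pi_{-i}} exp(beta * A(s, a_i, a_{-i})), with A = Q - V
    Abar = np.einsum("sij,sj->si", np.exp(beta * (Q_joint - V[:, None, None])), pi_other) / beta
    return Qbar, Abar
```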

Remark 3 (Practical approximation).

In practice, computing $d^{\hat{\pi}}$ and $\overline{A}_{i}^{\pi}$ exactly is intractable. A common surrogate is to replace $d^{\widehat{\pi}}(s)\overline{A}_{i}^{\pi}(s,a_{i})$ with $\overline{Q}_{i}^{\pi}(s,a_{i})$ in the update, yielding a gradient-like rule that is easier to estimate from data. Specifically, we consider the following updates for agent $i$. First, the policy-gradient style update:

\theta_{i,s}^{(t+1)}=\textup{Proj}_{\Delta^{\mathcal{A}_{i}}}\Big(\theta_{i,s}^{(t)}+\frac{\eta}{1-\gamma}\overline{Q}_{i}^{\pi^{\theta}}(s,\cdot)\Big). (11)

Second, the policy-iteration/Q-learning style update:

\pi_{i}(s)=\operatorname*{argmax}_{a_{i}}\overline{Q}_{i}^{\pi^{\theta}}(s,a_{i}). (12)

Both formulations are consistent with the optimistic policy-gradient principle and share the same stationary points, since $d^{\hat{\pi}}(s)\overline{A}_{i}^{\pi}(s,a_{i})=\tfrac{d^{\hat{\pi}}(s)}{e^{\beta V^{\pi}(s)}}\overline{Q}_{i}^{\pi}(s,a_{i})$. Thus, the resulting updates can be interpreted as a form of scaled gradient ascent, where the exact policy-gradient direction is preserved up to a positive state-dependent factor, making them more practical to implement in decentralized settings.
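The projection $\textup{Proj}_{\Delta^{\mathcal{A}_{i}}}$ in (11) is not specified further in the text; a natural choice is the Euclidean projection onto the probability simplex, and the sketch below (our illustration under that assumption) implements one update step of (11).

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a real vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / k > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + tau, 0.0)

def optimistic_pg_step(theta_i, Qbar_i, eta, gamma):
    """One step of Eq. (11): theta_{i,s} <- Proj(theta_{i,s} + eta/(1-gamma) * Qbar_i(s, .))."""
    stepped = theta_i + eta / (1.0 - gamma) * Qbar_i   # (S, A_i) direct-parametrization table
    return np.stack([project_to_simplex(row) for row in stepped])
```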

IV-A Decentralized Learning Algorithm

Building on Lemma 2, we now design a sample-based decentralized optimistic RL algorithm. From the update rules in (11)-(12), the main challenge lies in accurately estimating the averaged optimistic $Q$-functions $\overline{Q}_{i}^{\pi}(s,a_{i})$. The following lemma introduces an auxiliary variable that enables a tractable Bellman-style recursion.

Lemma 3 (Decentralized Bellman equation).

Define the auxiliary variable $Z^{\pi}(s):=e^{\beta V^{\pi}(s)}$. If transitions are deterministic, i.e., $P(s^{\prime}|s,a)=\mathbf{1}\{s^{\prime}=f(s,a)\}$, then the averaged optimistic quantities satisfy:

\overline{Q}_{i}^{\pi}(s,a_{i})=\beta^{-1}\mathbb{E}_{a_{-i}\sim\pi_{-i}}\,e^{\beta r(s,a_{i},a_{-i})}\left(Z^{\pi}(f(s,a_{i},a_{-i}))\right)^{\gamma},
Z^{\pi}(s)=\beta\,\mathbb{E}_{a_{i}\sim\pi_{i}}\,\overline{Q}_{i}^{\pi}(s,a_{i}). (13)
Proof.

From the definition of $\overline{Q}_{i}^{\pi}(s,a_{i})$ we have:

\overline{Q}_{i}^{\pi}(s,a_{i})=\beta^{-1}\mathbb{E}_{a_{-i}\sim\pi_{-i}(\cdot|s)}\,e^{\beta Q^{\pi}(s,a_{i},a_{-i})}
=\beta^{-1}\mathbb{E}_{a_{-i}\sim\pi_{-i}(\cdot|s)}\,e^{\beta\left(r(s,a)+\gamma V^{\pi}(f(s,a))\right)}
=\beta^{-1}\mathbb{E}_{a_{-i}\sim\pi_{-i}(\cdot|s)}\,e^{\beta\left(r(s,a)+\gamma\beta^{-1}\log Z^{\pi}(f(s,a))\right)}
=\beta^{-1}\mathbb{E}_{a_{-i}\sim\pi_{-i}}\,e^{\beta r(s,a_{i},a_{-i})}\left(Z^{\pi}(f(s,a_{i},a_{-i}))\right)^{\gamma}.

Further, from the definition of $Z^{\pi}(s)$ we have:

Z^{\pi}(s)=e^{\beta V^{\pi}(s)}=e^{\beta\cdot\beta^{-1}\log\left(\mathbb{E}_{a\sim\pi}e^{\beta Q^{\pi}(s,a)}\right)}
=\mathbb{E}_{a_{i}\sim\pi_{i}}\,\mathbb{E}_{a_{-i}\sim\pi_{-i}}\,e^{\beta Q^{\pi}(s,a_{i},a_{-i})}
=\beta\,\mathbb{E}_{a_{i}\sim\pi_{i}}\,\overline{Q}_{i}^{\pi}(s,a_{i}),

which completes the proof. ∎

Remark 4 (Interpretation).

Lemma 3 shows that the auxiliary variable $Z^{\pi}(s)$ acts as an exponential proxy for the state value. This avoids computing expectations over the full joint action space directly, since each agent can evaluate $\overline{Q}_{i}^{\pi}(s,a_{i})$ by averaging only over the actions of the other agents. In practice, the expectations are replaced with empirical averages from sampled trajectories, leading to a practical decentralized policy-evaluation procedure. This is summarized in Algorithm 1, which iteratively updates both $\overline{Q}_{i}^{\pi}(s,a_{i})$ and $Z^{\pi}(s)$. Once we have reliable estimates of $\overline{Q}_{i}^{\pi}(s,a_{i})$ from this evaluation step, we can improve the agent’s policy using either policy gradient or policy iteration. Algorithm 2 formalizes this process, where the computed $\overline{Q}_{i}^{\pi}(s,a_{i})$ serves as a guide for updating each agent’s policy in a decentralized manner, enabling optimistic, coordinated learning across agents.
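Before turning to samples, note that for deterministic transitions the coupled recursion (13) can also be iterated exactly; the sketch below is our illustration for two agents, assuming the dynamics are given as an array of next-state indices.

```python
import numpy as np

def exact_optimistic_evaluation_det(f, r, pi_i, pi_other, gamma, beta, iters=500):
    """Iterate the coupled recursion (13) for agent i under deterministic transitions.

    f:  (S, A_i, A_other) integer array of next states, f[s, a_i, a_j] = s'
    r:  (S, A_i, A_other) rewards
    pi_i, pi_other: (S, A_i) and (S, A_other) baseline policies
    """
    S = r.shape[0]
    Z = np.ones(S)                       # Z(s) = exp(beta * V(s)), initialized at 1
    for _ in range(iters):
        # Qbar_i(s,a_i) = beta^{-1} E_{a_{-i}} exp(beta * r(s,a)) * Z(f(s,a))^gamma
        Qbar = np.einsum("sij,sj->si", np.exp(beta * r) * Z[f] ** gamma, pi_other) / beta
        # Z(s) = beta * E_{a_i ~ pi_i} Qbar_i(s, a_i)
        Z = beta * np.einsum("si,si->s", pi_i, Qbar)
    return Qbar, Z
```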

Algorithm 1
Agent $i$’s Decentralized Optimistic Policy Evaluation
1: Discount $\gamma\in(0,1)$, stepsize sequence $\{\alpha_{t}\}$, optimistic parameter $\beta>0$, horizon $T_{Q}$, and agent $i$’s policy $\pi_{i}$
2: Initialize $Z^{\pi}(s),\overline{Q}_{i}^{\pi}(s,a_{i})$ for all $s,a_{i}$,
3:   e.g., $Z^{\pi}(s)\leftarrow 1,\ \overline{Q}_{i}^{\pi}(s,a_{i})\leftarrow 0$
4: Sample initial state $s_{0}$
5: for $t=0$ to $T_{Q}-1$ do
6:   Agent $i$ samples action $a_{i,t}\sim\pi_{i}(\cdot\mid s_{t})$, executes it, and observes reward $r_{t}$ and next state $s_{t+1}$
7:   Q-update:
$\overline{Q}_{i}^{\pi}(s_{t},a_{i,t})\leftarrow(1-\alpha_{t})\,\overline{Q}_{i}^{\pi}(s_{t},a_{i,t})+\frac{\alpha_{t}}{\beta}\,e^{\beta r_{t}}\left(Z^{\pi}(s_{t+1})\right)^{\gamma}$
8:   Z-update:
$Z^{\pi}(s_{t})\leftarrow(1-\alpha_{t})\,Z^{\pi}(s_{t})+\alpha_{t}\beta\,\overline{Q}_{i}^{\pi}(s_{t},a_{i,t})$
9: end for
Algorithm 2
Agent $i$’s Decentralized Optimistic Policy Update
1: Initial policy $\pi=\prod_{i=1}^{n}\pi_{i}$
2: for $t=0$ to $T-1$ and each agent $i$ do
3:   Estimate $\overline{Q}_{i}^{\pi}(s,a_{i})$ using optimistic policy evaluation (Algorithm 1)
4:   Policy Gradient:
$\theta_{i,s}^{(t+1)}=\textup{Proj}_{\Delta^{\mathcal{A}_{i}}}\left(\theta_{i,s}^{(t)}+\frac{\eta}{1-\gamma}\overline{Q}_{i}^{\pi}(s,\cdot)\right),\qquad\pi\leftarrow\pi^{\theta^{(t+1)}}$
   or Policy Iteration / Q-learning:
$\pi_{i}(s)\leftarrow\operatorname*{argmax}_{a_{i}}\overline{Q}_{i}^{\pi}(s,a_{i})$
5: end for
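As a complement to the pseudocode, the sketch below gives a minimal tabular rendering of Algorithm 1 together with the policy-iteration update (12) of Algorithm 2 for a single agent. It is our illustration only: the environment interface (reset() returning a state index and step(a_i) returning the next state and reward after all agents act) is an assumption, not an API from the paper.

```python
import numpy as np

def optimistic_policy_evaluation(env, pi_i, S, A_i, gamma, beta, T_Q, alpha=0.1, seed=0):
    """Sample-based Algorithm 1 for agent i (tabular sketch)."""
    rng = np.random.default_rng(seed)
    Z = np.ones(S)                       # proxy for exp(beta * V(s))
    Qbar = np.zeros((S, A_i))            # averaged optimistic Q-function estimate
    s = env.reset()
    for _ in range(T_Q):
        a_i = rng.choice(A_i, p=pi_i[s])
        s_next, r = env.step(a_i)        # other agents act inside env with their own policies
        # Q-update: Qbar(s,a_i) <- (1-alpha) Qbar(s,a_i) + (alpha/beta) e^{beta r} Z(s')^gamma
        Qbar[s, a_i] = (1 - alpha) * Qbar[s, a_i] + (alpha / beta) * np.exp(beta * r) * Z[s_next] ** gamma
        # Z-update: Z(s) <- (1-alpha) Z(s) + alpha * beta * Qbar(s, a_i)
        Z[s] = (1 - alpha) * Z[s] + alpha * beta * Qbar[s, a_i]
        s = s_next
    return Qbar

def greedy_improvement(Qbar):
    """Policy-iteration style update (12): make pi_i greedy with respect to Qbar_i."""
    pi_i = np.zeros_like(Qbar)
    pi_i[np.arange(Qbar.shape[0]), Qbar.argmax(axis=1)] = 1.0
    return pi_i
```

In a decentralized run, every agent executes the same two routines in parallel with its own policy table, mirroring the loop of Algorithm 2.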

V Numerical Results

We now provide empirical evidence for the benefits of optimistic updates. Our experiments include a simple gridworld testbed, which illustrates the effect of optimism in escaping suboptimal equilibria, and a cooperative continuous-control benchmark, which evaluates performance in a more realistic multi-agent setting.

V-A Full-Information Setting

We first evaluate the modified optimistic gradient update (11) in the full-information setting, where $\overline{Q}_{i}^{\pi}$ is computed exactly, on a simple two-agent gridworld. Each agent chooses an action $a_{x},a_{y}\in\{-1,0,1\}$, and the joint state $s=(x,y)$ evolves as $x^{\prime}=\mathrm{mod}(x+a_{x},4)$, $y^{\prime}=\mathrm{mod}(y+a_{y},4)$. At each position, the agents receive the reward $R(x,y)$ shown in Fig. 1. The globally optimal policy is for the agents to move to $(1,1)$ as quickly as possible. Staying at location $(4,4)$, however, is a suboptimal Nash equilibrium: if either agent unilaterally moves toward the high-reward cell while the other stays, the joint state lands in a zero-reward cell, so neither agent benefits from deviating alone. We compare the risk-neutral policy-gradient update with our optimistic variant (11). Figs. 2 and 3 show the normalized state-visitation distributions $d^{\pi}$ of the learned policies (where $\pi$ is the policy found by each algorithm). The risk-neutral algorithm tends to concentrate mass on the suboptimal equilibrium $(4,4)$, achieving a total reward of 120.66. By contrast, the optimistic update assigns most probability to the true optimum $(1,1)$, reaching the globally optimal return of 181.87.

R=\begin{bmatrix}-10&-10&-10&0\\ -10&10&-10&0\\ -10&-10&0&0\\ 0&0&0&5\end{bmatrix}

Figure 1: Reward table $R$
Figure 2: Risk-neutral
Figure 3: Optimistic
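The gridworld itself is simple enough to reconstruct; the sketch below is our reading of the setup (positions assumed 0-indexed, reward taken at the new joint position), with $R$ the table of Fig. 1.

```python
import numpy as np

# Reward table R from Fig. 1 (rows indexed by x, columns by y).
R = np.array([[-10, -10, -10,  0],
              [-10,  10, -10,  0],
              [-10, -10,   0,  0],
              [  0,   0,   0,  5]])
MOVES = (-1, 0, 1)      # each agent's action set

def step(x, y, a_x, a_y):
    """Agent 1 controls x, agent 2 controls y; torus dynamics x' = (x + a_x) mod 4."""
    x_next, y_next = (x + a_x) % 4, (y + a_y) % 4
    return x_next, y_next, R[x_next, y_next]
```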

V-B Sample-based Setting

Figure 4: Cooperative Ball Balancing [19]

We next evaluate the decentralized sample-based algorithm on the cooperative ball-balancing task (Fig. 4), where two agents must keep a ball centered on a flat table they jointly control. This benchmark is known to require coordination and has been widely used to test cooperative MARL methods [19, 28]. Table I compares final returns across 10 random seeds. The optimistic algorithms consistently outperform both decentralized $Q$-learning and hysteretic $Q$-learning. The best-performing setting ($\beta=0.04$) achieves the highest mean return (169.2) with the lowest variance. Performance is robust to moderate changes in $\beta$, showing that optimism need not be finely tuned to be effective. Hysteretic $Q$-learning performs competitively but slightly worse, while decentralized $Q$-learning exhibits higher variance and lower mean performance.

These experiments highlight the advantage of optimistic updates: they not only steer agents away from suboptimal equilibria (gridworld), but also improve both efficiency and stability in challenging cooperative tasks (ball balancing).

TABLE I: Final performance comparison (Mean $\pm$ Std).

Method                        Mean $\pm$ Std
Optimistic $\beta=0.04$       $169.218\pm 1.636$
Optimistic $\beta=0.01$       $168.579\pm 2.013$
Optimistic $\beta=0.003$      $168.079\pm 2.460$
Hysteretic Q-Learning [19]    $167.921\pm 1.994$
Decentralized Q-learning      $167.123\pm 4.131$
Figure 5: Learning curves for different algorithms.

VI Conclusion

In this work, we revisited the role of optimism in MARL through a risk-sensitive lens. By introducing optimistic value functions, we provided a rigorous connection between risk measures and optimism, framing optimism as a principled form of controlled risk-seeking. This perspective led to a policy-gradient theorem for optimistic objectives and to decentralized actor-critic algorithms that implement these ideas in practice.

Our experiments demonstrate that risk-seeking optimistic updates consistently improve coordination compared to risk-neutral and heuristic baselines, highlighting that principled optimism yields both theoretical clarity and practical gains.

Looking ahead, several promising directions remain open. One is to extend our algorithm design to broader families of risk measures and to establish rigorous sample-complexity guarantees for the proposed optimistic algorithms. Another is to adapt our approach for integration with established MARL methods such as MAPPO and QMIX. Finally, applying our framework to high-stakes domains—including autonomous driving, energy management, and distributed robotics—would provide a compelling test of its ability to enhance cooperation in real-world multi-agent systems.

References

  • [1] H. Föllmer and A. Schied, “Convex measures of risk and trading constraints,” Finance and stochastics, vol. 6, pp. 429–447, 2002.
  • [2] A. Ruszczyński, “Risk-averse dynamic programming for Markov decision processes,” Mathematical programming, vol. 125, pp. 235–261, 2010.
  • [3] Y. Chow, A. Tamar, S. Mannor, and M. Pavone, “Risk-sensitive and robust decision-making: a CVaR optimization approach,” Advances in Neural Information Processing Systems, vol. 28, 2015.
  • [4] Y. Fei, Z. Yang, Y. Chen, and Z. Wang, “Exponential Bellman equation and improved regret bounds for risk-sensitive reinforcement learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 20436–20446, 2021.
  • [5] A. Huang, L. Leqi, Z. C. Lipton, and K. Azizzadenesheli, “On the convergence and optimality of policy gradient for Markov coherent risk,” arXiv preprint arXiv:2103.02827, 2021.
  • [6] G. Iyengar, “Robust dynamic programming,” Mathematics of Operations Research, vol. 30, no. 2, pp. 257–280, 2005.
  • [7] A. Nilim and L. El Ghaoui, “Robust control of Markov decision processes with uncertain transition matrices,” Operations Research, vol. 53, no. 5, pp. 780–798, 2005.
  • [8] K. Panaganti and D. Kalathil, “Sample complexity of robust reinforcement learning with a generative model,” in International Conference on Artificial Intelligence and Statistics, pp. 9582–9602, PMLR, 2022.
  • [9] L. Shi and Y. Chi, “Distributionally robust model-based offline reinforcement learning with near-optimal sample complexity,” arXiv preprint arXiv:2208.05767, 2022.
  • [10] L. Shi, G. Li, Y. Wei, Y. Chen, M. Geist, and Y. Chi, “The curious price of distributional robustness in reinforcement learning with a generative model,” Advances in Neural Information Processing Systems, 2023.
  • [11] N. Bäuerle and A. Glauner, “Distributionally robust Markov decision processes and their connection to risk measures,” Mathematics of Operations Research, vol. 47, no. 3, pp. 1757–1780, 2022.
  • [12] T. Osogami, “Robustness and risk-sensitivity in Markov decision processes,” Advances in Neural Information Processing Systems, vol. 25, 2012.
  • [13] R. Zhang, Y. Hu, and N. Li, “Soft robust MDPs and risk-sensitive MDPs: Equivalence, policy gradient, and sample complexity,” in The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024, OpenReview.net, 2024.
  • [14] K. Zhang, T. SUN, Y. Tao, S. Genc, S. Mallya, and T. Basar, “Robust multi-agent reinforcement learning with model uncertainty,” in Advances in Neural Information Processing Systems (H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, eds.), vol. 33, pp. 10571–10583, Curran Associates, Inc., 2020.
  • [15] L. Shi, J. Gai, E. Mazumdar, Y. Chi, and A. Wierman, “Breaking the curse of multiagency in robust multi-agent reinforcement learning,” arXiv preprint arXiv:2409.20067, 2024.
  • [16] N. Lanzetti, S. Fricker, S. Bolognani, F. Dörfler, and D. Paccagnan, “Strategically robust game theory via optimal transport,” 2025.
  • [17] E. Mazumdar, K. Panaganti, and L. Shi, “Tractable multi-agent reinforcement learning through behavioral economics,” in The Thirteenth International Conference on Learning Representations, 2025.
  • [18] R. P. Wiegand, An analysis of cooperative coevolutionary algorithms. George Mason University, 2004.
  • [19] L. Matignon, G. J. Laurent, and N. Le Fort-Piat, “Hysteretic Q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2007.
  • [20] M. G. Bellemare, W. Dabney, and M. Rowland, Distributional reinforcement learning. MIT Press, 2023.
  • [21] W. Zhao, Y. Zhao, Z. Li, J. Kannala, and J. Pajarinen, “Optimistic multi-agent policy gradient,” arXiv preprint arXiv:2311.01953, 2023.
  • [22] M. Lauer and M. A. Riedmiller, “An algorithm for distributed reinforcement learning in cooperative multi-agent systems,” in Proceedings of the Seventeenth International Conference on Machine Learning, ICML ’00, (San Francisco, CA, USA), p. 535–542, Morgan Kaufmann Publishers Inc., 2000.
  • [23] L. Panait, K. Sullivan, and S. Luke, “Lenient learners in cooperative multiagent systems,” in Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS ’06, (New York, NY, USA), p. 801–803, Association for Computing Machinery, 2006.
  • [24] E. Wei and S. Luke, “Lenient learning in independent-learner stochastic cooperative games,” J. Mach. Learn. Res., vol. 17, p. 2914–2955, Jan. 2016.
  • [25] S. Kapetanakis and D. Kudenko, “Reinforcement learning of coordination in cooperative multi-agent systems,” in AAAI/IAAI, 2002.
  • [26] R. Zhang, S. Wang, W. Chen, Y. Zhou, Z. Zhao, Z. Zhang, and R. Zhang, “Optimistic $\epsilon$-greedy exploration for cooperative multi-agent reinforcement learning,” arXiv preprint arXiv:2502.03506, 2025.
  • [27] X. Zhao, Y. Pan, C. Xiao, S. Chandar, and J. Rajendran, “Conditionally optimistic exploration for cooperative deep multi-agent reinforcement learning,” in Uncertainty in Artificial Intelligence, pp. 2529–2540, PMLR, 2023.
  • [28] T. Taniguchi and T. Sawaragi, “Adaptive organization of generalized behavioral concepts for autonomous robots: schema-based modular reinforcement learning,” in 2005 International Symposium on Computational Intelligence in Robotics and Automation, pp. 601–606, 2005.

Appendix A Proof of Theorem 2

Proof.

Taking the gradient with respect to $\theta$ on both sides of (6), we get

\nabla_{\theta}V^{\pi^{\theta}}(s)=\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial\theta}
=\sum_{a}\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial\pi^{\theta}(a|s)}\nabla_{\theta}\pi^{\theta}(a|s)+\sum_{a}\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial Q^{\pi^{\theta}}(s,a)}\nabla_{\theta}Q^{\pi^{\theta}}(s,a)
=\sum_{a}\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial\pi^{\theta}(a|s)}\pi^{\theta}(a|s)\nabla_{\theta}\log\pi^{\theta}(a|s)+\sum_{a}\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial Q^{\pi^{\theta}}(s,a)}\nabla_{\theta}Q^{\pi^{\theta}}(s,a).

Further, from Lemma 6 in [13], we have that $\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial Q^{\pi^{\theta}}(s,a)}=\widehat{\pi}^{\theta}(a|s)$, where $\widehat{\pi}^{\theta}$ is defined as in (7). Substituting this equation into the expressions above, we get

\nabla_{\theta}V^{\pi^{\theta}}(s)=\sum_{a}\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial\pi^{\theta}(a|s)}\pi^{\theta}(a|s)\nabla_{\theta}\log\pi^{\theta}(a|s)+\sum_{a}\widehat{\pi}^{\theta}(a|s)\nabla_{\theta}Q^{\pi^{\theta}}(s,a).

Since \nabla_{\theta}Q^{\pi^{\theta}}(s,a)=\nabla_{\theta}\left(r(s,a)+\gamma\sum_{s^{\prime}}P(s^{\prime}|s,a)V^{\pi^{\theta}}(s^{\prime})\right)=\gamma\sum_{s^{\prime}}P(s^{\prime}|s,a)\nabla_{\theta}V^{\pi^{\theta}}(s^{\prime}),

we have

\nabla_{\theta}V^{\pi^{\theta}}(s)=\sum_{a}\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial\pi^{\theta}(a|s)}\pi^{\theta}(a|s)\nabla_{\theta}\log\pi^{\theta}(a|s)+\gamma\sum_{s^{\prime},a}\widehat{\pi}^{\theta}(a|s)P(s^{\prime}|s,a)\nabla_{\theta}V^{\pi^{\theta}}(s^{\prime}),

i.e.,

\nabla_{\theta}V^{\pi^{\theta}}(s_{0})=\sum_{a}\frac{\partial\sigma(\pi_{s_{0}}^{\theta},-Q^{\pi^{\theta}}(s_{0},\cdot))}{\partial\pi^{\theta}(a|s_{0})}\pi^{\theta}(a|s_{0})\nabla_{\theta}\log\pi^{\theta}(a|s_{0})+\gamma\,\mathbb{E}_{a_{0}\sim\widehat{\pi}^{\theta}(\cdot|s_{0}),\,s_{1}\sim P(\cdot|s_{0},a_{0})}\nabla_{\theta}V^{\pi^{\theta}}(s_{1}).

Applying this equation iteratively we get

\nabla_{\theta}V^{\pi^{\theta}}(s_{0})=\sum_{t=0}^{+\infty}\gamma^{t}\,\mathbb{E}_{s_{t},a_{t}\sim\widehat{\pi}^{\theta},P}\sum_{a}\frac{\partial\sigma(\pi_{s_{t}}^{\theta},-Q^{\pi^{\theta}}(s_{t},\cdot))}{\partial\pi^{\theta}(a|s_{t})}\pi^{\theta}(a|s_{t})\nabla_{\theta}\log\pi^{\theta}(a|s_{t})
=\frac{1}{1-\gamma}\sum_{s}d^{\widehat{\pi}^{\theta}}_{s_{0}}(s)\sum_{a}\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial\pi^{\theta}(a|s)}\pi^{\theta}(a|s)\nabla_{\theta}\log\pi^{\theta}(a|s). (14)

Specifically, when $\sigma(\pi^{\theta}_{s},-Q^{\pi^{\theta}}(s,\cdot))=\beta^{-1}\log\mathbb{E}_{a\sim\pi^{\theta}_{s}}e^{\beta Q^{\pi^{\theta}}(s,a)}$, we have $\widehat{\pi}^{\theta}(a|s)\propto\pi^{\theta}(a|s)e^{\beta Q^{\pi^{\theta}}(s,a)}$, and that

\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial\pi^{\theta}(a|s)}=\beta^{-1}\frac{e^{\beta Q^{\pi^{\theta}}(s,a)}}{\mathbb{E}_{a^{\prime}\sim\pi^{\theta}_{s}}e^{\beta Q^{\pi^{\theta}}(s,a^{\prime})}}.

Additionally, since

V^{\pi^{\theta}}(s)=\sigma(\pi^{\theta}_{s},-Q^{\pi^{\theta}}(s,\cdot))=\beta^{-1}\log\mathbb{E}_{a\sim\pi^{\theta}_{s}}e^{\beta Q^{\pi^{\theta}}(s,a)}\ \Longrightarrow\ \mathbb{E}_{a\sim\pi^{\theta}_{s}}e^{\beta Q^{\pi^{\theta}}(s,a)}=e^{\beta V^{\pi^{\theta}}(s)},

it follows that

\frac{\partial\sigma(\pi_{s}^{\theta},-Q^{\pi^{\theta}}(s,\cdot))}{\partial\pi^{\theta}(a|s)}=\beta^{-1}\frac{e^{\beta Q^{\pi^{\theta}}(s,a)}}{e^{\beta V^{\pi^{\theta}}(s)}}=\beta^{-1}e^{\beta A^{\pi^{\theta}}(s,a)}.

Substituting this into (14), we get

\nabla_{\theta}V^{\pi^{\theta}}(s_{0})=\frac{1}{\beta(1-\gamma)}\sum_{s,a}d^{\widehat{\pi}^{\theta}}_{s_{0}}(s)\,e^{\beta A^{\pi^{\theta}}(s,a)}\,\pi^{\theta}(a|s)\,\nabla_{\theta}\log\pi^{\theta}(a|s),

which completes the proof. ∎