Does Zero-Shot Reinforcement Learning Exist?
백승언, 김현성, 이도현, 정강민
11 June, 2023
Contents
 Introduction
• Current Success of AI
• Meta Reinforcement Learning
 Does Zero-Shot Reinforcement Learning Exist?
• Backgrounds
• Previous Strategies for Zero-Shot RL
• Algorithms for SF and FB Representations
 Experiments
• Environments
• Results
Introduction
Meta Reinforcement Learning
 Problem setting of reinforcement learning and meta-learning
 Reinforcement learning
• Given an MDP, learn a policy $\pi$ that maximizes the expected discounted return $\mathbb{E}_{\pi, p_0}\left[\sum_{t \geq 0} \gamma^{t} r(s_t, a_t, s_{t+1})\right]$
 Meta-learning
• Given data from tasks $\mathcal{T}_1, \dots, \mathcal{T}_N$, quickly solve a new task $\mathcal{T}_{\mathrm{test}}$
 Problem setting of Meta-Reinforcement Learning (Meta-RL)
 Setting 1: meta-learning with diverse goals (goal as a task)
• $\mathcal{T}_i \triangleq \{\mathcal{S}, \mathcal{A}, p(s_0), p(s' \mid s, a), r(s, a, g), g_i\}$
 Setting 2: meta-learning with RL tasks (MDP as a task)
• $\mathcal{T}_i \triangleq \{\mathcal{S}_i, \mathcal{A}_i, p_i(s_0), p_i(s' \mid s, a), r_i(s, a)\}$
(Figure: Meta-RL problem statement from CS-330, Finn)
Does Zero-Shot Reinforcement Learning Exist?
Backgrounds (I) – Defining Zero-Shot RL
 Notation
 Reward-free MDP
• $\mathcal{M} = (S, A, P, \gamma)$ is a reward-free Markov Decision Process (MDP) with state space $S$, action space $A$, transition probabilities $P(s' \mid s, a)$ from state $s$ to $s'$ under action $a$, and discount factor $0 < \gamma < 1$
 Problem statement
 The goal of zero-shot RL is to compute a compact representation $\mathcal{E}$ of the environment by observing samples of reward-free transitions $(s_t, a_t, s_{t+1})$ in this environment
 Once a reward function is specified later, the agent must use $\mathcal{E}$ to immediately produce a good policy, via only elementary computation without any further planning or learning
 Reward functions may be specified at test time either as a relatively small set of reward samples $(s_i, r_i)$, or as an explicit function $s \mapsto r(s)$
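To make this two-phase protocol concrete, here is a minimal Python sketch of the interface it implies; the class and method names (`ZeroShotAgent`, `train_reward_free`, `infer_policy`) are hypothetical and not taken from the paper.

```python
from typing import Callable, Optional, Sequence, Tuple
import numpy as np

Transition = Tuple[np.ndarray, np.ndarray, np.ndarray]  # (s_t, a_t, s_{t+1})

class ZeroShotAgent:
    """Hypothetical interface implied by the zero-shot RL setting."""

    def train_reward_free(self, transitions: Sequence[Transition]) -> None:
        """Phase 1: build the compact representation E from reward-free samples."""
        raise NotImplementedError

    def infer_policy(
        self,
        reward_samples: Optional[Sequence[Tuple[np.ndarray, float]]] = None,
        reward_fn: Optional[Callable[[np.ndarray], float]] = None,
    ) -> Callable[[np.ndarray], np.ndarray]:
        """Phase 2: given reward samples (s_i, r_i) or an explicit r(s), return a
        policy s -> a using only elementary computation (no planning or learning)."""
        raise NotImplementedError
```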
Backgrounds (II) – Important Concepts
 Successor representations (SR)
 For a finite MDP, the successor representation $M^\pi(s_0, a_0)$ of a state-action pair $(s_0, a_0)$ under a policy $\pi$ is defined as the discounted sum of future occurrences of each state
• $M^\pi(s_0, a_0, s) := \mathbb{E}\left[\sum_{t \geq 0} \gamma^t\, \mathbb{1}\{s_{t+1} = s\} \mid s_0, a_0, \pi\right], \quad \forall s \in S$
 In matrix form, SRs can be written as $M^\pi = P \sum_{t \geq 0} \gamma^t P_\pi^t = P (I - \gamma P_\pi)^{-1}$, where $P_\pi$ is the state transition matrix under $\pi$
• $M^\pi$ satisfies the matrix Bellman equation $M^\pi = P + \gamma P_\pi M^\pi$, and the Q-function can be expressed as $Q_r^\pi = M^\pi r$
 Successor features (SFs)
 Successor features extend SRs to continuous MDPs by first assuming a basic feature map $\varphi: S \to \mathbb{R}^d$ that embeds states into a $d$-dimensional space, and defining the expected discounted sum of future state features
• $\psi^\pi(s_0, a_0) := \mathbb{E}\left[\sum_{t \geq 0} \gamma^t \varphi(s_{t+1}) \mid s_0, a_0, \pi\right]$
 Successor measures (SMs)
 Successor measures extend SRs to continuous spaces by treating the distribution of future visited states as a measure $M^\pi$ over the state space $S$
• $M^\pi(s_0, a_0)(X) := \sum_{t \geq 0} \gamma^t \Pr(s_{t+1} \in X \mid s_0, a_0, \pi), \quad \forall X \subset S$, and $\psi^\pi(s_0, a_0) = \int_{s'} M^\pi(s_0, a_0, ds')\, \varphi(s')$
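To ground the finite-MDP case, the following NumPy sketch (with a randomly generated tabular MDP; all names are illustrative) computes the successor representation in matrix form, $M^\pi = P(I - \gamma P_\pi)^{-1}$, and checks that $Q_r^\pi = M^\pi r$ agrees with standard policy evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9

# Random tabular dynamics P[s, a, s'] and a fixed stochastic policy pi[s, a]
P = rng.random((nS, nA, nS)); P /= P.sum(-1, keepdims=True)
pi = rng.random((nS, nA));    pi /= pi.sum(-1, keepdims=True)

# State-to-state transition matrix under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
P_pi = np.einsum("sa,sax->sx", pi, P)

# Successor representation over state-action pairs: M_pi = P (I - gamma * P_pi)^{-1}
P_flat = P.reshape(nS * nA, nS)                       # rows indexed by (s, a)
M_pi = P_flat @ np.linalg.inv(np.eye(nS) - gamma * P_pi)

# Any state reward r gives the Q-function directly: Q_r^pi = M_pi r
r = rng.random(nS)
Q_from_SR = (M_pi @ r).reshape(nS, nA)

# Cross-check against ordinary policy evaluation (reward received at s_{t+1})
V = np.linalg.inv(np.eye(nS) - gamma * P_pi) @ (P_pi @ r)
Q_eval = np.einsum("sax,x->sa", P, r + gamma * V)
assert np.allclose(Q_from_SR, Q_eval)
```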
Previous Strategies for Zero-Shot RL
 Zero-shot RL from successor features
 Given a basic feature map $\varphi: S \to \mathbb{R}^d$ learned via some other criterion, universal SFs learn the successor features of a particular family of policies $\pi_z$ for $z \in \mathbb{R}^d$:
• $\psi(s_0, a_0, z) = \mathbb{E}\left[\sum_{t \geq 0} \gamma^t \varphi(s_{t+1}) \mid s_0, a_0, \pi_z\right], \quad \pi_z(s) := \mathrm{argmax}_a\, \psi(s, a, z)^T z$
 Once a reward function $r$ is revealed, a few reward samples or explicit knowledge of the function $r$ are used to perform a linear regression of $r$ onto the features $\varphi$
• Namely, $z_r := \mathrm{argmin}_z\, \mathbb{E}_{s \sim \rho}\left[(r(s) - \varphi(s)^T z)^2\right] = \mathbb{E}_\rho[\varphi \varphi^T]^{-1} \mathbb{E}_\rho[\varphi r]$; then the policy $\pi_{z_r}$ is returned
• This policy is guaranteed to be optimal for all rewards in the linear span of the features $\varphi$
– If $r(s) = \varphi(s)^T w$ for all $s \in S$, then $z_r = w$ and $\pi_{z_r}$ is the optimal policy for reward $r$
 Zero-shot RL from forward-backward representations (FB)
 Forward-backward representations look for $F: S \times A \times \mathbb{R}^d \to \mathbb{R}^d$ and $B: S \to \mathbb{R}^d$ such that the long-term transition probabilities $M^{\pi_z}$ decompose as
• $M^{\pi_z}(s_0, a_0, ds') \approx F(s_0, a_0, z)^T B(s')\, \rho(ds'), \quad \pi_z(s) := \mathrm{argmax}_a\, F(s, a, z)^T z$
• In a finite space, $M^{\pi_z}$ can be decomposed as $M^{\pi_z} = F_z^T B\, \mathrm{diag}(\rho)$
 Once a reward function $r$ is revealed, $z_r := \mathbb{E}_{s \sim \rho}[r(s) B(s)]$ is estimated from a few reward samples or from explicit knowledge of the function $r$ (e.g. $z_r = B(s)$ for the task of reaching state $s$)
• Then the policy $\pi_{z_r}$ is returned
• For any reward function $r$, the policy $\pi_{z_r}$ is optimal for $r$, with optimal Q-function $Q_r^\star = F(s, a, z_r)^T z_r$
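To illustrate how lightweight the test-time step is in both strategies, here is a small NumPy sketch of the two task-inference rules above; `phi` and `B` stand for already-learned features / backward embeddings evaluated at the reward-labelled states, and all names are illustrative rather than the paper's code.

```python
import numpy as np

def infer_z_sf(phi: np.ndarray, r: np.ndarray, reg: float = 1e-6) -> np.ndarray:
    """SF rule: z_r = E[phi phi^T]^{-1} E[phi r], a (ridge-regularized) linear
    regression of the reward samples r_i onto the features phi(s_i)."""
    d = phi.shape[1]
    A = phi.T @ phi / len(phi) + reg * np.eye(d)
    b = phi.T @ r / len(phi)
    return np.linalg.solve(A, b)

def infer_z_fb(B: np.ndarray, r: np.ndarray) -> np.ndarray:
    """FB rule: z_r = E_{s~rho}[r(s) B(s)], a plain average over reward samples."""
    return (B * r[:, None]).mean(axis=0)

# Example with random stand-ins for learned embeddings at n reward-labelled states
rng = np.random.default_rng(0)
n, d = 512, 16
phi, B = rng.normal(size=(n, d)), rng.normal(size=(n, d))
r = rng.normal(size=n)
z_sf, z_fb = infer_z_sf(phi, r), infer_z_fb(B, r)
# The returned policy is then pi_z(s) = argmax_a psi(s, a, z_sf)^T z_sf  (SF)
# or pi_z(s) = argmax_a F(s, a, z_fb)^T z_fb  (FB); no further learning is needed.
```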
Algorithms for SF and FB Representations (I)
 The authors propose novel losses for training $\psi$ in SFs, and $F$, $B$ in FB
 To obtain a full zero-shot RL algorithm, SFs must also specify the basic features $\varphi$; the authors therefore propose ten possible choices based on existing or new representations for RL
 Learning the SF $\psi^T z$ instead of $\psi$
 The successor features $\psi$ satisfy the Bellman equation $\psi^\pi = P\varphi + \gamma P_\pi \psi^\pi$, i.e. the collection of ordinary Bellman equations for each component of $\varphi$, as in BBQ-networks
 Therefore $\psi(s, a, z)$ for each $z$ could be trained by minimizing the vector-valued Bellman residual (with $z$ sampled at random during training)
• $\left\| \psi(s_t, a_t, z) - \varphi(s_{t+1}) - \gamma\, \psi(s_{t+1}, \pi_z(s_{t+1}), z) \right\|^2$
 Instead of this vector-valued Bellman residual, they propose the novel loss (a code sketch follows below)
• $\mathcal{L}(\psi) := \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho}\left[ \left( \psi(s_t, a_t, z)^T z - \varphi(s_{t+1})^T z - \gamma\, \psi(s_{t+1}, \pi_z(s_{t+1}), z)^T z \right)^2 \right]$ for each $z$
 This trains $\psi(\cdot, z)^T z$ as the Q-function of the reward $\varphi^T z$, the only case needed, whereas training the full vector $\psi(\cdot, z)$ amounts to training the Q-functions of policy $\pi_z$ for all rewards $\varphi^T z'$ with $z' \in \mathbb{R}^d$, including $z' \neq z$.
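A minimal PyTorch-style sketch of this projected SF loss is shown below, assuming hypothetical networks `psi_net(s, a, z)` and `psi_target` returning $d$-dimensional vectors, a fixed feature map `phi`, and a greedy policy `pi_z`; none of these names come from the authors' implementation.

```python
import torch

def projected_sf_loss(psi_net, psi_target, phi, pi_z, batch, z, gamma=0.98):
    """L(psi) = E[(psi(s,a,z)^T z - phi(s')^T z - gamma * psi(s', pi_z(s'), z)^T z)^2].

    batch: tensors (s, a, s_next) of shape (B, ...); z: tensor of shape (B, d),
    sampled at random per batch from a chosen sampling scheme.
    """
    s, a, s_next = batch
    q = (psi_net(s, a, z) * z).sum(-1)                        # psi(s,a,z)^T z
    with torch.no_grad():
        a_next = pi_z(s_next, z)                              # argmax_a psi(s',a,z)^T z
        target = (phi(s_next) * z).sum(-1) \
               + gamma * (psi_target(s_next, a_next, z) * z).sum(-1)
    return ((q - target) ** 2).mean()
```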
Algorithms for SF and FB Representations (II)
 Learning the FB representations: the FB training loss
 The successor measure $M^\pi$ satisfies a Bellman-like equation, $M^\pi = P + \gamma P_\pi M^\pi$, as matrices in the finite case and as measures in the general case [Blier et al.]
• For any policy $\pi_z$, the Q-function for a reward $r$ can be written in matrix form as $Q_r^{\pi_z} = M^{\pi_z} r$
– This equals $F_z^T B\, \mathrm{diag}(\rho)\, r$; thus, setting $z_r := B\, \mathrm{diag}(\rho)\, r = \mathbb{E}_{s \sim \rho}[B(s)\, r(s)]$, the Q-function is obtained as $Q_r^{\pi_z} = F_z^T z_r$ for any $z \in \mathbb{R}^d$
 FB can be learned by iteratively minimizing the Bellman residual of the parametric model $M = F^T B \rho$
• Using a suitable norm $\| \cdot \|_\rho$ for the Bellman residual leads to a loss expressed as an expectation over the dataset (a code sketch follows below):
• $\mathcal{L}(F, B) := \left\| F_z^T B \rho - (P + \gamma P_{\pi_z} F_z^T B \rho) \right\|_\rho^2 = \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho,\, s' \sim \rho}\left[ \left( F(s_t, a_t, z)^T B(s') - \gamma\, F(s_{t+1}, \pi_z(s_{t+1}), z)^T B(s') \right)^2 \right] - 2\, \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho}\left[ F(s_t, a_t, z)^T B(s_{t+1}) \right] + \mathrm{const}$
– $z$ is sampled at random during training
 The authors note that the last term involves $B(s_{t+1})$ rather than $B(s_t)$ because they use $s_{t+1}$ instead of $s_t$ in the definition of the successor measure
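Below is a hedged PyTorch-style sketch of the displayed FB loss, assuming hypothetical networks `F_net(s, a, z)` and `B_net(s)` with target copies `F_tgt` / `B_tgt`, a greedy policy `pi_z`, and in-batch reshuffling to approximate $s' \sim \rho$; these choices are illustrative, not the paper's exact implementation.

```python
import torch

def fb_loss(F_net, B_net, F_tgt, B_tgt, pi_z, batch, z, gamma=0.98):
    """Empirical version of L(F, B): squared Bellman residual on independent s'
    minus 2 * on-transition term; the rho-dependent constant is dropped."""
    s, a, s_next = batch                                   # (B, ...) tensors from the buffer
    s_rand = s_next[torch.randperm(s_next.shape[0])]       # rough stand-in for s' ~ rho

    Fsa = F_net(s, a, z)                                   # (B, d)
    M = Fsa @ B_net(s_rand).T                              # F(s_t,a_t,z)^T B(s')  -> (B, B)
    with torch.no_grad():
        a_next = pi_z(s_next, z)
        M_next = F_tgt(s_next, a_next, z) @ B_tgt(s_rand).T
    resid = ((M - gamma * M_next) ** 2).mean()             # first expectation
    on_diag = (Fsa * B_net(s_next)).sum(-1).mean()         # F(s_t,a_t,z)^T B(s_{t+1})
    return resid - 2.0 * on_diag
```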
Algorithms for SF and FB Representations (III)
 Learning basic features $\varphi$ for SF
 Any representation learning method could be used to supply $\varphi$
• The authors suggest ten basic features and describe the precise learning objective for each
 1. Random Feature (Rand)
• Use a non-trainable, randomly initialized network as the features
 2. Autoencoder (AEnc)
• Learn a decoder $f: \mathbb{R}^d \to S$ to recover the state from its representation $\varphi$
– $\min_{f, \varphi} \mathbb{E}_{s \sim \mathcal{D}}\left[ \| f(\varphi(s)) - s \|^2 \right]$
 3. Inverse Curiosity Module (ICM)
• Aims at extracting the controllable aspects of the environment
• Trains an inverse dynamics model $g: \mathbb{R}^d \times \mathbb{R}^d \to A$ to predict the action used for a transition between two consecutive states (a code sketch follows below)
– $\min_{g, \varphi} \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\left[ \| g(\varphi(s_t), \varphi(s_{t+1})) - a_t \|^2 \right]$
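As referenced in item 3, here is a minimal PyTorch sketch of the ICM-style inverse-dynamics objective for continuous actions; network sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InverseDynamicsFeatures(nn.Module):
    """phi: S -> R^d plus g: (phi(s_t), phi(s_{t+1})) -> predicted action."""

    def __init__(self, state_dim: int, action_dim: int, d: int = 64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, d))
        self.g = nn.Sequential(nn.Linear(2 * d, 256), nn.ReLU(), nn.Linear(256, action_dim))

    def loss(self, s_t, a_t, s_next):
        z_t, z_next = self.phi(s_t), self.phi(s_next)
        a_pred = self.g(torch.cat([z_t, z_next], dim=-1))
        return ((a_pred - a_t) ** 2).sum(-1).mean()   # || g(phi(s_t), phi(s_{t+1})) - a_t ||^2
```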
Algorithms for SF and FB Representations (IV)
 Learning basic features $\varphi$ for SF
 4. Transition model (Trans)
• Learn a one-step dynamics model $f: \mathbb{R}^d \times A \to S$ that predicts the next state from the current state representation
– $\min_{f, \varphi} \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\left[ \| f(\varphi(s_t), a_t) - s_{t+1} \|^2 \right]$
 5. Latent transition model (Latent)
• Learn a latent dynamics model that predicts the representation of the next state rather than the next state itself
– $\min_{f, \varphi} \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\left[ \| f(\varphi(s_t), a_t) - \varphi(s_{t+1}) \|^2 \right]$
 6. Laplacian Eigenfunction (Lap)
• Wu et al. consider the symmetrized MDP graph Laplacian induced by an exploratory policy $\pi$, defined as $\mathcal{L} = I - \frac{1}{2}\left( P_\pi\, \mathrm{diag}(\rho)^{-1} + \mathrm{diag}(\rho)^{-1} P_\pi^T \right)$
• They propose to learn the eigenfunctions of $\mathcal{L}$ via the spectral graph drawing objective (a code sketch follows below):
– $\min_{\varphi} \mathbb{E}_{(s_t, s_{t+1}) \sim \mathcal{D}}\left[ \| \varphi(s_t) - \varphi(s_{t+1}) \|^2 \right] + \lambda\, \mathbb{E}_{s \sim \mathcal{D},\, s' \sim \mathcal{D}}\left[ \left( \varphi(s)^T \varphi(s') \right)^2 - \| \varphi(s) \|_2^2 - \| \varphi(s') \|_2^2 \right]$
– where the second term is an orthonormality regularizer ensuring that $\mathbb{E}_{s \sim \rho}\left[ \varphi(s) \varphi(s)^T \right] \approx I$
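As referenced in item 6, a compact PyTorch sketch of the spectral graph drawing objective follows; the feature network `phi`, the batch variables, and the weight `lam` are illustrative assumptions.

```python
import torch

def laplacian_feature_loss(phi, s_t, s_next, s_rand, lam: float = 1.0):
    """Spectral graph-drawing loss: smoothness along observed transitions plus an
    orthonormality penalty pushing E[phi phi^T] toward the identity."""
    z_t, z_next = phi(s_t), phi(s_next)
    smooth = ((z_t - z_next) ** 2).sum(-1).mean()          # ||phi(s_t) - phi(s_{t+1})||^2

    z = phi(s_rand)
    z_p = phi(s_rand[torch.randperm(s_rand.shape[0])])     # independent s' ~ D (reshuffled)
    ortho = ((z * z_p).sum(-1) ** 2                        # (phi(s)^T phi(s'))^2
             - (z ** 2).sum(-1) - (z_p ** 2).sum(-1)).mean()
    return smooth + lam * ortho
```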
Algorithms for SF and FB Representations (V)
 Learning basic features $\varphi$ for SF
 7. Low-Rank Approximation of P
• Learn features by estimating a low-rank model of the transition probability densities: $P(ds' \mid s, a) \approx \mathcal{X}(s, a)^T \mu(s')\, \rho(ds')$
• The corresponding loss on $\mathcal{X}^T \mu - P/\rho$ can be expressed as
– $\min_{\mathcal{X}, \mu} \mathbb{E}_{(s_t, a_t) \sim \rho,\, s' \sim \rho}\left[ \left( \mathcal{X}(s_t, a_t)^T \mu(s') - \frac{P(ds' \mid s_t, a_t)}{\rho(ds')} \right)^2 \right] = \mathbb{E}_{(s_t, a_t) \sim \rho,\, s' \sim \rho}\left[ \left( \mathcal{X}(s_t, a_t)^T \mu(s') \right)^2 \right] - 2\, \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho}\left[ \mathcal{X}(s_t, a_t)^T \mu(s_{t+1}) \right] + C$
– This loss is a special case of the FB loss obtained by setting $\gamma = 0$ and omitting $z$
 8. Contrastive Learning
• Learn representations by pushing positive pairs closer together while keeping negative pairs apart (a code sketch follows below)
– Here, two states are considered similar if they lie close to each other on the same trajectory
• They propose a SimCLR-like objective:
– $\min_{\mathcal{X}, \varphi} -\mathbb{E}_{k \sim \mathrm{Geom}(1 - \gamma_{CL}),\, (s_t, s_{t+k}) \sim \mathcal{D}}\left[ \log \frac{\exp\left( \mathrm{cosine}(\mathcal{X}(s_t), \varphi(s_{t+k})) \right)}{\mathbb{E}_{s' \sim \mathcal{D}}\left[ \exp\left( \mathrm{cosine}(\mathcal{X}(s_t), \varphi(s')) \right) \right]} \right], \quad \mathrm{cosine}(u, v) = \frac{u^T v}{\| u \|_2 \| v \|_2}$
 9. Low-Rank Approximation of SR
• Learn features by estimating a low-rank model of the successor measure of the exploration policy
– $\min_{\mathcal{X}, \varphi} \mathbb{E}_{(s_t, s_{t+1}) \sim \mathcal{D},\, s' \sim \mathcal{D}}\left[ \left( \mathcal{X}(s_t)^T \varphi(s') - \gamma\, \bar{\mathcal{X}}(s_{t+1})^T \bar{\varphi}(s') \right)^2 \right] - 2\, \mathbb{E}_{(s_t, s_{t+1}) \sim \mathcal{D}}\left[ \mathcal{X}(s_t)^T \varphi(s_{t+1}) \right]$, where $\bar{\mathcal{X}}$ and $\bar{\varphi}$ are target versions of $\mathcal{X}$ and $\varphi$
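As referenced in item 8, here is a hedged PyTorch sketch of a SimCLR-style objective over trajectory pairs, using in-batch negatives to approximate $\mathbb{E}_{s' \sim \mathcal{D}}$; the networks `x_net` / `phi_net` and the in-batch-negative approximation are illustrative assumptions, and the geometric sampling of the offset $k$ is assumed to happen when the batch is built.

```python
import torch
import torch.nn.functional as F

def contrastive_feature_loss(x_net, phi_net, s_t, s_tk, temperature: float = 1.0):
    """SimCLR-like loss: s_t and s_{t+k} from the same trajectory form a positive pair
    (k ~ Geom(1 - gamma_CL) at sampling time); other states in the batch act as negatives."""
    u = F.normalize(x_net(s_t), dim=-1)        # X(s_t), unit norm so dot product = cosine
    v = F.normalize(phi_net(s_tk), dim=-1)     # phi(s_{t+k})
    logits = u @ v.T / temperature             # (B, B) cosine similarities
    labels = torch.arange(u.shape[0], device=u.device)   # positives on the diagonal
    # In-batch softmax over the rows approximates the expectation in the denominator.
    return F.cross_entropy(logits, labels)
```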
Experiments
Environments
 All methods were tested on the DeepMind Control Suite (ExORL benchmark)
 Tasks and environments
• Point-mass Maze
– State / action: 4 / 2 dim vectors
• Walker: a planar walker
– State / action: 24 / 6 dim vectors
• Cheetah: a running planar biped
– State / action: 17 / 6 dim vectors
• Quadruped: a four-legged ant navigating in 3D space
– State / action: 78 / 12 dim vectors
 Replay buffers
• RND: 10M training transitions collected with RND
• APS: 10M training transitions collected with APS
• Proto: 10M training transitions collected with ProtoRL
(Figures: Point-mass Maze / Walker, Cheetah / Quadruped)
Results – Zero-Shot Performance of Proposed Methods
 Comparison of 11 methods (FB and 10 SF-based models) against offline/online TD3
 Performance of each method for each task in each environment, averaged over the three buffers and 10 seeds
 Control group
• Online TD3: with task reward and free environment interactions
• Offline TD3: with task reward, trained from the replay buffer
 FB and Lap outperform the other methods
• FB and Lap reach about 83% and 76% of supervised offline TD3 performance, respectively
(Figures: average scores over tasks for each environment; average plots of zero-shot results over tasks, environments, buffers, and seeds)
Thank you!
Q&A

More Related Content

PPTX
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptx
PDF
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
PPTX
Koh_Liang_ICML2017
PDF
Stochastic optimal control &amp; rl
PDF
Metrics for generativemodels
PPTX
Learning group em - 20171025 - copy
PDF
Generalized Laplace - Mellin Integral Transformation
PDF
geg 113 adds.pdf undergraduate presentation
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptx
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
Koh_Liang_ICML2017
Stochastic optimal control &amp; rl
Metrics for generativemodels
Learning group em - 20171025 - copy
Generalized Laplace - Mellin Integral Transformation
geg 113 adds.pdf undergraduate presentation

Similar to Does Zero-Shot RL Exist (20)

PDF
Distributional RL via Moment Matching
PPTX
Presentation
PDF
Specific topics in optimisation
PDF
NIPS KANSAI Reading Group #5: State Aware Imitation Learning
PPTX
Gradient Boosting
PPTX
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
PDF
Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)
PPTX
Learning a nonlinear embedding by preserving class neibourhood structure 최종
PDF
Basic calculus (i)
PPTX
Linear regression, costs & gradient descent
PPTX
Restricted boltzmann machine
PPTX
Machine learning ppt and presentation code
PDF
Optimum Engineering Design - Day 2b. Classical Optimization methods
PDF
Support vector machines
PDF
Estimation Theory Class (Summary and Revision)
PDF
E0561719
PPTX
Review of Seiberg Witten duality.pptx
PDF
7_logistic-regression presentation sur la regression logistique.pdf
PPTX
Inverse Function.pptx
PPTX
MAT-314 Relations and Functions
Distributional RL via Moment Matching
Presentation
Specific topics in optimisation
NIPS KANSAI Reading Group #5: State Aware Imitation Learning
Gradient Boosting
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)
Learning a nonlinear embedding by preserving class neibourhood structure 최종
Basic calculus (i)
Linear regression, costs & gradient descent
Restricted boltzmann machine
Machine learning ppt and presentation code
Optimum Engineering Design - Day 2b. Classical Optimization methods
Support vector machines
Estimation Theory Class (Summary and Revision)
E0561719
Review of Seiberg Witten duality.pptx
7_logistic-regression presentation sur la regression logistique.pdf
Inverse Function.pptx
MAT-314 Relations and Functions
Ad

Recently uploaded (20)

PPTX
Phase1_final PPTuwhefoegfohwfoiehfoegg.pptx
PPTX
Managing Community Partner Relationships
PDF
Capcut Pro Crack For PC Latest Version {Fully Unlocked 2025}
PPTX
CYBER SECURITY the Next Warefare Tactics
PPTX
modul_python (1).pptx for professional and student
PPTX
SET 1 Compulsory MNH machine learning intro
PPTX
sac 451hinhgsgshssjsjsjheegdggeegegdggddgeg.pptx
PPT
Image processing and pattern recognition 2.ppt
PPTX
Business_Capability_Map_Collection__pptx
DOCX
Factor Analysis Word Document Presentation
PPT
DU, AIS, Big Data and Data Analytics.ppt
PPTX
New ISO 27001_2022 standard and the changes
PDF
REAL ILLUMINATI AGENT IN KAMPALA UGANDA CALL ON+256765750853/0705037305
PDF
Global Data and Analytics Market Outlook Report
PPTX
IMPACT OF LANDSLIDE.....................
PPTX
Introduction to Inferential Statistics.pptx
PPTX
Lesson-01intheselfoflifeofthekennyrogersoftheunderstandoftheunderstanded
PPTX
A Complete Guide to Streamlining Business Processes
PDF
Optimise Shopper Experiences with a Strong Data Estate.pdf
PPTX
Steganography Project Steganography Project .pptx
Phase1_final PPTuwhefoegfohwfoiehfoegg.pptx
Managing Community Partner Relationships
Capcut Pro Crack For PC Latest Version {Fully Unlocked 2025}
CYBER SECURITY the Next Warefare Tactics
modul_python (1).pptx for professional and student
SET 1 Compulsory MNH machine learning intro
sac 451hinhgsgshssjsjsjheegdggeegegdggddgeg.pptx
Image processing and pattern recognition 2.ppt
Business_Capability_Map_Collection__pptx
Factor Analysis Word Document Presentation
DU, AIS, Big Data and Data Analytics.ppt
New ISO 27001_2022 standard and the changes
REAL ILLUMINATI AGENT IN KAMPALA UGANDA CALL ON+256765750853/0705037305
Global Data and Analytics Market Outlook Report
IMPACT OF LANDSLIDE.....................
Introduction to Inferential Statistics.pptx
Lesson-01intheselfoflifeofthekennyrogersoftheunderstandoftheunderstanded
A Complete Guide to Streamlining Business Processes
Optimise Shopper Experiences with a Strong Data Estate.pdf
Steganography Project Steganography Project .pptx
Ad

Does Zero-Shot RL Exist

  • 1. 1 Does Zero-Shot Reinforcement Learning Exist? 백승언, 김현성, 이도현 , 정강민 11 June, 2023
  • 2. 2  Introduction  Current Success of AI  Meta Reinforcement Learning  Does Zero-Shot Reinforcement Learning Exist?  Backgrounds  Previous Strategies for Zero-Shot RL  Algorithms for SF and FB Representations  Experiments  Environments  Results Contents
  • 4. 4  Problem setting of reinforcement learning, meta-learning  Reinforcement learning • Given certain MDP, lean a policy 𝜋 that maximize the expected discounted return 𝔼𝜋,𝑝0 Σ𝑡=0 ∞ 𝛾𝑡−1 𝑟𝑡 𝑠𝑡, 𝑎𝑡, 𝑠𝑡+1  Meta-learning • Given data from 𝒯 1, … , 𝒯 N, quickly solve new task 𝒯 𝑡𝑒𝑠𝑡  Problem setting of Meta-Reinforcement Learning(Meta-RL)  Setting 1: meta-learning with diverse goal(goal as a task) • 𝒯 𝑖 ≜ {𝒮, 𝒜, 𝑝 𝑠0 , 𝑝 𝑠′ 𝑠, 𝑎 , 𝑟 𝑠, 𝑎, 𝑔 , 𝑔𝑖}  Setting 2: meta-learning with RL tasks(MDP as a task) • 𝒯 𝑖 ≜ 𝒮𝑖, 𝒜𝑖, 𝑝𝑖 𝑠0 , 𝑝𝑖 𝑠′ 𝑠, 𝑎 , 𝑟𝑖(𝑠, 𝑎) Meta Reinforcement Learning Meta RL problem statement in CS-330(Finn)
  • 6. 6  Notation  Reward-free MDP • ℳ = (𝑆, 𝐴, 𝑃, 𝛾) is a reward-free Markov Decision Process(MDP) with state space 𝑆, action space 𝐴, transition probability 𝑃(𝑠′ |𝑠, 𝑎) from state 𝑠 to 𝑠′ given action 𝑎 and discount factor 0 < 𝛾 < 1  Problem statement  Goal of zero-shot RL is to compute a compact representation ℰ of the env by observing samples of reward-free transitions (𝑠𝑡, 𝑎𝑡, 𝑠𝑡+1) in this env  Once a reward function is specified later, the agent must use ℰ to immediately produce a good policy, via only elementary computation without any further planning or learning  Reward functions may be specified at test time either as a relatively small set of reward samples (𝑠𝑖, 𝑟𝑖), or as an explicit function 𝑠 → 𝑟(𝑠) Backgrounds (I) – Defining Zero-Shot RL
  • 7. 7  Successor representations (SR)  For a finite MDP, the successor representation 𝑀𝜋 (𝑠0, 𝑎0) of a state-action pair(𝑠0, 𝑎0) under a policy 𝜋, is defined as the discounted sum of future occurrences of each state • 𝑀𝜋 𝑠0, 𝑎0, 𝑠 ≔ 𝔼 Σ𝑡≥0𝛾𝑡 𝕀 𝑠𝑡+1 = 𝑠 | 𝑠0, 𝑎0 , 𝜋 , ∀𝑠 ∈ 𝑆  In matrix form, SRs can be written as 𝑀𝜋 = 𝑃Σ𝑡≥0𝛾𝑡𝑃𝜋 𝑡 = 𝑃 𝐼 − 𝛾𝑃𝜋 −1, 𝑃𝜋 𝑖𝑠 𝑡ℎ𝑒 𝑠𝑡𝑎𝑡𝑒 𝑡𝑟𝑎𝑛𝑠𝑖𝑡𝑖𝑜𝑛 𝑝𝑟𝑜𝑏 • 𝑀𝜋 satisfies the matrix Bellman Equation 𝑀𝜋 = 𝑃 + 𝛾𝑃𝜋𝑀𝜋 , the Q-function can be expressed as 𝑄𝑟 𝜋 = 𝑀𝜋 𝑟  Successor features (SFs)  Successor features extend SR to continuous MDPs by first assuming we are given a basic feature map 𝜑: 𝑆 → ℝ𝑑 that embeds states into 𝑑-dimensional space, and defining the expected discounted sum of future state features • 𝜓𝜋 𝑠0, 𝑎0 ≔ 𝔼 Σ𝑡≥0𝛾𝑡 𝜑 𝑠𝑡+1 | 𝑠0, 𝑎0, 𝜋  Successor measures (SMs)  Successor measures extend SRs to continuous spaces by treating the distribution of future visited states as a measure 𝑀𝜋 over the state space 𝑆 • 𝑀𝜋 𝑠0, 𝑎0 𝑋 ≔ Σ𝑡≥0𝛾𝑡 Pr 𝑠𝑡+1 ∈ 𝑋 𝑠0, 𝑎0, 𝜋), ∀𝑋 ⊂ 𝑆, 𝜓𝜋 𝑠0, 𝑎0 = 𝑠′ . 𝑀𝜋 𝑠0, 𝑎0, 𝑑𝑠′ 𝜑(𝑠′ ) Backgrounds (II) – Important Concept
  • 8. 8  Zero-shot RL from successor features  Given a basic feature map 𝜑: 𝑆 → ℝ𝑑 to be learned via another criterion, universal SFs learn the successor features of a particular family of policies 𝜋𝑧 for 𝑧 ∈ ℝ𝑑 , • 𝜓 𝑠0, 𝑎0, 𝑧 = 𝔼 Σ𝑡≥0𝛾𝑡 𝜑 𝑠𝑡+1 | 𝑠0, 𝑎0 , 𝜋𝑧 , 𝜋𝑧 𝑠 ≔ argmax𝑎𝜓 𝑠, 𝑎, 𝑧 𝑇 𝑧  Once a reward function 𝑟 is revealed, few reward samples of explicit knowledge of the function 𝑟 are used to perform a linear regression of 𝑟 onto the features 𝜑 • Namely, the 𝑧𝑟 ≔ argmin𝑧𝔼𝑠~𝜌 𝑟 𝑠 − 𝜑 𝑠 𝑇 𝑧 2 = 𝔼𝜌 𝜑𝜑𝑇 −1 𝔼𝜌 𝜑𝑟 => then the policy 𝜋𝑧 is returned • This policy is guaranteed to be optimal for all rewards in the linear span of the features 𝜑 – If 𝑟 𝑠 = 𝜑 𝑠 𝑇 𝑤, ∀𝑠 ∈ 𝑆, then 𝑧𝑟 = 𝑤, and 𝜋𝑧𝑟 is the optimal policy for reward 𝑟  Zero-shot RL from Forward-Backward representation (FB)  Forward-backward representation look for 𝐹: 𝑆 × 𝐴 × ℝ𝑑 → ℝ𝑑 and 𝐵: 𝑆 → ℝ𝑑 such that the long-term transition prob 𝑀𝜋𝑧 decompose as • 𝑀𝜋𝑧 𝑠0, 𝑎0, 𝑑𝑠′ ≈ 𝑭 𝒔𝟎, 𝒂𝟎, 𝒛 𝑻 𝑩 𝒔′ 𝜌 𝑑𝑠′ , 𝜋𝑧 𝑠 ≔ argmax𝑎𝐹 𝑠, 𝑎, 𝑧 𝑇 𝑧 • In a finite space, 𝑀𝜋𝑧 could be decomposed as 𝑀𝜋𝑧 = 𝐹 𝑧 𝑇 𝐵𝑑𝑖𝑎𝑔(𝜌)  Once a reward function 𝑟 is revealed, 𝑧𝑟 ≔ 𝔼𝑠~𝜌 𝑟 𝑠 𝐵(𝑠) from a few reward samples or from explicit knowledge of the function 𝑟 are estimated(e.g. 𝑧𝑟 = 𝐵 𝑠 𝑡𝑜 𝑟𝑒𝑎𝑐ℎ 𝑠) • Then the policy 𝜋𝑧𝑟 is returned. • Any reward function 𝑟, the policy 𝜋𝑧𝑟 is optimal for 𝑟, with optimal Q-function 𝑄𝑟 ⋆ = 𝐹 𝑠, 𝑎, 𝑧𝑟 𝑇 𝑧𝑟 Previous Strategies for Zero-Shot RL
  • 9. 9  The authors suggested the novel losses used to train 𝝍 in SFs, and F, B in FB  To obtain a full zero-shot RL algorithm, SFs must specify the basic feature 𝜑, thus they proposed the ten p ossible choices based on existing or new representations for RL  Learning the SF 𝝍𝐓𝐳 instead of 𝝍  The successor feature 𝜓 satisfy the Bellman equation 𝜓𝜋 = 𝑃𝜑 + 𝛾𝑃𝜋𝜓𝜋, the collection of ordinary Bellman Eq for each component of 𝜑 in BBQ-network  Therefore 𝜓 𝑠, 𝑎, 𝑧 for each 𝑧 could be trained by minimizing the Bellman residuals as follows, • 𝜓 𝑠𝑡, 𝑎𝑡, 𝑧 − 𝜑 𝑠𝑡+1 − 𝛾𝜓(𝑠𝑡+1, 𝜋𝑧 𝑠𝑡+1 , 𝑧 2 , 𝑧 𝑖𝑠 𝑟𝑎𝑛𝑑𝑜𝑚 𝑠𝑎𝑚𝑝𝑙𝑒𝑑 𝑏𝑦 𝑙𝑜𝑔𝑖𝑐  They proposed the novel loss instead of the vector-valued Bellman residual above, • ℒ 𝜓 ≔ 𝔼 𝑠𝑡,𝑎𝑡,𝑠𝑡+1 ~𝜌 𝜓(𝑠𝑡, 𝑎𝑡, 𝑧 𝑇 𝑧 − 𝜑 𝑠𝑡+1 𝑇 𝑧 − 𝛾𝜓 𝑠𝑡+1, 𝜋𝑧 𝑠𝑡+1 , 𝑧 𝑇 𝑧 2 for each 𝑧  This trains 𝜓 ⋅, 𝑧 𝑇 𝑧 as the Q-function of reward 𝜑𝑇 𝑧, the only case needed, while training the full vector 𝜓(⋅, 𝑧) amounts to training the Q-functions of each policy 𝜋𝑧 for all rewards 𝜑𝑇𝑧′ for all 𝑧′ ∈ ℝ𝑑 including 𝑧′ ≠ 𝑧. Algorithms for SF and FB Representations (I)
  • 10. 10  Learning the FB representations: the FB training loss  The successor measure 𝑀𝜋 satisfies a Bellman-like equation, 𝑀𝜋 = 𝑃 + 𝛾𝑃𝜋𝑀𝜋 , as matrices in the finite ca se and as measures in the general case in [Blier et al] • For any policy 𝜋𝑧, the Q-function for the reward 𝑟 can be written 𝑄𝑟 𝜋𝑧 = 𝑀𝜋𝑧𝑟 in matrix form. – This is equal to 𝐹 𝑧 𝑇 𝐵𝑑𝑖𝑎𝑔 𝜌 𝑟; thus assume that the 𝑧𝑟 ≔ 𝐵𝑑𝑖𝑎𝑔 𝜌 𝑟 = 𝔼𝑠~𝜌 𝐵 𝑠 𝑟(𝑠) , the Q-function is obtained as 𝑄𝑟 𝜋𝑧 = 𝐹 𝑧 𝑇 𝑧𝑟 for any 𝑧 ∈ ℝ𝑑  FB could be learned by iteratively minimizing the Bellman residual on the parametric model 𝑀 = 𝐹𝑇𝐵𝜌. • Using a suitable norm ⋅ 𝜌 for the bellman residual leads to a loss expressed as expectation from the dataset • ℒ 𝐹, 𝐵 ≔ 𝐹𝑧 𝑇 𝐵𝜌 − (𝑃 + 𝛾𝑃𝜋𝑧 𝐹𝑧 𝑇 𝐵𝜌) = 𝔼 𝑠𝑡,𝑎𝑡,𝑠𝑡+1 ~𝜌,𝑠′~𝜌 𝐹 𝑠𝑡, 𝑎𝑡, 𝑧 𝑇 𝐵 𝑠′ − 𝛾𝐹 𝑠𝑡+1, 𝜋𝑧 𝑠𝑡+1 , 𝑧 𝑇 𝐵 𝑠′ 2 − 2𝔼 𝑠𝑡,𝑎𝑡,𝑠𝑡+1 ~𝜌 𝐹 𝑠𝑡, 𝑎𝑡, 𝑧 𝑇 𝐵(𝑠𝑡+1) + 𝐶𝑜𝑛𝑠𝑡 – 𝑧 is random sampled by logic  The authors proposed that the last term involves 𝐵(𝑠𝑡+1) instead of 𝐵(𝑠𝑡), because they used 𝑠𝑡+1 instead of 𝑠𝑡 for the successor measure Algorithms for SF and FB representations (II)
  • 11. 11  Learning basic features 𝝋 for SF  Any representation learning method could be used to supply 𝜑 • The authors suggested the 10 basic features and described the precise learning objective for each  1. Random Feature (Rand) • Using a non-trainable randomly initialized network as features  2. Autoencoder (AEnc) • Learning a decoder 𝑓: ℝ𝑑 → 𝑆 to recover the state from its representation 𝜑 – min 𝑓,𝜑 𝔼𝑠~𝒟 𝑓 𝜑(𝑠) − 𝑠 2  3. Inverse Curiosity Module (ICM) • Aiming at extracting the controllable aspects of the environment • Training an inverse dynamics model 𝑔: ℝ𝑑 × ℝ𝑑 → 𝐴 to predict the action used for a transition between two consecu tive states – min 𝑔,𝜑 𝔼 𝑠𝑡,𝑎𝑡,𝑠𝑡+1 ~𝒟 𝑔 𝜑 𝑠𝑡 , 𝜑 𝑠𝑡+1 − 𝑎𝑡 2 Algorithms for SF and FB representations (III)
  • 12. 12  Learning basic features 𝝋 for SF  4. Transition model (Trans) • Learning the one-step dynamics 𝑓: ℝ𝑑 × 𝐴 → 𝑆 that predicts the next state from the current state representation – min 𝑓,𝜑 𝔼 𝑠𝑡,𝑎𝑡,𝑠𝑡+1 ~𝒟 𝑓 𝜑 𝑠𝑡 , 𝑎𝑡 − 𝑠𝑡+1 2  5. Latent transition model (Latent) • Learning the latent dynamics model but instead of predicting the next state, it predicts its representation – min 𝑓,𝜑 𝔼 𝑠𝑡,𝑎𝑡,𝑠𝑡+1 ~𝒟 𝑓 𝜑 𝑠𝑡 , 𝑎𝑡 − 𝜑 𝑠𝑡+1 2  6. Laplacian Eigenfunction (Lap) • Wu et al consider the symmetrized MDP graph Laplacian induced by an exploratory policy 𝜋, defined as ℒ = 𝐼 − 1 2 𝑃𝜋𝑑𝑖𝑎𝑔 𝜌 −1 + 𝑑𝑖𝑎𝑔 𝜌 −1 𝑃𝜋 𝑇 • They propose to learn the eigenfunctions of ℒ via the spectral graph drawing objective as follows: – min 𝜑 𝔼 𝑠𝑡,𝑠𝑡+1 ~𝒟 𝜑 𝑠𝑡 − 𝜑(𝑠𝑡+1) 2 + 𝜆𝔼𝑠~𝒟,𝑠′~𝒟 𝜑 𝑠 𝑇 𝜑 𝑠′ 2 − 𝜑(𝑠) 2 2 − 𝜑(𝑠′ ) 2 2 – Where the second term is an orthonormality regularization to ensure that 𝔼𝑠~𝜌 𝜑 𝑠 𝜑 𝑠 𝑇 ≈ 𝐼 Algorithms for SF and FB representations (IV)
  • 13. 13  Learning basic features 𝝋 for SF  7. Low-Rank Approximation of P • Learning the features by estimating a low-rank model of the transition probability densities: 𝑃 𝑑𝑠′ 𝑠, 𝑎 ≈ 𝒳 𝑠, 𝑎 𝑇 𝜇 𝑠′ 𝜌(𝑑𝑠′ ). • The corresponding loss on 𝒳𝑇 𝜇 − 𝑃/𝜌 could be expressed as – min 𝒳,𝜇 𝔼 𝑠𝑡,𝑎𝑡 ~𝜌,𝑠′`~𝜌 𝒳 𝑠𝑡, 𝑎𝑡 𝑇 𝜇 𝑠′ − 𝑃 𝑑𝑠′|𝑠𝑡,𝑎𝑡 𝜌(𝑑𝑠′) 2 = 𝔼 𝑠𝑡,𝑎𝑡 ~𝜌 𝑠′~𝜌 𝒳 𝑠𝑡, 𝑎𝑡 𝑇 𝜇 𝑠′ 2 − 2𝔼 𝑠𝑡,𝑎𝑡,𝑠𝑡+1 ~𝜌 𝒳 𝑠𝑡, 𝑎𝑡 𝑇 𝜇(𝑠𝑡+1) + 𝐶 – This loss is also a special case of the FB loss by setting 𝛾 = 0, 𝑜𝑚𝑖𝑖𝑡𝑖𝑛𝑔 𝑧  8. Contrastive Learning • Learning the representations by pushing positive pairs closer together while keeping negative pairs part – Here, two states are considered similar if they lie close on the same trajectory • They proposed SimCLR-like objective as – min 𝒳,𝜑 −𝔼𝑘~Geom 1−𝛾𝐶𝐿 𝑠𝑡,𝑠𝑡+𝑘 ~𝒟 log exp cosine 𝒳 𝑠𝑡 ,𝜑 𝑠𝑡+𝑘 𝔼𝑠′~𝒟 exp cosine 𝒳 𝑠𝑡 ,𝜑 𝑠′ , cosine 𝑢, 𝑣 = 𝑢𝑇𝑣 𝑢 2 𝑣 2  9. Low-Rank Approximation of SR • Learning the features by estimating a low-rank model of the successor measure for exploration policy – min 𝒳,𝜑 𝔼 𝑠𝑡,𝑠𝑡+1 ~𝒟 𝑠′~𝒟 𝒳 𝑠𝑡 𝑇 𝜑 𝑠′ − 𝛾𝒳 𝑠𝑡+1 𝜑 𝑠′ 2 − 2𝔼 𝑠𝑡,𝑠𝑡+1 ~𝒟 𝒳 𝑠𝑡 𝑇 𝜑 𝑠𝑡+1 , 𝑤ℎ𝑒𝑟𝑒 𝒳 𝑎𝑛𝑑 𝜑 𝑎𝑟𝑒 𝑡𝑎𝑟𝑔𝑒𝑡 𝑣𝑒𝑟𝑠𝑖𝑜𝑛 𝑜𝑓 𝒳 𝑎𝑛𝑑 𝜑 Algorithms for SF and FB representations (V)
  • 15. 15 Environments  All the methods were tested in DeepMind Control Suite(ExORL Benchmarks)  Tasks and environments • Point-mass Maze – State, action: 4/2 dim vectors • Walker: a planner walker – State, action: 24/6 dim vectors • Cheetah: a running planar biped – State, action: 17/6 dim vectors • Quadruped: a four-leg ant navigating in 3D space – State, action: 78/12 dim vectors  Replay buffers • RND: 10M training transition with RND • APS: 10M training transition with APS • Proto: 10M training transition with ProtoRL Point-mass Maze / Walker Cheetah / Quadruped
  • 16. 16  Comparison with 11 methods(FB and 10 SF-based models) and offline/online TD3  The performance of each method for each task in each env, averaged over the three buffers and 10 seeds  Control group • Online TD3: with task reward, and free environment interactions • Offline TD3: with task reward, and training from the replay buffer  FB and Lap show superior performance than other methods • FB and LaP reached about 83% and 76% of supervised offline TD3 performance Results – Zero-shot performance of proposed methods Average scores over tasks for each env Average plots of zero-shot results(task, env, buffer, seeds)
