Does Zero-Shot Reinforcement Learning Exist?
백승언, 김현성, 이도현, 정강민
11 June, 2023
Contents
 Introduction
• Current Success of AI
• Meta Reinforcement Learning
 Does Zero-Shot Reinforcement Learning Exist?
• Backgrounds
• Previous Strategies for Zero-Shot RL
• Algorithms for SF and FB Representations
 Experiments
• Environments
• Results
Introduction
Meta Reinforcement Learning
 Problem setting of reinforcement learning and meta-learning
 Reinforcement learning
• Given an MDP, learn a policy $\pi$ that maximizes the expected discounted return $\mathbb{E}_{\pi, p_0}\left[\sum_{t \geq 0} \gamma^{t} r(s_t, a_t, s_{t+1})\right]$
 Meta-learning
• Given data from tasks $\mathcal{T}_1, \dots, \mathcal{T}_N$, quickly solve a new task $\mathcal{T}_{\mathrm{test}}$
 Problem setting of Meta-Reinforcement Learning (Meta-RL)
 Setting 1: meta-learning with diverse goals (goal as a task)
• $\mathcal{T}_i \triangleq \{\mathcal{S}, \mathcal{A}, p(s_0), p(s' \mid s, a), r(s, a, g), g_i\}$
 Setting 2: meta-learning with RL tasks (MDP as a task)
• $\mathcal{T}_i \triangleq \{\mathcal{S}_i, \mathcal{A}_i, p_i(s_0), p_i(s' \mid s, a), r_i(s, a)\}$
(Figure: Meta-RL problem statement from CS-330, Finn)
Does Zero-Shot Reinforcement Learning Exist?
Backgrounds (I) – Defining Zero-Shot RL
 Notation
 Reward-free MDP
• $\mathcal{M} = (S, A, P, \gamma)$ is a reward-free Markov Decision Process (MDP) with state space $S$, action space $A$, transition probabilities $P(s' \mid s, a)$ from state $s$ to $s'$ under action $a$, and discount factor $0 < \gamma < 1$
 Problem statement
 The goal of zero-shot RL is to compute a compact representation $\mathcal{E}$ of the environment by observing samples of reward-free transitions $(s_t, a_t, s_{t+1})$ in this environment
 Once a reward function is specified later, the agent must use $\mathcal{E}$ to immediately produce a good policy, via only elementary computation without any further planning or learning
 Reward functions may be specified at test time either as a relatively small set of reward samples $(s_i, r_i)$, or as an explicit function $s \mapsto r(s)$
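To make this two-phase protocol concrete, here is a minimal Python sketch of the interface it implies; the class and method names (`ZeroShotAgent`, `train_reward_free`, `infer_policy`) are hypothetical and not taken from the paper.

```python
from typing import Callable, Optional, Sequence, Tuple
import numpy as np

Transition = Tuple[np.ndarray, np.ndarray, np.ndarray]  # (s_t, a_t, s_{t+1})

class ZeroShotAgent:
    """Hypothetical interface implied by the zero-shot RL setting."""

    def train_reward_free(self, transitions: Sequence[Transition]) -> None:
        """Phase 1: build the compact representation E from reward-free samples."""
        raise NotImplementedError

    def infer_policy(
        self,
        reward_samples: Optional[Sequence[Tuple[np.ndarray, float]]] = None,
        reward_fn: Optional[Callable[[np.ndarray], float]] = None,
    ) -> Callable[[np.ndarray], np.ndarray]:
        """Phase 2: given reward samples (s_i, r_i) or an explicit r(s), return a
        policy s -> a using only elementary computation (no planning or learning)."""
        raise NotImplementedError
```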
Backgrounds (II) – Important Concepts
 Successor representations (SR)
 For a finite MDP, the successor representation $M^\pi(s_0, a_0)$ of a state-action pair $(s_0, a_0)$ under a policy $\pi$ is defined as the discounted sum of future occurrences of each state
• $M^\pi(s_0, a_0, s) := \mathbb{E}\left[\sum_{t \geq 0} \gamma^t\, \mathbb{1}\{s_{t+1} = s\} \mid s_0, a_0, \pi\right], \quad \forall s \in S$
 In matrix form, SRs can be written as $M^\pi = P \sum_{t \geq 0} \gamma^t P_\pi^t = P (I - \gamma P_\pi)^{-1}$, where $P_\pi$ is the state transition matrix under $\pi$
• $M^\pi$ satisfies the matrix Bellman equation $M^\pi = P + \gamma P_\pi M^\pi$, and the Q-function can be expressed as $Q_r^\pi = M^\pi r$
 Successor features (SFs)
 Successor features extend SRs to continuous MDPs by first assuming a basic feature map $\varphi: S \to \mathbb{R}^d$ that embeds states into a $d$-dimensional space, and defining the expected discounted sum of future state features
• $\psi^\pi(s_0, a_0) := \mathbb{E}\left[\sum_{t \geq 0} \gamma^t \varphi(s_{t+1}) \mid s_0, a_0, \pi\right]$
 Successor measures (SMs)
 Successor measures extend SRs to continuous spaces by treating the distribution of future visited states as a measure $M^\pi$ over the state space $S$
• $M^\pi(s_0, a_0)(X) := \sum_{t \geq 0} \gamma^t \Pr(s_{t+1} \in X \mid s_0, a_0, \pi), \quad \forall X \subset S$, and $\psi^\pi(s_0, a_0) = \int_{s'} M^\pi(s_0, a_0, ds')\, \varphi(s')$
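To ground the finite-MDP case, the following NumPy sketch (with a randomly generated tabular MDP; all names are illustrative) computes the successor representation in matrix form, $M^\pi = P(I - \gamma P_\pi)^{-1}$, and checks that $Q_r^\pi = M^\pi r$ agrees with standard policy evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9

# Random tabular dynamics P[s, a, s'] and a fixed stochastic policy pi[s, a]
P = rng.random((nS, nA, nS)); P /= P.sum(-1, keepdims=True)
pi = rng.random((nS, nA));    pi /= pi.sum(-1, keepdims=True)

# State-to-state transition matrix under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
P_pi = np.einsum("sa,sax->sx", pi, P)

# Successor representation over state-action pairs: M_pi = P (I - gamma * P_pi)^{-1}
P_flat = P.reshape(nS * nA, nS)                       # rows indexed by (s, a)
M_pi = P_flat @ np.linalg.inv(np.eye(nS) - gamma * P_pi)

# Any state reward r gives the Q-function directly: Q_r^pi = M_pi r
r = rng.random(nS)
Q_from_SR = (M_pi @ r).reshape(nS, nA)

# Cross-check against ordinary policy evaluation (reward received at s_{t+1})
V = np.linalg.inv(np.eye(nS) - gamma * P_pi) @ (P_pi @ r)
Q_eval = np.einsum("sax,x->sa", P, r + gamma * V)
assert np.allclose(Q_from_SR, Q_eval)
```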
Previous Strategies for Zero-Shot RL
 Zero-shot RL from successor features
 Given a basic feature map $\varphi: S \to \mathbb{R}^d$ learned via some other criterion, universal SFs learn the successor features of a particular family of policies $\pi_z$ for $z \in \mathbb{R}^d$:
• $\psi(s_0, a_0, z) = \mathbb{E}\left[\sum_{t \geq 0} \gamma^t \varphi(s_{t+1}) \mid s_0, a_0, \pi_z\right], \quad \pi_z(s) := \mathrm{argmax}_a\, \psi(s, a, z)^T z$
 Once a reward function $r$ is revealed, a few reward samples or explicit knowledge of the function $r$ are used to perform a linear regression of $r$ onto the features $\varphi$
• Namely, $z_r := \mathrm{argmin}_z\, \mathbb{E}_{s \sim \rho}\left[(r(s) - \varphi(s)^T z)^2\right] = \mathbb{E}_\rho[\varphi \varphi^T]^{-1} \mathbb{E}_\rho[\varphi r]$; then the policy $\pi_{z_r}$ is returned
• This policy is guaranteed to be optimal for all rewards in the linear span of the features $\varphi$
– If $r(s) = \varphi(s)^T w$ for all $s \in S$, then $z_r = w$ and $\pi_{z_r}$ is the optimal policy for reward $r$
 Zero-shot RL from forward-backward representations (FB)
 Forward-backward representations look for $F: S \times A \times \mathbb{R}^d \to \mathbb{R}^d$ and $B: S \to \mathbb{R}^d$ such that the long-term transition probabilities $M^{\pi_z}$ decompose as
• $M^{\pi_z}(s_0, a_0, ds') \approx F(s_0, a_0, z)^T B(s')\, \rho(ds'), \quad \pi_z(s) := \mathrm{argmax}_a\, F(s, a, z)^T z$
• In a finite space, $M^{\pi_z}$ can be decomposed as $M^{\pi_z} = F_z^T B\, \mathrm{diag}(\rho)$
 Once a reward function $r$ is revealed, $z_r := \mathbb{E}_{s \sim \rho}[r(s) B(s)]$ is estimated from a few reward samples or from explicit knowledge of the function $r$ (e.g. $z_r = B(s)$ for the task of reaching state $s$)
• Then the policy $\pi_{z_r}$ is returned
• For any reward function $r$, the policy $\pi_{z_r}$ is optimal for $r$, with optimal Q-function $Q_r^\star = F(s, a, z_r)^T z_r$
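To illustrate how lightweight the test-time step is in both strategies, here is a small NumPy sketch of the two task-inference rules above; `phi` and `B` stand for already-learned features / backward embeddings evaluated at the reward-labelled states, and all names are illustrative rather than the paper's code.

```python
import numpy as np

def infer_z_sf(phi: np.ndarray, r: np.ndarray, reg: float = 1e-6) -> np.ndarray:
    """SF rule: z_r = E[phi phi^T]^{-1} E[phi r], a (ridge-regularized) linear
    regression of the reward samples r_i onto the features phi(s_i)."""
    d = phi.shape[1]
    A = phi.T @ phi / len(phi) + reg * np.eye(d)
    b = phi.T @ r / len(phi)
    return np.linalg.solve(A, b)

def infer_z_fb(B: np.ndarray, r: np.ndarray) -> np.ndarray:
    """FB rule: z_r = E_{s~rho}[r(s) B(s)], a plain average over reward samples."""
    return (B * r[:, None]).mean(axis=0)

# Example with random stand-ins for learned embeddings at n reward-labelled states
rng = np.random.default_rng(0)
n, d = 512, 16
phi, B = rng.normal(size=(n, d)), rng.normal(size=(n, d))
r = rng.normal(size=n)
z_sf, z_fb = infer_z_sf(phi, r), infer_z_fb(B, r)
# The returned policy is then pi_z(s) = argmax_a psi(s, a, z_sf)^T z_sf  (SF)
# or pi_z(s) = argmax_a F(s, a, z_fb)^T z_fb  (FB); no further learning is needed.
```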
Algorithms for SF and FB Representations (I)
 The authors propose novel losses for training $\psi$ in SFs, and $F$, $B$ in FB
 To obtain a full zero-shot RL algorithm, SFs must also specify the basic features $\varphi$; the authors therefore propose ten possible choices based on existing or new representations for RL
 Learning the SF $\psi^T z$ instead of $\psi$
 The successor features $\psi$ satisfy the Bellman equation $\psi^\pi = P\varphi + \gamma P_\pi \psi^\pi$, i.e. the collection of ordinary Bellman equations for each component of $\varphi$, as in BBQ-networks
 Therefore $\psi(s, a, z)$ for each $z$ could be trained by minimizing the vector-valued Bellman residual (with $z$ sampled at random during training)
• $\left\| \psi(s_t, a_t, z) - \varphi(s_{t+1}) - \gamma\, \psi(s_{t+1}, \pi_z(s_{t+1}), z) \right\|^2$
 Instead of this vector-valued Bellman residual, they propose the novel loss (a code sketch follows below)
• $\mathcal{L}(\psi) := \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho}\left[ \left( \psi(s_t, a_t, z)^T z - \varphi(s_{t+1})^T z - \gamma\, \psi(s_{t+1}, \pi_z(s_{t+1}), z)^T z \right)^2 \right]$ for each $z$
 This trains $\psi(\cdot, z)^T z$ as the Q-function of the reward $\varphi^T z$, the only case needed, whereas training the full vector $\psi(\cdot, z)$ amounts to training the Q-functions of policy $\pi_z$ for all rewards $\varphi^T z'$ with $z' \in \mathbb{R}^d$, including $z' \neq z$.
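A minimal PyTorch-style sketch of this projected SF loss is shown below, assuming hypothetical networks `psi_net(s, a, z)` and `psi_target` returning $d$-dimensional vectors, a fixed feature map `phi`, and a greedy policy `pi_z`; none of these names come from the authors' implementation.

```python
import torch

def projected_sf_loss(psi_net, psi_target, phi, pi_z, batch, z, gamma=0.98):
    """L(psi) = E[(psi(s,a,z)^T z - phi(s')^T z - gamma * psi(s', pi_z(s'), z)^T z)^2].

    batch: tensors (s, a, s_next) of shape (B, ...); z: tensor of shape (B, d),
    sampled at random per batch from a chosen sampling scheme.
    """
    s, a, s_next = batch
    q = (psi_net(s, a, z) * z).sum(-1)                        # psi(s,a,z)^T z
    with torch.no_grad():
        a_next = pi_z(s_next, z)                              # argmax_a psi(s',a,z)^T z
        target = (phi(s_next) * z).sum(-1) \
               + gamma * (psi_target(s_next, a_next, z) * z).sum(-1)
    return ((q - target) ** 2).mean()
```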
Algorithms for SF and FB Representations (II)
 Learning the FB representations: the FB training loss
 The successor measure $M^\pi$ satisfies a Bellman-like equation, $M^\pi = P + \gamma P_\pi M^\pi$, as matrices in the finite case and as measures in the general case [Blier et al.]
• For any policy $\pi_z$, the Q-function for a reward $r$ can be written in matrix form as $Q_r^{\pi_z} = M^{\pi_z} r$
– This equals $F_z^T B\, \mathrm{diag}(\rho)\, r$; thus, setting $z_r := B\, \mathrm{diag}(\rho)\, r = \mathbb{E}_{s \sim \rho}[B(s)\, r(s)]$, the Q-function is obtained as $Q_r^{\pi_z} = F_z^T z_r$ for any $z \in \mathbb{R}^d$
 FB can be learned by iteratively minimizing the Bellman residual of the parametric model $M = F^T B \rho$
• Using a suitable norm $\| \cdot \|_\rho$ for the Bellman residual leads to a loss expressed as an expectation over the dataset (a code sketch follows below):
• $\mathcal{L}(F, B) := \left\| F_z^T B \rho - (P + \gamma P_{\pi_z} F_z^T B \rho) \right\|_\rho^2 = \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho,\, s' \sim \rho}\left[ \left( F(s_t, a_t, z)^T B(s') - \gamma\, F(s_{t+1}, \pi_z(s_{t+1}), z)^T B(s') \right)^2 \right] - 2\, \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho}\left[ F(s_t, a_t, z)^T B(s_{t+1}) \right] + \mathrm{const}$
– $z$ is sampled at random during training
 The authors note that the last term involves $B(s_{t+1})$ rather than $B(s_t)$ because they use $s_{t+1}$ instead of $s_t$ in the definition of the successor measure
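Below is a hedged PyTorch-style sketch of the displayed FB loss, assuming hypothetical networks `F_net(s, a, z)` and `B_net(s)` with target copies `F_tgt` / `B_tgt`, a greedy policy `pi_z`, and in-batch reshuffling to approximate $s' \sim \rho$; these choices are illustrative, not the paper's exact implementation.

```python
import torch

def fb_loss(F_net, B_net, F_tgt, B_tgt, pi_z, batch, z, gamma=0.98):
    """Empirical version of L(F, B): squared Bellman residual on independent s'
    minus 2 * on-transition term; the rho-dependent constant is dropped."""
    s, a, s_next = batch                                   # (B, ...) tensors from the buffer
    s_rand = s_next[torch.randperm(s_next.shape[0])]       # rough stand-in for s' ~ rho

    Fsa = F_net(s, a, z)                                   # (B, d)
    M = Fsa @ B_net(s_rand).T                              # F(s_t,a_t,z)^T B(s')  -> (B, B)
    with torch.no_grad():
        a_next = pi_z(s_next, z)
        M_next = F_tgt(s_next, a_next, z) @ B_tgt(s_rand).T
    resid = ((M - gamma * M_next) ** 2).mean()             # first expectation
    on_diag = (Fsa * B_net(s_next)).sum(-1).mean()         # F(s_t,a_t,z)^T B(s_{t+1})
    return resid - 2.0 * on_diag
```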
Algorithms for SF and FB Representations (III)
 Learning basic features $\varphi$ for SF
 Any representation learning method could be used to supply $\varphi$
• The authors suggest ten basic features and describe the precise learning objective for each
 1. Random Feature (Rand)
• Use a non-trainable, randomly initialized network as the features
 2. Autoencoder (AEnc)
• Learn a decoder $f: \mathbb{R}^d \to S$ to recover the state from its representation $\varphi$
– $\min_{f, \varphi} \mathbb{E}_{s \sim \mathcal{D}}\left[ \| f(\varphi(s)) - s \|^2 \right]$
 3. Inverse Curiosity Module (ICM)
• Aims at extracting the controllable aspects of the environment
• Trains an inverse dynamics model $g: \mathbb{R}^d \times \mathbb{R}^d \to A$ to predict the action used for a transition between two consecutive states (a code sketch follows below)
– $\min_{g, \varphi} \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\left[ \| g(\varphi(s_t), \varphi(s_{t+1})) - a_t \|^2 \right]$
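As referenced in item 3, here is a minimal PyTorch sketch of the ICM-style inverse-dynamics objective for continuous actions; network sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InverseDynamicsFeatures(nn.Module):
    """phi: S -> R^d plus g: (phi(s_t), phi(s_{t+1})) -> predicted action."""

    def __init__(self, state_dim: int, action_dim: int, d: int = 64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, d))
        self.g = nn.Sequential(nn.Linear(2 * d, 256), nn.ReLU(), nn.Linear(256, action_dim))

    def loss(self, s_t, a_t, s_next):
        z_t, z_next = self.phi(s_t), self.phi(s_next)
        a_pred = self.g(torch.cat([z_t, z_next], dim=-1))
        return ((a_pred - a_t) ** 2).sum(-1).mean()   # || g(phi(s_t), phi(s_{t+1})) - a_t ||^2
```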
Algorithms for SF and FB Representations (IV)
 Learning basic features $\varphi$ for SF
 4. Transition model (Trans)
• Learn a one-step dynamics model $f: \mathbb{R}^d \times A \to S$ that predicts the next state from the current state representation
– $\min_{f, \varphi} \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\left[ \| f(\varphi(s_t), a_t) - s_{t+1} \|^2 \right]$
 5. Latent transition model (Latent)
• Learn a latent dynamics model that predicts the representation of the next state rather than the next state itself
– $\min_{f, \varphi} \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\left[ \| f(\varphi(s_t), a_t) - \varphi(s_{t+1}) \|^2 \right]$
 6. Laplacian Eigenfunction (Lap)
• Wu et al. consider the symmetrized MDP graph Laplacian induced by an exploratory policy $\pi$, defined as $\mathcal{L} = I - \frac{1}{2}\left( P_\pi\, \mathrm{diag}(\rho)^{-1} + \mathrm{diag}(\rho)^{-1} P_\pi^T \right)$
• They propose to learn the eigenfunctions of $\mathcal{L}$ via the spectral graph drawing objective (a code sketch follows below):
– $\min_{\varphi} \mathbb{E}_{(s_t, s_{t+1}) \sim \mathcal{D}}\left[ \| \varphi(s_t) - \varphi(s_{t+1}) \|^2 \right] + \lambda\, \mathbb{E}_{s \sim \mathcal{D},\, s' \sim \mathcal{D}}\left[ \left( \varphi(s)^T \varphi(s') \right)^2 - \| \varphi(s) \|_2^2 - \| \varphi(s') \|_2^2 \right]$
– where the second term is an orthonormality regularizer ensuring that $\mathbb{E}_{s \sim \rho}\left[ \varphi(s) \varphi(s)^T \right] \approx I$
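As referenced in item 6, a compact PyTorch sketch of the spectral graph drawing objective follows; the feature network `phi`, the batch variables, and the weight `lam` are illustrative assumptions.

```python
import torch

def laplacian_feature_loss(phi, s_t, s_next, s_rand, lam: float = 1.0):
    """Spectral graph-drawing loss: smoothness along observed transitions plus an
    orthonormality penalty pushing E[phi phi^T] toward the identity."""
    z_t, z_next = phi(s_t), phi(s_next)
    smooth = ((z_t - z_next) ** 2).sum(-1).mean()          # ||phi(s_t) - phi(s_{t+1})||^2

    z = phi(s_rand)
    z_p = phi(s_rand[torch.randperm(s_rand.shape[0])])     # independent s' ~ D (reshuffled)
    ortho = ((z * z_p).sum(-1) ** 2                        # (phi(s)^T phi(s'))^2
             - (z ** 2).sum(-1) - (z_p ** 2).sum(-1)).mean()
    return smooth + lam * ortho
```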
Algorithms for SF and FB Representations (V)
 Learning basic features $\varphi$ for SF
 7. Low-Rank Approximation of P
• Learn features by estimating a low-rank model of the transition probability densities: $P(ds' \mid s, a) \approx \mathcal{X}(s, a)^T \mu(s')\, \rho(ds')$
• The corresponding loss on $\mathcal{X}^T \mu - P/\rho$ can be expressed as
– $\min_{\mathcal{X}, \mu} \mathbb{E}_{(s_t, a_t) \sim \rho,\, s' \sim \rho}\left[ \left( \mathcal{X}(s_t, a_t)^T \mu(s') - \frac{P(ds' \mid s_t, a_t)}{\rho(ds')} \right)^2 \right] = \mathbb{E}_{(s_t, a_t) \sim \rho,\, s' \sim \rho}\left[ \left( \mathcal{X}(s_t, a_t)^T \mu(s') \right)^2 \right] - 2\, \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho}\left[ \mathcal{X}(s_t, a_t)^T \mu(s_{t+1}) \right] + C$
– This loss is a special case of the FB loss obtained by setting $\gamma = 0$ and omitting $z$
 8. Contrastive Learning
• Learn representations by pushing positive pairs closer together while keeping negative pairs apart (a code sketch follows below)
– Here, two states are considered similar if they lie close to each other on the same trajectory
• They propose a SimCLR-like objective:
– $\min_{\mathcal{X}, \varphi} -\mathbb{E}_{k \sim \mathrm{Geom}(1 - \gamma_{CL}),\, (s_t, s_{t+k}) \sim \mathcal{D}}\left[ \log \frac{\exp\left( \mathrm{cosine}(\mathcal{X}(s_t), \varphi(s_{t+k})) \right)}{\mathbb{E}_{s' \sim \mathcal{D}}\left[ \exp\left( \mathrm{cosine}(\mathcal{X}(s_t), \varphi(s')) \right) \right]} \right], \quad \mathrm{cosine}(u, v) = \frac{u^T v}{\| u \|_2 \| v \|_2}$
 9. Low-Rank Approximation of SR
• Learn features by estimating a low-rank model of the successor measure of the exploration policy
– $\min_{\mathcal{X}, \varphi} \mathbb{E}_{(s_t, s_{t+1}) \sim \mathcal{D},\, s' \sim \mathcal{D}}\left[ \left( \mathcal{X}(s_t)^T \varphi(s') - \gamma\, \bar{\mathcal{X}}(s_{t+1})^T \bar{\varphi}(s') \right)^2 \right] - 2\, \mathbb{E}_{(s_t, s_{t+1}) \sim \mathcal{D}}\left[ \mathcal{X}(s_t)^T \varphi(s_{t+1}) \right]$, where $\bar{\mathcal{X}}$ and $\bar{\varphi}$ are target versions of $\mathcal{X}$ and $\varphi$
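As referenced in item 8, here is a hedged PyTorch sketch of a SimCLR-style objective over trajectory pairs, using in-batch negatives to approximate $\mathbb{E}_{s' \sim \mathcal{D}}$; the networks `x_net` / `phi_net` and the in-batch-negative approximation are illustrative assumptions, and the geometric sampling of the offset $k$ is assumed to happen when the batch is built.

```python
import torch
import torch.nn.functional as F

def contrastive_feature_loss(x_net, phi_net, s_t, s_tk, temperature: float = 1.0):
    """SimCLR-like loss: s_t and s_{t+k} from the same trajectory form a positive pair
    (k ~ Geom(1 - gamma_CL) at sampling time); other states in the batch act as negatives."""
    u = F.normalize(x_net(s_t), dim=-1)        # X(s_t), unit norm so dot product = cosine
    v = F.normalize(phi_net(s_tk), dim=-1)     # phi(s_{t+k})
    logits = u @ v.T / temperature             # (B, B) cosine similarities
    labels = torch.arange(u.shape[0], device=u.device)   # positives on the diagonal
    # In-batch softmax over the rows approximates the expectation in the denominator.
    return F.cross_entropy(logits, labels)
```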
Experiments
Environments
 All methods were tested on the DeepMind Control Suite (ExORL benchmark)
 Tasks and environments
• Point-mass Maze
– State / action: 4 / 2 dim vectors
• Walker: a planar walker
– State / action: 24 / 6 dim vectors
• Cheetah: a running planar biped
– State / action: 17 / 6 dim vectors
• Quadruped: a four-legged ant navigating in 3D space
– State / action: 78 / 12 dim vectors
 Replay buffers
• RND: 10M training transitions collected with RND
• APS: 10M training transitions collected with APS
• Proto: 10M training transitions collected with ProtoRL
(Figures: Point-mass Maze / Walker, Cheetah / Quadruped)
Results – Zero-Shot Performance of Proposed Methods
 Comparison of 11 methods (FB and 10 SF-based models) against offline/online TD3
 Performance of each method for each task in each environment, averaged over the three buffers and 10 seeds
 Control group
• Online TD3: with task reward and free environment interactions
• Offline TD3: with task reward, trained from the replay buffer
 FB and Lap outperform the other methods
• FB and Lap reach about 83% and 76% of supervised offline TD3 performance, respectively
(Figures: average scores over tasks for each environment; average plots of zero-shot results over tasks, environments, buffers, and seeds)
Thank you!
Q&A

More Related Content

PPTX
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptx
PDF
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
PPTX
Koh_Liang_ICML2017
PDF
Stochastic optimal control &amp; rl
PDF
Metrics for generativemodels
PPTX
Learning group em - 20171025 - copy
PDF
Generalized Laplace - Mellin Integral Transformation
PDF
geg 113 adds.pdf undergraduate presentation
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptx
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
Koh_Liang_ICML2017
Stochastic optimal control &amp; rl
Metrics for generativemodels
Learning group em - 20171025 - copy
Generalized Laplace - Mellin Integral Transformation
geg 113 adds.pdf undergraduate presentation

Similar to Does Zero-Shot RL Exist (20)

PDF
Distributional RL via Moment Matching
PPTX
Presentation
PDF
Specific topics in optimisation
PDF
NIPS KANSAI Reading Group #5: State Aware Imitation Learning
PPTX
Gradient Boosting
PPTX
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
PDF
Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)
PPTX
Learning a nonlinear embedding by preserving class neibourhood structure 최종
PDF
Basic calculus (i)
PPTX
Linear regression, costs & gradient descent
PPTX
Restricted boltzmann machine
PPTX
Machine learning ppt and presentation code
PDF
Optimum Engineering Design - Day 2b. Classical Optimization methods
PDF
Support vector machines
PDF
Estimation Theory Class (Summary and Revision)
PDF
E0561719
PPTX
Review of Seiberg Witten duality.pptx
PDF
7_logistic-regression presentation sur la regression logistique.pdf
PPTX
Inverse Function.pptx
PPTX
MAT-314 Relations and Functions
Distributional RL via Moment Matching
Presentation
Specific topics in optimisation
NIPS KANSAI Reading Group #5: State Aware Imitation Learning
Gradient Boosting
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)
Learning a nonlinear embedding by preserving class neibourhood structure 최종
Basic calculus (i)
Linear regression, costs & gradient descent
Restricted boltzmann machine
Machine learning ppt and presentation code
Optimum Engineering Design - Day 2b. Classical Optimization methods
Support vector machines
Estimation Theory Class (Summary and Revision)
E0561719
Review of Seiberg Witten duality.pptx
7_logistic-regression presentation sur la regression logistique.pdf
Inverse Function.pptx
MAT-314 Relations and Functions
Ad

Recently uploaded (20)

PPTX
Phase1_final PPTuwhefoegfohwfoiehfoegg.pptx
PPTX
Managing Community Partner Relationships
PDF
Capcut Pro Crack For PC Latest Version {Fully Unlocked 2025}
PPTX
CYBER SECURITY the Next Warefare Tactics
PPTX
modul_python (1).pptx for professional and student
PPTX
SET 1 Compulsory MNH machine learning intro
PPTX
sac 451hinhgsgshssjsjsjheegdggeegegdggddgeg.pptx
PPT
Image processing and pattern recognition 2.ppt
PPTX
Business_Capability_Map_Collection__pptx
DOCX
Factor Analysis Word Document Presentation
PPT
DU, AIS, Big Data and Data Analytics.ppt
PPTX
New ISO 27001_2022 standard and the changes
PDF
REAL ILLUMINATI AGENT IN KAMPALA UGANDA CALL ON+256765750853/0705037305
PDF
Global Data and Analytics Market Outlook Report
PPTX
IMPACT OF LANDSLIDE.....................
PPTX
Introduction to Inferential Statistics.pptx
PPTX
Lesson-01intheselfoflifeofthekennyrogersoftheunderstandoftheunderstanded
PPTX
A Complete Guide to Streamlining Business Processes
PDF
Optimise Shopper Experiences with a Strong Data Estate.pdf
PPTX
Steganography Project Steganography Project .pptx
Phase1_final PPTuwhefoegfohwfoiehfoegg.pptx
Managing Community Partner Relationships
Capcut Pro Crack For PC Latest Version {Fully Unlocked 2025}
CYBER SECURITY the Next Warefare Tactics
modul_python (1).pptx for professional and student
SET 1 Compulsory MNH machine learning intro
sac 451hinhgsgshssjsjsjheegdggeegegdggddgeg.pptx
Image processing and pattern recognition 2.ppt
Business_Capability_Map_Collection__pptx
Factor Analysis Word Document Presentation
DU, AIS, Big Data and Data Analytics.ppt
New ISO 27001_2022 standard and the changes
REAL ILLUMINATI AGENT IN KAMPALA UGANDA CALL ON+256765750853/0705037305
Global Data and Analytics Market Outlook Report
IMPACT OF LANDSLIDE.....................
Introduction to Inferential Statistics.pptx
Lesson-01intheselfoflifeofthekennyrogersoftheunderstandoftheunderstanded
A Complete Guide to Streamlining Business Processes
Optimise Shopper Experiences with a Strong Data Estate.pdf
Steganography Project Steganography Project .pptx
Ad

Does Zero-Shot RL Exist

  • 1. 1 Does Zero-Shot Reinforcement Learning Exist? 백승언, 김현성, 이도현 , 정강민 11 June, 2023
  • 2. 2  Introduction  Current Success of AI  Meta Reinforcement Learning  Does Zero-Shot Reinforcement Learning Exist?  Backgrounds  Previous Strategies for Zero-Shot RL  Algorithms for SF and FB Representations  Experiments  Environments  Results Contents
  • 4. 4  Problem setting of reinforcement learning, meta-learning  Reinforcement learning • Given certain MDP, lean a policy 𝜋 that maximize the expected discounted return 𝔼𝜋,𝑝0 Σ𝑡=0 ∞ 𝛾𝑡−1 𝑟𝑡 𝑠𝑡, 𝑎𝑡, 𝑠𝑡+1  Meta-learning • Given data from 𝒯 1, … , 𝒯 N, quickly solve new task 𝒯 𝑡𝑒𝑠𝑡  Problem setting of Meta-Reinforcement Learning(Meta-RL)  Setting 1: meta-learning with diverse goal(goal as a task) • 𝒯 𝑖 ≜ {𝒮, 𝒜, 𝑝 𝑠0 , 𝑝 𝑠′ 𝑠, 𝑎 , 𝑟 𝑠, 𝑎, 𝑔 , 𝑔𝑖}  Setting 2: meta-learning with RL tasks(MDP as a task) • 𝒯 𝑖 ≜ 𝒮𝑖, 𝒜𝑖, 𝑝𝑖 𝑠0 , 𝑝𝑖 𝑠′ 𝑠, 𝑎 , 𝑟𝑖(𝑠, 𝑎) Meta Reinforcement Learning Meta RL problem statement in CS-330(Finn)
  • 6. 6  Notation  Reward-free MDP • ℳ = (𝑆, 𝐴, 𝑃, 𝛾) is a reward-free Markov Decision Process(MDP) with state space 𝑆, action space 𝐴, transition probability 𝑃(𝑠′ |𝑠, 𝑎) from state 𝑠 to 𝑠′ given action 𝑎 and discount factor 0 < 𝛾 < 1  Problem statement  Goal of zero-shot RL is to compute a compact representation ℰ of the env by observing samples of reward-free transitions (𝑠𝑡, 𝑎𝑡, 𝑠𝑡+1) in this env  Once a reward function is specified later, the agent must use ℰ to immediately produce a good policy, via only elementary computation without any further planning or learning  Reward functions may be specified at test time either as a relatively small set of reward samples (𝑠𝑖, 𝑟𝑖), or as an explicit function 𝑠 → 𝑟(𝑠) Backgrounds (I) – Defining Zero-Shot RL
  • 7. 7  Successor representations (SR)  For a finite MDP, the successor representation 𝑀𝜋 (𝑠0, 𝑎0) of a state-action pair(𝑠0, 𝑎0) under a policy 𝜋, is defined as the discounted sum of future occurrences of each state • 𝑀𝜋 𝑠0, 𝑎0, 𝑠 ≔ 𝔼 Σ𝑡≥0𝛾𝑡 𝕀 𝑠𝑡+1 = 𝑠 | 𝑠0, 𝑎0 , 𝜋 , ∀𝑠 ∈ 𝑆  In matrix form, SRs can be written as 𝑀𝜋 = 𝑃Σ𝑡≥0𝛾𝑡𝑃𝜋 𝑡 = 𝑃 𝐼 − 𝛾𝑃𝜋 −1, 𝑃𝜋 𝑖𝑠 𝑡ℎ𝑒 𝑠𝑡𝑎𝑡𝑒 𝑡𝑟𝑎𝑛𝑠𝑖𝑡𝑖𝑜𝑛 𝑝𝑟𝑜𝑏 • 𝑀𝜋 satisfies the matrix Bellman Equation 𝑀𝜋 = 𝑃 + 𝛾𝑃𝜋𝑀𝜋 , the Q-function can be expressed as 𝑄𝑟 𝜋 = 𝑀𝜋 𝑟  Successor features (SFs)  Successor features extend SR to continuous MDPs by first assuming we are given a basic feature map 𝜑: 𝑆 → ℝ𝑑 that embeds states into 𝑑-dimensional space, and defining the expected discounted sum of future state features • 𝜓𝜋 𝑠0, 𝑎0 ≔ 𝔼 Σ𝑡≥0𝛾𝑡 𝜑 𝑠𝑡+1 | 𝑠0, 𝑎0, 𝜋  Successor measures (SMs)  Successor measures extend SRs to continuous spaces by treating the distribution of future visited states as a measure 𝑀𝜋 over the state space 𝑆 • 𝑀𝜋 𝑠0, 𝑎0 𝑋 ≔ Σ𝑡≥0𝛾𝑡 Pr 𝑠𝑡+1 ∈ 𝑋 𝑠0, 𝑎0, 𝜋), ∀𝑋 ⊂ 𝑆, 𝜓𝜋 𝑠0, 𝑎0 = 𝑠′ . 𝑀𝜋 𝑠0, 𝑎0, 𝑑𝑠′ 𝜑(𝑠′ ) Backgrounds (II) – Important Concept
  • 8. 8  Zero-shot RL from successor features  Given a basic feature map 𝜑: 𝑆 → ℝ𝑑 to be learned via another criterion, universal SFs learn the successor features of a particular family of policies 𝜋𝑧 for 𝑧 ∈ ℝ𝑑 , • 𝜓 𝑠0, 𝑎0, 𝑧 = 𝔼 Σ𝑡≥0𝛾𝑡 𝜑 𝑠𝑡+1 | 𝑠0, 𝑎0 , 𝜋𝑧 , 𝜋𝑧 𝑠 ≔ argmax𝑎𝜓 𝑠, 𝑎, 𝑧 𝑇 𝑧  Once a reward function 𝑟 is revealed, few reward samples of explicit knowledge of the function 𝑟 are used to perform a linear regression of 𝑟 onto the features 𝜑 • Namely, the 𝑧𝑟 ≔ argmin𝑧𝔼𝑠~𝜌 𝑟 𝑠 − 𝜑 𝑠 𝑇 𝑧 2 = 𝔼𝜌 𝜑𝜑𝑇 −1 𝔼𝜌 𝜑𝑟 => then the policy 𝜋𝑧 is returned • This policy is guaranteed to be optimal for all rewards in the linear span of the features 𝜑 – If 𝑟 𝑠 = 𝜑 𝑠 𝑇 𝑤, ∀𝑠 ∈ 𝑆, then 𝑧𝑟 = 𝑤, and 𝜋𝑧𝑟 is the optimal policy for reward 𝑟  Zero-shot RL from Forward-Backward representation (FB)  Forward-backward representation look for 𝐹: 𝑆 × 𝐴 × ℝ𝑑 → ℝ𝑑 and 𝐵: 𝑆 → ℝ𝑑 such that the long-term transition prob 𝑀𝜋𝑧 decompose as • 𝑀𝜋𝑧 𝑠0, 𝑎0, 𝑑𝑠′ ≈ 𝑭 𝒔𝟎, 𝒂𝟎, 𝒛 𝑻 𝑩 𝒔′ 𝜌 𝑑𝑠′ , 𝜋𝑧 𝑠 ≔ argmax𝑎𝐹 𝑠, 𝑎, 𝑧 𝑇 𝑧 • In a finite space, 𝑀𝜋𝑧 could be decomposed as 𝑀𝜋𝑧 = 𝐹 𝑧 𝑇 𝐵𝑑𝑖𝑎𝑔(𝜌)  Once a reward function 𝑟 is revealed, 𝑧𝑟 ≔ 𝔼𝑠~𝜌 𝑟 𝑠 𝐵(𝑠) from a few reward samples or from explicit knowledge of the function 𝑟 are estimated(e.g. 𝑧𝑟 = 𝐵 𝑠 𝑡𝑜 𝑟𝑒𝑎𝑐ℎ 𝑠) • Then the policy 𝜋𝑧𝑟 is returned. • Any reward function 𝑟, the policy 𝜋𝑧𝑟 is optimal for 𝑟, with optimal Q-function 𝑄𝑟 ⋆ = 𝐹 𝑠, 𝑎, 𝑧𝑟 𝑇 𝑧𝑟 Previous Strategies for Zero-Shot RL
  • 9. 9  The authors suggested the novel losses used to train 𝝍 in SFs, and F, B in FB  To obtain a full zero-shot RL algorithm, SFs must specify the basic feature 𝜑, thus they proposed the ten p ossible choices based on existing or new representations for RL  Learning the SF 𝝍𝐓𝐳 instead of 𝝍  The successor feature 𝜓 satisfy the Bellman equation 𝜓𝜋 = 𝑃𝜑 + 𝛾𝑃𝜋𝜓𝜋, the collection of ordinary Bellman Eq for each component of 𝜑 in BBQ-network  Therefore 𝜓 𝑠, 𝑎, 𝑧 for each 𝑧 could be trained by minimizing the Bellman residuals as follows, • 𝜓 𝑠𝑡, 𝑎𝑡, 𝑧 − 𝜑 𝑠𝑡+1 − 𝛾𝜓(𝑠𝑡+1, 𝜋𝑧 𝑠𝑡+1 , 𝑧 2 , 𝑧 𝑖𝑠 𝑟𝑎𝑛𝑑𝑜𝑚 𝑠𝑎𝑚𝑝𝑙𝑒𝑑 𝑏𝑦 𝑙𝑜𝑔𝑖𝑐  They proposed the novel loss instead of the vector-valued Bellman residual above, • ℒ 𝜓 ≔ 𝔼 𝑠𝑡,𝑎𝑡,𝑠𝑡+1 ~𝜌 𝜓(𝑠𝑡, 𝑎𝑡, 𝑧 𝑇 𝑧 − 𝜑 𝑠𝑡+1 𝑇 𝑧 − 𝛾𝜓 𝑠𝑡+1, 𝜋𝑧 𝑠𝑡+1 , 𝑧 𝑇 𝑧 2 for each 𝑧  This trains 𝜓 ⋅, 𝑧 𝑇 𝑧 as the Q-function of reward 𝜑𝑇 𝑧, the only case needed, while training the full vector 𝜓(⋅, 𝑧) amounts to training the Q-functions of each policy 𝜋𝑧 for all rewards 𝜑𝑇𝑧′ for all 𝑧′ ∈ ℝ𝑑 including 𝑧′ ≠ 𝑧. Algorithms for SF and FB Representations (I)
  • 10. 10  Learning the FB representations: the FB training loss  The successor measure 𝑀𝜋 satisfies a Bellman-like equation, 𝑀𝜋 = 𝑃 + 𝛾𝑃𝜋𝑀𝜋 , as matrices in the finite ca se and as measures in the general case in [Blier et al] • For any policy 𝜋𝑧, the Q-function for the reward 𝑟 can be written 𝑄𝑟 𝜋𝑧 = 𝑀𝜋𝑧𝑟 in matrix form. – This is equal to 𝐹 𝑧 𝑇 𝐵𝑑𝑖𝑎𝑔 𝜌 𝑟; thus assume that the 𝑧𝑟 ≔ 𝐵𝑑𝑖𝑎𝑔 𝜌 𝑟 = 𝔼𝑠~𝜌 𝐵 𝑠 𝑟(𝑠) , the Q-function is obtained as 𝑄𝑟 𝜋𝑧 = 𝐹 𝑧 𝑇 𝑧𝑟 for any 𝑧 ∈ ℝ𝑑  FB could be learned by iteratively minimizing the Bellman residual on the parametric model 𝑀 = 𝐹𝑇𝐵𝜌. • Using a suitable norm ⋅ 𝜌 for the bellman residual leads to a loss expressed as expectation from the dataset • ℒ 𝐹, 𝐵 ≔ 𝐹𝑧 𝑇 𝐵𝜌 − (𝑃 + 𝛾𝑃𝜋𝑧 𝐹𝑧 𝑇 𝐵𝜌) = 𝔼 𝑠𝑡,𝑎𝑡,𝑠𝑡+1 ~𝜌,𝑠′~𝜌 𝐹 𝑠𝑡, 𝑎𝑡, 𝑧 𝑇 𝐵 𝑠′ − 𝛾𝐹 𝑠𝑡+1, 𝜋𝑧 𝑠𝑡+1 , 𝑧 𝑇 𝐵 𝑠′ 2 − 2𝔼 𝑠𝑡,𝑎𝑡,𝑠𝑡+1 ~𝜌 𝐹 𝑠𝑡, 𝑎𝑡, 𝑧 𝑇 𝐵(𝑠𝑡+1) + 𝐶𝑜𝑛𝑠𝑡 – 𝑧 is random sampled by logic  The authors proposed that the last term involves 𝐵(𝑠𝑡+1) instead of 𝐵(𝑠𝑡), because they used 𝑠𝑡+1 instead of 𝑠𝑡 for the successor measure Algorithms for SF and FB representations (II)
  • 11. 11  Learning basic features 𝝋 for SF  Any representation learning method could be used to supply 𝜑 • The authors suggested the 10 basic features and described the precise learning objective for each  1. Random Feature (Rand) • Using a non-trainable randomly initialized network as features  2. Autoencoder (AEnc) • Learning a decoder 𝑓: ℝ𝑑 → 𝑆 to recover the state from its representation 𝜑 – min 𝑓,𝜑 𝔼𝑠~𝒟 𝑓 𝜑(𝑠) − 𝑠 2  3. Inverse Curiosity Module (ICM) • Aiming at extracting the controllable aspects of the environment • Training an inverse dynamics model 𝑔: ℝ𝑑 × ℝ𝑑 → 𝐴 to predict the action used for a transition between two consecu tive states – min 𝑔,𝜑 𝔼 𝑠𝑡,𝑎𝑡,𝑠𝑡+1 ~𝒟 𝑔 𝜑 𝑠𝑡 , 𝜑 𝑠𝑡+1 − 𝑎𝑡 2 Algorithms for SF and FB representations (III)
  • 12. 12  Learning basic features 𝝋 for SF  4. Transition model (Trans) • Learning the one-step dynamics 𝑓: ℝ𝑑 × 𝐴 → 𝑆 that predicts the next state from the current state representation – min 𝑓,𝜑 𝔼 𝑠𝑡,𝑎𝑡,𝑠𝑡+1 ~𝒟 𝑓 𝜑 𝑠𝑡 , 𝑎𝑡 − 𝑠𝑡+1 2  5. Latent transition model (Latent) • Learning the latent dynamics model but instead of predicting the next state, it predicts its representation – min 𝑓,𝜑 𝔼 𝑠𝑡,𝑎𝑡,𝑠𝑡+1 ~𝒟 𝑓 𝜑 𝑠𝑡 , 𝑎𝑡 − 𝜑 𝑠𝑡+1 2  6. Laplacian Eigenfunction (Lap) • Wu et al consider the symmetrized MDP graph Laplacian induced by an exploratory policy 𝜋, defined as ℒ = 𝐼 − 1 2 𝑃𝜋𝑑𝑖𝑎𝑔 𝜌 −1 + 𝑑𝑖𝑎𝑔 𝜌 −1 𝑃𝜋 𝑇 • They propose to learn the eigenfunctions of ℒ via the spectral graph drawing objective as follows: – min 𝜑 𝔼 𝑠𝑡,𝑠𝑡+1 ~𝒟 𝜑 𝑠𝑡 − 𝜑(𝑠𝑡+1) 2 + 𝜆𝔼𝑠~𝒟,𝑠′~𝒟 𝜑 𝑠 𝑇 𝜑 𝑠′ 2 − 𝜑(𝑠) 2 2 − 𝜑(𝑠′ ) 2 2 – Where the second term is an orthonormality regularization to ensure that 𝔼𝑠~𝜌 𝜑 𝑠 𝜑 𝑠 𝑇 ≈ 𝐼 Algorithms for SF and FB representations (IV)
  • 13. 13  Learning basic features 𝝋 for SF  7. Low-Rank Approximation of P • Learning the features by estimating a low-rank model of the transition probability densities: 𝑃 𝑑𝑠′ 𝑠, 𝑎 ≈ 𝒳 𝑠, 𝑎 𝑇 𝜇 𝑠′ 𝜌(𝑑𝑠′ ). • The corresponding loss on 𝒳𝑇 𝜇 − 𝑃/𝜌 could be expressed as – min 𝒳,𝜇 𝔼 𝑠𝑡,𝑎𝑡 ~𝜌,𝑠′`~𝜌 𝒳 𝑠𝑡, 𝑎𝑡 𝑇 𝜇 𝑠′ − 𝑃 𝑑𝑠′|𝑠𝑡,𝑎𝑡 𝜌(𝑑𝑠′) 2 = 𝔼 𝑠𝑡,𝑎𝑡 ~𝜌 𝑠′~𝜌 𝒳 𝑠𝑡, 𝑎𝑡 𝑇 𝜇 𝑠′ 2 − 2𝔼 𝑠𝑡,𝑎𝑡,𝑠𝑡+1 ~𝜌 𝒳 𝑠𝑡, 𝑎𝑡 𝑇 𝜇(𝑠𝑡+1) + 𝐶 – This loss is also a special case of the FB loss by setting 𝛾 = 0, 𝑜𝑚𝑖𝑖𝑡𝑖𝑛𝑔 𝑧  8. Contrastive Learning • Learning the representations by pushing positive pairs closer together while keeping negative pairs part – Here, two states are considered similar if they lie close on the same trajectory • They proposed SimCLR-like objective as – min 𝒳,𝜑 −𝔼𝑘~Geom 1−𝛾𝐶𝐿 𝑠𝑡,𝑠𝑡+𝑘 ~𝒟 log exp cosine 𝒳 𝑠𝑡 ,𝜑 𝑠𝑡+𝑘 𝔼𝑠′~𝒟 exp cosine 𝒳 𝑠𝑡 ,𝜑 𝑠′ , cosine 𝑢, 𝑣 = 𝑢𝑇𝑣 𝑢 2 𝑣 2  9. Low-Rank Approximation of SR • Learning the features by estimating a low-rank model of the successor measure for exploration policy – min 𝒳,𝜑 𝔼 𝑠𝑡,𝑠𝑡+1 ~𝒟 𝑠′~𝒟 𝒳 𝑠𝑡 𝑇 𝜑 𝑠′ − 𝛾𝒳 𝑠𝑡+1 𝜑 𝑠′ 2 − 2𝔼 𝑠𝑡,𝑠𝑡+1 ~𝒟 𝒳 𝑠𝑡 𝑇 𝜑 𝑠𝑡+1 , 𝑤ℎ𝑒𝑟𝑒 𝒳 𝑎𝑛𝑑 𝜑 𝑎𝑟𝑒 𝑡𝑎𝑟𝑔𝑒𝑡 𝑣𝑒𝑟𝑠𝑖𝑜𝑛 𝑜𝑓 𝒳 𝑎𝑛𝑑 𝜑 Algorithms for SF and FB representations (V)
  • 15. 15 Environments  All the methods were tested in DeepMind Control Suite(ExORL Benchmarks)  Tasks and environments • Point-mass Maze – State, action: 4/2 dim vectors • Walker: a planner walker – State, action: 24/6 dim vectors • Cheetah: a running planar biped – State, action: 17/6 dim vectors • Quadruped: a four-leg ant navigating in 3D space – State, action: 78/12 dim vectors  Replay buffers • RND: 10M training transition with RND • APS: 10M training transition with APS • Proto: 10M training transition with ProtoRL Point-mass Maze / Walker Cheetah / Quadruped
  • 16. 16  Comparison with 11 methods(FB and 10 SF-based models) and offline/online TD3  The performance of each method for each task in each env, averaged over the three buffers and 10 seeds  Control group • Online TD3: with task reward, and free environment interactions • Offline TD3: with task reward, and training from the replay buffer  FB and Lap show superior performance than other methods • FB and LaP reached about 83% and 76% of supervised offline TD3 performance Results – Zero-shot performance of proposed methods Average scores over tasks for each env Average plots of zero-shot results(task, env, buffer, seeds)
