Safe and Efficient Off-Policy
Reinforcement Learning
NIPS 2016
Yasuhiro Fujita
Preferred Networks Inc.
January 19, 2017
Safe and Efficient Off-Policy Reinforcement Learning
by Remi Munos, Thomas Stepleton, Anna Harutyunyan and Marc G. Bellemare
▶ Off-policy RL: learning the value function for one policy π,
  Qπ(x, a) = Eπ[r1 + γr2 + γ^2 r3 + · · · | x0 = x, a0 = a],
  from data collected by another policy µ ̸= π
▶ Retrace(λ): a new off-policy multi-step RL algorithm
▶ Theoretical advantages
+ It converges for any π, µ (safe)
+ It makes the best use of samples if π and µ are close to
each other (efficient)
+ Its variance is lower than importance sampling
▶ Empirical evaluation
▶ On Atari 2600 it beats one-step Q-learning (DQN) and
the existing multi-step methods (Q∗(λ), Tree-Backup)
Notation and definitions
▶ state x ∈ X
▶ action a ∈ A
▶ discount factor γ ∈ [0, 1]
▶ immediate reward r ∈ R
▶ policies π, µ : X × A → [0, 1]
▶ value function Qπ(x, a) := Eπ[r1 + γr2 + γ^2 r3 + · · · | x0 = x, a0 = a]
▶ optimal value function Q∗ := maxπ Qπ
▶ EπQ(x, ·) := ∑a π(a|x)Q(x, a)
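To make the last definition concrete, here is a minimal tabular sketch (mine, not from the slides or the paper): Q is stored as an |X| × |A| NumPy array and π as a row-stochastic |X| × |A| array; the helper name expected_q is my own.

```python
import numpy as np

def expected_q(Q: np.ndarray, pi: np.ndarray, x: int) -> float:
    """EπQ(x, ·) = Σa π(a|x) Q(x, a) for tabular Q (|X| x |A|) and policy pi (|X| x |A|)."""
    return float(np.dot(pi[x], Q[x]))
```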
Policy evaluation
▶ Learning the value function for a policy π:
  Qπ(x, a) = Eπ[r1 + γr2 + γ^2 r3 + · · · | x0 = x, a0 = a]
▶ You can learn optimal control if π is greedy with respect to the current estimate Q(x, a), e.g. Q-learning
▶ On-policy: learning from data collected by π
▶ Off-policy: learning from data collected by µ ̸= π
▶ Off-policy methods have advantages:
+ Sample-efficient (e.g. experience replay)
+ Exploration by µ
On-policy multi-step methods
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ Temporal difference (or “surprise”) at t:
δt = rt + γQ(xt+1, at+1) − Q(xt, at)
▶ You can use δt to estimate Qπ(xt, at) (one-step)
▶ Can you use δt to estimate Qπ(xs, as) for all s ≤ t? (multi-step)
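As a minimal sketch of the one-step case (assuming a tabular Q as in the notation sketch above; the function and argument names are mine):

```python
def td_error(Q, x_t, a_t, r_t, x_next, a_next, gamma=0.99):
    """On-policy TD error: δt = rt + γ Q(xt+1, at+1) − Q(xt, at)."""
    return r_t + gamma * Q[x_next, a_next] - Q[x_t, a_t]
```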
TD(λ)
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ A popular multi-step algorithm for on-policy policy
evaluation
▶ ∆tQ(x, a) = (γλ)^t δt, where λ ∈ [0, 1] is chosen to balance bias and variance
▶ Multi-step methods have advantages:
+ Rewards are propagated rapidly
+ Bias introduced by bootstrapping is reduced
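A forward-view sketch of the resulting update for the first state-action pair of an on-policy trajectory, assuming the TD errors δ0, δ1, ... have already been computed (e.g. with td_error above); this is illustrative only, not the authors' code.

```python
def td_lambda_increment(deltas, gamma=0.99, lam=0.9):
    """ΔQ(x0, a0) = Σt (γλ)^t δt, accumulated over one trajectory."""
    return sum((gamma * lam) ** t * delta for t, delta in enumerate(deltas))
```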
Off-policy multi-step algorithm
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ δt = rt + γEπQ(xt+1, ·) − Q(xt, at)
▶ You can use δt to estimate Qπ(xt, at), e.g. Q-learning
▶ Can you use δt to estimate Qπ(xs, as) for all s ≤ t?
▶ δt might be less relevant to Qπ(xs, as) compared to the on-policy case
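The same TD error with the bootstrap term replaced by an expectation under π, as on this slide; a minimal tabular sketch under the same assumptions as before (NumPy arrays for Q and π; the names are mine).

```python
import numpy as np

def off_policy_td_error(Q, pi, x_t, a_t, r_t, x_next, gamma=0.99):
    """δt = rt + γ EπQ(xt+1, ·) − Q(xt, at), with EπQ(x, ·) = Σa π(a|x) Q(x, a)."""
    return r_t + gamma * float(np.dot(pi[x_next], Q[x_next])) - Q[x_t, a_t]
```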
Importance Sampling (IS) [Precup et al. 2000]
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆tQ(x, a) = γ^t (∏1≤s≤t π(as|xs)/µ(as|xs)) δt
+ Unbiased estimate of Qπ
− Large (possibly infinite) variance since π(as|xs)/µ(as|xs) is not bounded
Qπ(λ) [Harutyunyan et al. 2016]
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆tQ(x, a) = (γλ)^t δt
+ Convergent if µ and π are sufficiently close to each other or λ is sufficiently small:
  λ < (1 − γ)/(γϵ), where ϵ := maxx ∥π(·|x) − µ(·|x)∥1
− Not convergent otherwise
Tree-Backup (TB) [Precup et al. 2000]
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆tQ(x, a) = (γλ)^t (∏1≤s≤t π(as|xs)) δt
+ Convergent for any π and µ
+ Works even if µ is unknown and/or non-Markov
− ∏1≤s≤t π(as|xs) decays rapidly even when near on-policy
A unified view
▶ General algorithm: ∆Q(x, a) = ∑t≥0 γ^t (∏1≤s≤t cs) δt
▶ None of the existing methods has all of the desired properties:
▶ Low variance (unlike IS)
▶ “Safe”, i.e. convergent for any π and µ (unlike Qπ(λ))
▶ “Efficient”, i.e. uses full returns when on-policy (unlike Tree-Backup)
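The general algorithm above can be written as one update routine with a pluggable trace coefficient cs; each of the four methods simply chooses cs differently. Below is a minimal sketch under my own assumptions (tabular Q, π and µ stored as |X| × |A| probability tables, a trajectory of (x, a, r, x′) tuples collected under µ, and the expectation over trajectories replaced by a single sample); it is not the authors' implementation, and all function names are mine.

```python
import numpy as np

def expected_q(Q, pi, x):
    """EπQ(x, ·) = Σa π(a|x) Q(x, a)."""
    return float(np.dot(pi[x], Q[x]))

def delta(Q, pi, gamma, x, a, r, x_next):
    """Off-policy corrected TD error: δt = rt + γ EπQ(xt+1, ·) − Q(xt, at)."""
    return r + gamma * expected_q(Q, pi, x_next) - Q[x, a]

# Trace coefficients cs for the four algorithms on the preceding slides.
def c_is(pi, mu, lam, x, a):           # Importance Sampling
    return pi[x, a] / mu[x, a]

def c_q_pi_lambda(pi, mu, lam, x, a):  # Qπ(λ)
    return lam

def c_tree_backup(pi, mu, lam, x, a):  # Tree-Backup(λ)
    return lam * pi[x, a]

def c_retrace(pi, mu, lam, x, a):      # Retrace(λ)
    return lam * min(1.0, pi[x, a] / mu[x, a])

def unified_update(Q, pi, mu, trajectory, gamma=0.99, lam=1.0, coeff=c_retrace):
    """ΔQ(x0, a0) = Σt≥0 γ^t (∏1≤s≤t cs) δt, estimated from one trajectory sampled under µ."""
    total, discount, trace = 0.0, 1.0, 1.0
    for t, (x, a, r, x_next) in enumerate(trajectory):
        if t > 0:                      # the product runs over 1 ≤ s ≤ t, so skip s = 0
            trace *= coeff(pi, mu, lam, x, a)
        total += discount * trace * delta(Q, pi, gamma, x, a, r, x_next)
        discount *= gamma
    return total
```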
Choice of the coefficients cs
▶ Contraction speed
▶ Consider a general operator R:
  RQ(x, a) = Q(x, a) + Eµ[∑t≥0 γ^t (∏1≤s≤t cs) δt]
▶ If 0 ≤ cs ≤ π(as|xs)/µ(as|xs), then R is a contraction and Qπ is its fixed point (thus the algorithm is “safe”):
  |RQ(x, a) − Qπ(x, a)| ≤ η(x, a)∥Q − Qπ∥
  η(x, a) := 1 − (1 − γ)Eµ[∑t≥0 γ^t (∏1≤s≤t cs)]
▶ η = 0 for cs = 1 (“efficient”)
▶ Variance
▶ cs ≤ 1 results in low variance since ∏1≤s≤t cs ≤ 1
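As a one-line check of the “efficient” claim above (assuming γ < 1): with cs = 1 the inner sum is a plain geometric series, so the contraction coefficient vanishes.

```latex
\eta(x,a) \;=\; 1-(1-\gamma)\,\mathbb{E}_\mu\Big[\textstyle\sum_{t\ge 0}\gamma^{t}\prod_{s=1}^{t}c_{s}\Big]
\;\stackrel{c_{s}\equiv 1}{=}\; 1-(1-\gamma)\sum_{t\ge 0}\gamma^{t}
\;=\; 1-(1-\gamma)\cdot\frac{1}{1-\gamma} \;=\; 0 .
```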
Retrace(λ)
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆Q(x, a) = γ^t (∏1≤s≤t λ min(1, π(as|xs)/µ(as|xs))) δt
+ Variance is bounded
+ Convergent for any π and µ
+ Uses full returns when on-policy
− Doesn’t work if µ is unknown or non-Markov (unlike Tree-Backup)
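Continuing the sketch given after the “unified view” slide, Retrace(λ) is just that update with cs = λ min(1, π(as|xs)/µ(as|xs)). A toy usage example with random tables follows; all names and numbers are mine, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
Q = rng.normal(size=(n_states, n_actions))                 # current estimate
pi = rng.dirichlet(np.ones(n_actions), size=n_states)      # target policy π
mu = rng.dirichlet(np.ones(n_actions), size=n_states)      # behaviour policy µ

# One short trajectory of (x, a, r, x') tuples collected under µ.
traj = [(0, 1, 1.0, 2), (2, 0, 0.0, 3), (3, 2, 1.0, 4)]

dq = unified_update(Q, pi, mu, traj, gamma=0.99, lam=1.0, coeff=c_retrace)
Q[traj[0][0], traj[0][1]] += 0.5 * dq                       # apply with step size 0.5
```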
Evaluation on Atari 2600
▶ Trained asynchronously with 16 CPU threads [Mnih et al. 2016]
▶ Each thread has private replay memory holding 62,500
transitions
▶ Q-learning uses a minibatch of 64 transitions
▶ Retrace, TB and Q∗(λ) (a control version of Qπ(λ)) use four 16-step sub-sequences
Performance comparison
▶ Inter-algorithm scores are normalized so that 0 and 1
respectively correspond to the worst and best scores for a
particular game
▶ λ = 1 performs best, except for Q∗(λ)
▶ Retrace(λ) performs best on 30 out of 60 games
Sensitivity to the value of λ
▶ Retrace(λ) is robust and consistently outperforms Tree-Backup
▶ Q∗(λ) performs best for small values of λ
▶ Note that the Q-learning scores are fixed across different λ
Conclusions
▶ Retrace(λ)
▶ is an off-policy multi-step value-based RL algorithm
▶ is low-variance, safe and efficient
▶ outperforms one-step Q-learning and existing multi-step
variants on Atari 2600
▶ (has already been applied to A3C in another paper [Wang et al. 2016])
References I
[1] Anna Harutyunyan et al. “Q(λ) with Off-Policy Corrections”. In: Proceedings
of Algorithmic Learning Theory (ALT). 2016. arXiv: 1602.04951.
[2] Volodymyr Mnih et al. “Asynchronous Methods for Deep Reinforcement Learning”. 2016. arXiv: 1602.01783.
[3] Doina Precup, Richard S Sutton, and Satinder P Singh. “Eligibility Traces for
Off-Policy Policy Evaluation”. In: ICML ’00: Proceedings of the Seventeenth
International Conference on Machine Learning (2000), pp. 759–766.
[4] Ziyu Wang et al. “Sample Efficient Actor-Critic with Experience Replay”. In:
arXiv (2016), pp. 1–20. arXiv: 1611.01224.