Safe and Efficient Off-Policy
Reinforcement Learning
NIPS 2016
Yasuhiro Fujita
Preferred Networks Inc.
January 19, 2017
Safe and Efficient Off-Policy Reinforcement Learning
by Remi Munos, Thomas Stepleton, Anna Harutyunyan and Marc G. Bellemare
▶ Off-policy RL: learning the value function for one policy π,
  Qπ(x, a) = Eπ[r1 + γr2 + γ^2 r3 + · · · | x0 = x, a0 = a],
  from data collected by another policy µ ̸= π
▶ Retrace(λ): a new off-policy multi-step RL algorithm
▶ Theoretical advantages
+ It converges for any π, µ (safe)
+ It makes the best use of samples if π and µ are close to
each other (efficient)
+ Its variance is lower than importance sampling
▶ Empirical evaluation
▶ On Atari 2600 it beats one-step Q-learning (DQN) and
the existing multi-step methods (Q∗(λ), Tree-Backup)
Notation and definitions
▶ state x ∈ X
▶ action a ∈ A
▶ discount factor γ ∈ [0, 1]
▶ immediate reward r ∈ R
▶ policies π, µ : X × A → [0, 1]
▶ value function Qπ(x, a) := Eπ[r1 + γr2 + γ^2 r3 + · · · | x0 = x, a0 = a]
▶ optimal value function Q∗ := maxπ Qπ
▶ EπQ(x, ·) := ∑a π(a|x)Q(x, a)
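To make the last definition concrete, here is a minimal tabular sketch (mine, not from the slides or the paper): Q is stored as an |X| × |A| NumPy array and π as a row-stochastic |X| × |A| array; the helper name expected_q is my own.

```python
import numpy as np

def expected_q(Q: np.ndarray, pi: np.ndarray, x: int) -> float:
    """EπQ(x, ·) = Σa π(a|x) Q(x, a) for tabular Q (|X| x |A|) and policy pi (|X| x |A|)."""
    return float(np.dot(pi[x], Q[x]))
```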
Policy evaluation
▶ Learning the value function for a policy π:
  Qπ(x, a) = Eπ[r1 + γr2 + γ^2 r3 + · · · | x0 = x, a0 = a]
▶ You can learn optimal control if π is greedy with respect to the current estimate Q(x, a), e.g. Q-learning
▶ On-policy: learning from data collected by π
▶ Off-policy: learning from data collected by µ ̸= π
▶ Off-policy methods have advantages:
+ Sample-efficient (e.g. experience replay)
+ Exploration by µ
On-policy multi-step methods
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ Temporal difference (or “surprise”) at t:
δt = rt + γQ(xt+1, at+1) − Q(xt, at)
▶ You can use δt to estimate Qπ(xt, at) (one-step)
▶ Can you use δt to estimate Qπ(xs, as) for all s ≤ t? (multi-step)
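As a minimal sketch of the one-step case (assuming a tabular Q as in the notation sketch above; the function and argument names are mine):

```python
def td_error(Q, x_t, a_t, r_t, x_next, a_next, gamma=0.99):
    """On-policy TD error: δt = rt + γ Q(xt+1, at+1) − Q(xt, at)."""
    return r_t + gamma * Q[x_next, a_next] - Q[x_t, a_t]
```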
TD(λ)
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ A popular multi-step algorithm for on-policy policy
evaluation
▶ ∆tQ(x, a) = (γλ)^t δt, where λ ∈ [0, 1] is chosen to balance bias and variance
▶ Multi-step methods have advantages:
+ Rewards are propagated rapidly
+ Bias introduced by bootstrapping is reduced
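A forward-view sketch of the resulting update for the first state-action pair of an on-policy trajectory, assuming the TD errors δ0, δ1, ... have already been computed (e.g. with td_error above); this is illustrative only, not the authors' code.

```python
def td_lambda_increment(deltas, gamma=0.99, lam=0.9):
    """ΔQ(x0, a0) = Σt (γλ)^t δt, accumulated over one trajectory."""
    return sum((gamma * lam) ** t * delta for t, delta in enumerate(deltas))
```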
Off-policy multi-step algorithm
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ δt = rt + γEπQ(xt+1, ·) − Q(xt, at)
▶ You can use δt to estimate Qπ(xt, at), e.g. Q-learning
▶ Can you use δt to estimate Qπ(xs, as) for all s ≤ t?
▶ δt might be less relevant to Qπ(xs, as) compared to the on-policy case
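The same TD error with the bootstrap term replaced by an expectation under π, as on this slide; a minimal tabular sketch under the same assumptions as before (NumPy arrays for Q and π; the names are mine).

```python
import numpy as np

def off_policy_td_error(Q, pi, x_t, a_t, r_t, x_next, gamma=0.99):
    """δt = rt + γ EπQ(xt+1, ·) − Q(xt, at), with EπQ(x, ·) = Σa π(a|x) Q(x, a)."""
    return r_t + gamma * float(np.dot(pi[x_next], Q[x_next])) - Q[x_t, a_t]
```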
Importance Sampling (IS) [Precup et al. 2000]
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆tQ(x, a) = γ^t (∏1≤s≤t π(as|xs)/µ(as|xs)) δt
+ Unbiased estimate of Qπ
− Large (possibly infinite) variance since π(as|xs)/µ(as|xs) is not bounded
Qπ(λ) [Harutyunyan et al. 2016]
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆tQ(x, a) = (γλ)^t δt
+ Convergent if µ and π are sufficiently close to each other or λ is sufficiently small:
  λ < (1 − γ)/(γϵ), where ϵ := maxx ∥π(·|x) − µ(·|x)∥1
− Not convergent otherwise
Tree-Backup (TB) [Precup et al. 2000]
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆tQ(x, a) = (γλ)^t (∏1≤s≤t π(as|xs)) δt
+ Convergent for any π and µ
+ Works even if µ is unknown and/or non-Markov
− ∏1≤s≤t π(as|xs) decays rapidly even when near on-policy
A unified view
▶ General algorithm: ∆Q(x, a) = ∑t≥0 γ^t (∏1≤s≤t cs) δt
▶ None of the existing methods has all of the desired properties:
▶ Low variance (unlike IS)
▶ “Safe”, i.e. convergent for any π and µ (unlike Qπ(λ))
▶ “Efficient”, i.e. uses full returns when on-policy (unlike Tree-Backup)
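The general algorithm above can be written as one update routine with a pluggable trace coefficient cs; each of the four methods simply chooses cs differently. Below is a minimal sketch under my own assumptions (tabular Q, π and µ stored as |X| × |A| probability tables, a trajectory of (x, a, r, x′) tuples collected under µ, and the expectation over trajectories replaced by a single sample); it is not the authors' implementation, and all function names are mine.

```python
import numpy as np

def expected_q(Q, pi, x):
    """EπQ(x, ·) = Σa π(a|x) Q(x, a)."""
    return float(np.dot(pi[x], Q[x]))

def delta(Q, pi, gamma, x, a, r, x_next):
    """Off-policy corrected TD error: δt = rt + γ EπQ(xt+1, ·) − Q(xt, at)."""
    return r + gamma * expected_q(Q, pi, x_next) - Q[x, a]

# Trace coefficients cs for the four algorithms on the preceding slides.
def c_is(pi, mu, lam, x, a):           # Importance Sampling
    return pi[x, a] / mu[x, a]

def c_q_pi_lambda(pi, mu, lam, x, a):  # Qπ(λ)
    return lam

def c_tree_backup(pi, mu, lam, x, a):  # Tree-Backup(λ)
    return lam * pi[x, a]

def c_retrace(pi, mu, lam, x, a):      # Retrace(λ)
    return lam * min(1.0, pi[x, a] / mu[x, a])

def unified_update(Q, pi, mu, trajectory, gamma=0.99, lam=1.0, coeff=c_retrace):
    """ΔQ(x0, a0) = Σt≥0 γ^t (∏1≤s≤t cs) δt, estimated from one trajectory sampled under µ."""
    total, discount, trace = 0.0, 1.0, 1.0
    for t, (x, a, r, x_next) in enumerate(trajectory):
        if t > 0:                      # the product runs over 1 ≤ s ≤ t, so skip s = 0
            trace *= coeff(pi, mu, lam, x, a)
        total += discount * trace * delta(Q, pi, gamma, x, a, r, x_next)
        discount *= gamma
    return total
```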
Choice of the coefficients cs
▶ Contraction speed
▶ Consider a general operator R:
  RQ(x, a) = Q(x, a) + Eµ[∑t≥0 γ^t (∏1≤s≤t cs) δt]
▶ If 0 ≤ cs ≤ π(as|xs)/µ(as|xs), then R is a contraction and Qπ is its fixed point (thus the algorithm is “safe”):
  |RQ(x, a) − Qπ(x, a)| ≤ η(x, a)∥Q − Qπ∥
  η(x, a) := 1 − (1 − γ)Eµ[∑t≥0 γ^t (∏1≤s≤t cs)]
▶ η = 0 for cs = 1 (“efficient”)
▶ Variance
▶ cs ≤ 1 results in low variance since ∏1≤s≤t cs ≤ 1
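As a one-line check of the “efficient” claim above (assuming γ < 1): with cs = 1 the inner sum is a plain geometric series, so the contraction coefficient vanishes.

```latex
\eta(x,a) \;=\; 1-(1-\gamma)\,\mathbb{E}_\mu\Big[\textstyle\sum_{t\ge 0}\gamma^{t}\prod_{s=1}^{t}c_{s}\Big]
\;\stackrel{c_{s}\equiv 1}{=}\; 1-(1-\gamma)\sum_{t\ge 0}\gamma^{t}
\;=\; 1-(1-\gamma)\cdot\frac{1}{1-\gamma} \;=\; 0 .
```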
Retrace(λ)
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆Q(x, a) = γ^t (∏1≤s≤t λ min(1, π(as|xs)/µ(as|xs))) δt
+ Variance is bounded
+ Convergent for any π and µ
+ Uses full returns when on-policy
− Doesn’t work if µ is unknown or non-Markov (unlike Tree-Backup)
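Continuing the sketch given after the “unified view” slide, Retrace(λ) is just that update with cs = λ min(1, π(as|xs)/µ(as|xs)). A toy usage example with random tables follows; all names and numbers are mine, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
Q = rng.normal(size=(n_states, n_actions))                 # current estimate
pi = rng.dirichlet(np.ones(n_actions), size=n_states)      # target policy π
mu = rng.dirichlet(np.ones(n_actions), size=n_states)      # behaviour policy µ

# One short trajectory of (x, a, r, x') tuples collected under µ.
traj = [(0, 1, 1.0, 2), (2, 0, 0.0, 3), (3, 2, 1.0, 4)]

dq = unified_update(Q, pi, mu, traj, gamma=0.99, lam=1.0, coeff=c_retrace)
Q[traj[0][0], traj[0][1]] += 0.5 * dq                       # apply with step size 0.5
```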
Evaluation on Atari 2600
▶ Trained asynchronously with 16 CPU threads [Mnih et al. 2016]
▶ Each thread has private replay memory holding 62,500
transitions
▶ Q-learning uses a minibatch of 64 transitions
▶ Retrace, TB and Q∗(λ) (a control version of Qπ(λ)) use four 16-step sub-sequences
Performance comparison
▶ Inter-algorithm scores are normalized so that 0 and 1
respectively correspond to the worst and best scores for a
particular game
▶ λ = 1 performs best, except for Q∗(λ)
▶ Retrace(λ) performs best on 30 out of 60 games
Sensitivity to the value of λ
▶ Retrace(λ) is robust and consistently outperforms Tree-Backup
▶ Q∗(λ) performs best for small values of λ
▶ Note that the Q-learning scores are fixed across different λ
Conclusions
▶ Retrace(λ)
▶ is an off-policy multi-step value-based RL algorithm
▶ is low-variance, safe and efficient
▶ outperforms one-step Q-learning and existing multi-step
variants on Atari 2600
▶ (has already been applied to A3C in another paper [Wang et al. 2016])
References I
[1] Anna Harutyunyan et al. “Q(λ) with Off-Policy Corrections”. In: Proceedings
of Algorithmic Learning Theory (ALT). 2016. arXiv: 1602.04951.
[2] Volodymyr Mnih et al. “Asynchronous Methods for Deep Reinforcement Learning”. 2016. arXiv: 1602.01783.
[3] Doina Precup, Richard S Sutton, and Satinder P Singh. “Eligibility Traces for
Off-Policy Policy Evaluation”. In: ICML ’00: Proceedings of the Seventeenth
International Conference on Machine Learning (2000), pp. 759–766.
[4] Ziyu Wang et al. “Sample Efficient Actor-Critic with Experience Replay”. In:
arXiv (2016), pp. 1–20. arXiv: 1611.01224.