Hierarchical Reinforcement Learning with Option-Critic Architecture
Oğuz Şerbetci
April 4, 2018
Modelling of Cognitive Processes
TU Berlin
Outline

• Reinforcement Learning
• Hierarchical Reinforcement Learning
• Demonstration
• Resources
• Appendix
Reinforcement Learning

The agent-environment loop: at each time step t, the agent takes an action a_t in the environment, and the environment returns the next state s_t and a reward r_t.
MDP: $\langle S, A, p(s' \mid s, a), r(s, a) \rangle$

Policy: $\pi(a \mid s) : S \times A \to [0, 1]$

Goal: find an optimal policy $\pi^*$ that maximizes $\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right]$
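To make the formalism concrete, here is a minimal sketch of the interaction loop and the discounted return, assuming a hypothetical Gym-style `env` whose `step()` returns the next state, the reward, and a done flag (none of this is from the original slides):

```python
# A minimal sketch of the agent-environment loop, assuming a hypothetical
# Gym-style `env` with reset()/step() and a callable `policy`.
def run_episode(env, policy, gamma=0.99):
    """Roll out one episode and accumulate the discounted return sum_t gamma^t r_t."""
    s = env.reset()
    ret, discount, done = 0.0, 1.0, False
    while not done:
        a = policy(s)               # agent picks a_t ~ pi(.|s_t)
        s, r, done = env.step(a)    # environment returns s_{t+1} and r_t
        ret += discount * r
        discount *= gamma
    return ret
```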
Problems

Flat reinforcement learning suffers from:

• lack of planning and commitment
• inefficient exploration
• temporal credit assignment problem
• inability to divide-and-conquer
Hierarchical Reinforcement Learning
Temporal Abstractions

Temporally extended behaviours (e.g. "make coffee") abstract over long sequences of primitive actions.

Icons made by Smashicons and Freepik from Freepik.
Options Framework (Sutton, Precup, et al. 1999)

SMDP: $\langle S, A, p(s', k \mid s, a), r(s, a) \rangle$, where $k$ is the (random) number of steps between decision points.

An option $\omega$ consists of:
• $I_\omega$: initiation set
• $\pi_\omega$: intra-option policy
• $\beta_\omega$: termination policy

Options are selected by the policy over options $\pi_\Omega$.
Option-Critic (Bacon et al. 2017)

Given the number of options, Option-Critic learns $\beta_\omega$, $\pi_\omega$, and $\pi_\Omega$ in an end-to-end, online fashion. It allows non-linear function approximators (deep RL), enabling continuous state and action spaces.
Value Functions

The state value function:
$$V_\pi(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right] = \sum_{a \in A_s} \pi(a \mid s) \left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V_\pi(s') \right]$$

The action value function:
$$Q_\pi(s, a) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a\right] = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \sum_{a' \in A} \pi(a' \mid s')\, Q_\pi(s', a')$$

These recursive forms are the Bellman Equations (Bellman 1952).
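As a worked illustration, a minimal sketch of iterative policy evaluation, which applies the Bellman equation for $V_\pi$ as a repeated backup; the tabular containers `P`, `R`, and `pi` are illustrative names, not from the slides:

```python
# A minimal sketch of iterative policy evaluation with the Bellman equation.
# Tabular and illustrative: P[s][a] is a list of (prob, s_next) pairs,
# R[s][a] a reward, pi[s][a] an action probability.
def policy_evaluation(states, actions, P, R, pi, gamma=0.99, tol=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # V(s) = sum_a pi(a|s) [ r(s,a) + gamma * sum_s' p(s'|s,a) V(s') ]
            v = sum(pi[s][a] * (R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V
```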
Value methods

TD Learning:
$$Q_\pi(s, a) \leftarrow Q_\pi(s, a) + \alpha \underbrace{\big[ \overbrace{r + \gamma V_\pi(s')}^{\text{TD target}} - Q_\pi(s, a) \big]}_{\text{TD error}}$$

Q-Learning: $V_\pi(s') = \max_a Q(s', a)$

Greedy policy: $\pi(a \mid s) = \operatorname{argmax}_a Q(s, a)$

$\varepsilon$-greedy policy:
$$\pi(a \mid s) = \begin{cases} \text{random action} & \text{with probability } \varepsilon \\ \operatorname{argmax}_a Q(s, a) & \text{with probability } 1 - \varepsilon \end{cases}$$
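A minimal sketch of tabular Q-learning with an $\varepsilon$-greedy policy, reusing the hypothetical Gym-style `env` from the earlier sketch:

```python
# Tabular Q-learning: epsilon-greedy behaviour, max-based TD target.
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore with probability eps, else act greedily
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[s, a_])
            s2, r, done = env.step(a)
            # TD target r + gamma * max_a' Q(s', a'); the difference is the TD error
            target = r if done else r + gamma * max(Q[s2, a_] for a_ in actions)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q
```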
Option Value Functions

The option value function:
$$Q_\Omega(s, \omega) = \sum_a \pi_\omega(a \mid s)\, Q_U(s, \omega, a)$$

The action value function:
$$Q_U(s, \omega, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, U(\omega, s')$$

The value upon arrival in $s'$ (continue with $\omega$, or terminate and fall back to $V_\Omega$):
$$U(\omega, s') = (1 - \beta_\omega(s'))\, Q_\Omega(s', \omega) + \beta_\omega(s')\, V_\Omega(s')$$
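A minimal sketch of $U(\omega, s')$, assuming tabular arrays `Q_omega[s, w]` and `beta[s, w]` (illustrative names) and taking $V_\Omega$ greedily over options, as on the slides that follow:

```python
# Value upon arrival: continue with option w unless it terminates.
import numpy as np

def value_upon_arrival(Q_omega, beta, s_next, w):
    """U(w, s') = (1 - beta) * Q_Omega(s', w) + beta * V_Omega(s')."""
    V_Omega = Q_omega[s_next].max()   # greedy over options, e.g. max_w Q_Omega(s', w)
    return (1.0 - beta[s_next, w]) * Q_omega[s_next, w] + beta[s_next, w] * V_Omega
```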
Policy Gradient Methods

Instead of deriving the policy from value estimates, $\pi(a \mid s) = \operatorname{argmax}_a Q(s, a)$, parametrize it directly:
$$\pi(a \mid s, \theta) = \operatorname{softmax}_a h(s, a, \theta)$$

Objective: the performance $J(\theta) = V_{\pi_\theta}(s_0)$, improved by ascending the gradient $\nabla_\theta J(\theta)$.

Policy Gradient Theorem (Sutton, McAllester, et al. 2000):
$$\nabla_\theta J(\theta) = \sum_s \mu_\pi(s) \sum_a Q_\pi(s, a)\, \nabla_\theta \pi(a \mid s, \theta) = \mathbb{E}\!\left[ \gamma^t \sum_a Q_\pi(s, a)\, \nabla_\theta \pi(a \mid s, \theta) \right]$$
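A minimal REINFORCE-style sketch for a tabular softmax policy, making the slide's $h(s, a, \theta)$ concrete as a lookup table of preferences `h[s]` (an array over actions); all names here are illustrative:

```python
# REINFORCE with a tabular softmax policy: Monte Carlo return times
# the log-policy gradient.
import numpy as np

def softmax_policy(h, s):
    prefs = h[s] - h[s].max()         # stabilized softmax over actions
    p = np.exp(prefs)
    return p / p.sum()

def reinforce_update(h, episode, alpha=0.01, gamma=0.99):
    """episode: list of (s, a, r) tuples from one rollout."""
    G = 0.0
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = r + gamma * G                 # return from step t
        p = softmax_policy(h, s)
        grad_log = -p                     # d/dh log softmax(h)_a = 1{a} - p
        grad_log[a] += 1.0
        h[s] = h[s] + alpha * (gamma ** t) * G * grad_log
    return h
```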
Actor-Critic (Sutton 1984)

A learned critic supplies the TD error $\delta$ as the learning signal for the actor:
$$\theta \leftarrow \theta + \alpha \gamma^t\, \underbrace{\delta}_{\text{TD error}}\, \nabla_\theta \log \pi(a \mid s, \theta)$$

Figure taken from Pierre-Luc Bacon.
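A minimal one-step actor-critic sketch with a tabular critic `V` and a softmax actor, reusing the illustrative `softmax_policy` from the sketch above:

```python
# One-step actor-critic: the critic's TD error drives both updates.
def actor_critic_step(h, V, s, a, r, s2, done,
                      alpha_theta=0.01, alpha_v=0.1, gamma=0.99, discount=1.0):
    # Critic: TD error delta = r + gamma * V(s') - V(s)
    delta = r + (0.0 if done else gamma * V[s2]) - V[s]
    V[s] += alpha_v * delta
    # Actor: theta <- theta + alpha * gamma^t * delta * grad log pi(a|s, theta)
    p = softmax_policy(h, s)
    grad_log = -p
    grad_log[a] += 1.0
    h[s] = h[s] + alpha_theta * discount * delta * grad_log
    return delta
```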
Option-Critic (Bacon et al. 2017)

Architecture diagram taken from (Bacon et al. 2017).
The gradient w.r.t. the intra-option-policy parameters $\theta$:
$$\nabla_\theta Q_\Omega(s, \omega) = \mathbb{E}\!\left[ \frac{\partial \log \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a) \right]$$
Effect: take better primitive actions inside options.

The gradient w.r.t. the termination-policy parameters $\vartheta$:
$$\nabla_\vartheta U(\omega, s') = \mathbb{E}\!\left[ -\frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta} \big( Q_\Omega(s', \omega) - V_\Omega(s') \big) \right], \quad \text{e.g. } V_\Omega(s') = \max_\omega Q_\Omega(s', \omega)$$
Effect: shorten options with a poor advantage.
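A minimal sketch of the termination-gradient step for a sigmoid termination policy $\beta_\omega(s') = \sigma(\vartheta_{s',\omega})$, an illustrative parametrization not specified on the slides:

```python
# Termination gradient step: ascend U, i.e. descend beta along the advantage.
import numpy as np

def termination_update(vartheta, Q_omega, s2, w, alpha=0.01):
    beta = 1.0 / (1.0 + np.exp(-vartheta[s2, w]))
    advantage = Q_omega[s2, w] - Q_omega[s2].max()     # Q_Omega - V_Omega
    # vartheta += alpha * (-dbeta/dvartheta) * advantage,
    # with dbeta/dvartheta = beta * (1 - beta) for the sigmoid
    vartheta[s2, w] -= alpha * beta * (1.0 - beta) * advantage
```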
Demonstration
Complex Environment i (Bacon et al. 2017)

Complex Environment ii (Harb et al. 2017)

But... i

But... ii (Dilokthanakul et al. 2017)
Resources
• Sutton & Barto, Reinforcement Learning: An Introduction, Second Edition Draft
• David Silver's Reinforcement Learning Course
References

Bacon, Pierre-Luc, Jean Harb, and Doina Precup (2017). "The Option-Critic architecture". In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pp. 1726–1734.

Bellman, Richard (1952). "On the theory of dynamic programming". In: Proceedings of the National Academy of Sciences 38.8, pp. 716–719. doi: 10.1073/pnas.38.8.716.

Dilokthanakul, N., C. Kaplanis, N. Pawlowski, and M. Shanahan (2017). "Feature Control as Intrinsic Motivation for Hierarchical Reinforcement Learning". In: ArXiv e-prints. arXiv: 1705.06769 [cs.LG].

Harb, Jean, Pierre-Luc Bacon, Martin Klissarov, and Doina Precup (2017). "When waiting is not an option: Learning options with a deliberation cost". arXiv: 1709.04571.

Sutton, Richard S (1984). "Temporal credit assignment in reinforcement learning". PhD thesis. AAI8410337.

Sutton, Richard S, David A McAllester, Satinder P Singh, and Yishay Mansour (2000). "Policy gradient methods for reinforcement learning with function approximation". In: Advances in Neural Information Processing Systems, pp. 1057–1063.

Sutton, Richard S, Doina Precup, and Satinder Singh (1999). "Between MDPs and Semi-MDPs: A framework for temporal abstraction in reinforcement learning". In: Artificial Intelligence 112.1-2, pp. 181–211. doi: 10.1016/S0004-3702(99)00052-1.
Appendix
Option-Critic (Bacon et al. 2017): learning algorithm

procedure TRAIN(α, N_Ω)
    s ← s_0
    choose ω ∼ π_Ω(ω | s)                              ▷ option-policy
    repeat
        choose a ∼ π_{ω,θ}(a | s)                      ▷ intra-option policy
        take action a in s, observe s' and r

        1. Option evaluation:
            g ← r                                       ▷ TD target
            if s' is not terminal then
                g ← g + γ (1 − β_{ω,ϑ}(s')) Q_Ω(s', ω) + γ β_{ω,ϑ}(s') max_{ω'} Q_Ω(s', ω')

        2. Critic improvement:
            δ_U ← g − Q_U(s, ω, a)
            Q_U(s, ω, a) ← Q_U(s, ω, a) + α_U δ_U

        3. Intra-option Q-learning:
            δ_Ω ← g − Q_Ω(s, ω)
            Q_Ω(s, ω) ← Q_Ω(s, ω) + α_Ω δ_Ω

        4. Options improvement:
            θ ← θ + α_θ (∂ log π_{ω,θ}(a | s) / ∂θ) Q_U(s, ω, a)
            ϑ ← ϑ − α_ϑ (∂β_{ω,ϑ}(s') / ∂ϑ) (Q_Ω(s', ω) − max_{ω'} Q_Ω(s', ω') + ξ)

        if terminate ∼ β_{ω,ϑ}(s') then                ▷ termination policy
            choose ω ∼ π_Ω(ω | s')
        s ← s'
    until s is terminal
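A tabular Python sketch of the procedure above, assuming a hypothetical Gym-style `env` with discrete states and actions; an $\varepsilon$-greedy policy over $Q_\Omega$ stands in for $\pi_\Omega$, and the sigmoid/softmax parametrizations are illustrative choices, not the authors' reference code:

```python
# Tabular Option-Critic training loop (illustrative sketch).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def train(env, n_states, n_actions, n_options, episodes=500, gamma=0.99,
          alpha_U=0.5, alpha_O=0.5, alpha_theta=0.01, alpha_vt=0.01,
          eps=0.1, xi=0.01):
    Q_U = np.zeros((n_states, n_options, n_actions))    # critic Q_U(s, w, a)
    Q_O = np.zeros((n_states, n_options))               # Q_Omega(s, w)
    theta = np.zeros((n_states, n_options, n_actions))  # intra-option policies
    vartheta = np.zeros((n_states, n_options))          # termination policies

    def pick_option(s):  # epsilon-greedy stand-in for pi_Omega
        if np.random.rand() < eps:
            return np.random.randint(n_options)
        return int(Q_O[s].argmax())

    for _ in range(episodes):
        s, done = env.reset(), False
        w = pick_option(s)
        while not done:
            pi_w = softmax(theta[s, w])
            a = np.random.choice(n_actions, p=pi_w)
            s2, r, done = env.step(a)

            # 1. Option evaluation: one-step TD target g
            g = r
            if not done:
                beta = sigmoid(vartheta[s2, w])
                g += gamma * ((1 - beta) * Q_O[s2, w] + beta * Q_O[s2].max())

            # 2. Critic improvement / 3. intra-option Q-learning
            Q_U[s, w, a] += alpha_U * (g - Q_U[s, w, a])
            Q_O[s, w] += alpha_O * (g - Q_O[s, w])

            # 4. Options improvement: intra-option policy gradient ...
            grad_log = -pi_w
            grad_log[a] += 1.0
            theta[s, w] += alpha_theta * grad_log * Q_U[s, w, a]
            # ... and termination gradient with margin xi
            if not done:
                beta = sigmoid(vartheta[s2, w])
                adv = Q_O[s2, w] - Q_O[s2].max() + xi
                vartheta[s2, w] -= alpha_vt * beta * (1 - beta) * adv
                if np.random.rand() < beta:   # option terminates in s'
                    w = pick_option(s2)
            s = s2
    return Q_O, theta, vartheta
```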