Hierarchical Reinforcement Learning with Option-Critic Architecture
Oğuz Şerbetci
April 4, 2018
Modelling of Cognitive Processes
TU Berlin
Outline

• Reinforcement Learning
• Hierarchical Reinforcement Learning
• Demonstration
• Resources
• Appendix
Reinforcement Learning

The agent-environment loop: at each time step t, the agent takes an action a_t in the environment, and the environment returns the next state s_t and a reward r_t.
MDP: $\langle S, A, p(s' \mid s, a), r(s, a) \rangle$

Policy: $\pi(a \mid s) : S \times A \to [0, 1]$

Goal: find an optimal policy $\pi^*$ that maximizes $\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right]$
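To make the formalism concrete, here is a minimal sketch of the interaction loop and the discounted return, assuming a hypothetical Gym-style `env` whose `step()` returns the next state, the reward, and a done flag (none of this is from the original slides):

```python
# A minimal sketch of the agent-environment loop, assuming a hypothetical
# Gym-style `env` with reset()/step() and a callable `policy`.
def run_episode(env, policy, gamma=0.99):
    """Roll out one episode and accumulate the discounted return sum_t gamma^t r_t."""
    s = env.reset()
    ret, discount, done = 0.0, 1.0, False
    while not done:
        a = policy(s)               # agent picks a_t ~ pi(.|s_t)
        s, r, done = env.step(a)    # environment returns s_{t+1} and r_t
        ret += discount * r
        discount *= gamma
    return ret
```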
Problems

Flat reinforcement learning suffers from:

• lack of planning and commitment
• inefficient exploration
• temporal credit assignment problem
• inability to divide-and-conquer
Hierarchical Reinforcement Learning
Temporal Abstractions

Temporally extended behaviours (e.g. "make coffee") abstract over long sequences of primitive actions.

Icons made by Smashicons and Freepik from Freepik.
Options Framework (Sutton, Precup, et al. 1999)

SMDP: $\langle S, A, p(s', k \mid s, a), r(s, a) \rangle$, where $k$ is the (random) number of steps between decision points.

An option $\omega$ consists of:
• $I_\omega$: initiation set
• $\pi_\omega$: intra-option policy
• $\beta_\omega$: termination policy

Options are selected by the policy over options $\pi_\Omega$.
Option-Critic (Bacon et al. 2017)

Given the number of options, Option-Critic learns $\beta_\omega$, $\pi_\omega$, and $\pi_\Omega$ in an end-to-end, online fashion. It allows non-linear function approximators (deep RL), enabling continuous state and action spaces.
Value Functions

The state value function:
$$V_\pi(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right] = \sum_{a \in A_s} \pi(a \mid s) \left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V_\pi(s') \right]$$

The action value function:
$$Q_\pi(s, a) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a\right] = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \sum_{a' \in A} \pi(a' \mid s')\, Q_\pi(s', a')$$

These recursive forms are the Bellman Equations (Bellman 1952).
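As a worked illustration, a minimal sketch of iterative policy evaluation, which applies the Bellman equation for $V_\pi$ as a repeated backup; the tabular containers `P`, `R`, and `pi` are illustrative names, not from the slides:

```python
# A minimal sketch of iterative policy evaluation with the Bellman equation.
# Tabular and illustrative: P[s][a] is a list of (prob, s_next) pairs,
# R[s][a] a reward, pi[s][a] an action probability.
def policy_evaluation(states, actions, P, R, pi, gamma=0.99, tol=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # V(s) = sum_a pi(a|s) [ r(s,a) + gamma * sum_s' p(s'|s,a) V(s') ]
            v = sum(pi[s][a] * (R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V
```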
Value methods

TD Learning:
$$Q_\pi(s, a) \leftarrow Q_\pi(s, a) + \alpha \underbrace{\big[ \overbrace{r + \gamma V_\pi(s')}^{\text{TD target}} - Q_\pi(s, a) \big]}_{\text{TD error}}$$

Q-Learning: $V_\pi(s') = \max_a Q(s', a)$

Greedy policy: $\pi(a \mid s) = \operatorname{argmax}_a Q(s, a)$

$\varepsilon$-greedy policy:
$$\pi(a \mid s) = \begin{cases} \text{random action} & \text{with probability } \varepsilon \\ \operatorname{argmax}_a Q(s, a) & \text{with probability } 1 - \varepsilon \end{cases}$$
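A minimal sketch of tabular Q-learning with an $\varepsilon$-greedy policy, reusing the hypothetical Gym-style `env` from the earlier sketch:

```python
# Tabular Q-learning: epsilon-greedy behaviour, max-based TD target.
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore with probability eps, else act greedily
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[s, a_])
            s2, r, done = env.step(a)
            # TD target r + gamma * max_a' Q(s', a'); the difference is the TD error
            target = r if done else r + gamma * max(Q[s2, a_] for a_ in actions)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q
```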
Option Value Functions

The option value function:
$$Q_\Omega(s, \omega) = \sum_a \pi_\omega(a \mid s)\, Q_U(s, \omega, a)$$

The action value function:
$$Q_U(s, \omega, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, U(\omega, s')$$

The value upon arrival in $s'$ (continue with $\omega$, or terminate and fall back to $V_\Omega$):
$$U(\omega, s') = (1 - \beta_\omega(s'))\, Q_\Omega(s', \omega) + \beta_\omega(s')\, V_\Omega(s')$$
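A minimal sketch of $U(\omega, s')$, assuming tabular arrays `Q_omega[s, w]` and `beta[s, w]` (illustrative names) and taking $V_\Omega$ greedily over options, as on the slides that follow:

```python
# Value upon arrival: continue with option w unless it terminates.
import numpy as np

def value_upon_arrival(Q_omega, beta, s_next, w):
    """U(w, s') = (1 - beta) * Q_Omega(s', w) + beta * V_Omega(s')."""
    V_Omega = Q_omega[s_next].max()   # greedy over options, e.g. max_w Q_Omega(s', w)
    return (1.0 - beta[s_next, w]) * Q_omega[s_next, w] + beta[s_next, w] * V_Omega
```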
Policy Gradient Methods

Instead of deriving the policy from value estimates, $\pi(a \mid s) = \operatorname{argmax}_a Q(s, a)$, parametrize it directly:
$$\pi(a \mid s, \theta) = \operatorname{softmax}_a h(s, a, \theta)$$

Objective: the performance $J(\theta) = V_{\pi_\theta}(s_0)$, improved by ascending the gradient $\nabla_\theta J(\theta)$.

Policy Gradient Theorem (Sutton, McAllester, et al. 2000):
$$\nabla_\theta J(\theta) = \sum_s \mu_\pi(s) \sum_a Q_\pi(s, a)\, \nabla_\theta \pi(a \mid s, \theta) = \mathbb{E}\!\left[ \gamma^t \sum_a Q_\pi(s, a)\, \nabla_\theta \pi(a \mid s, \theta) \right]$$
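A minimal REINFORCE-style sketch for a tabular softmax policy, making the slide's $h(s, a, \theta)$ concrete as a lookup table of preferences `h[s]` (an array over actions); all names here are illustrative:

```python
# REINFORCE with a tabular softmax policy: Monte Carlo return times
# the log-policy gradient.
import numpy as np

def softmax_policy(h, s):
    prefs = h[s] - h[s].max()         # stabilized softmax over actions
    p = np.exp(prefs)
    return p / p.sum()

def reinforce_update(h, episode, alpha=0.01, gamma=0.99):
    """episode: list of (s, a, r) tuples from one rollout."""
    G = 0.0
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = r + gamma * G                 # return from step t
        p = softmax_policy(h, s)
        grad_log = -p                     # d/dh log softmax(h)_a = 1{a} - p
        grad_log[a] += 1.0
        h[s] = h[s] + alpha * (gamma ** t) * G * grad_log
    return h
```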
Actor-Critic (Sutton 1984)

A learned critic supplies the TD error $\delta$ as the learning signal for the actor:
$$\theta \leftarrow \theta + \alpha \gamma^t\, \underbrace{\delta}_{\text{TD error}}\, \nabla_\theta \log \pi(a \mid s, \theta)$$

Figure taken from Pierre-Luc Bacon.
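A minimal one-step actor-critic sketch with a tabular critic `V` and a softmax actor, reusing the illustrative `softmax_policy` from the sketch above:

```python
# One-step actor-critic: the critic's TD error drives both updates.
def actor_critic_step(h, V, s, a, r, s2, done,
                      alpha_theta=0.01, alpha_v=0.1, gamma=0.99, discount=1.0):
    # Critic: TD error delta = r + gamma * V(s') - V(s)
    delta = r + (0.0 if done else gamma * V[s2]) - V[s]
    V[s] += alpha_v * delta
    # Actor: theta <- theta + alpha * gamma^t * delta * grad log pi(a|s, theta)
    p = softmax_policy(h, s)
    grad_log = -p
    grad_log[a] += 1.0
    h[s] = h[s] + alpha_theta * discount * delta * grad_log
    return delta
```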
Option-Critic (Bacon et al. 2017)

Architecture diagram taken from (Bacon et al. 2017).
The gradient w.r.t. the intra-option-policy parameters $\theta$:
$$\nabla_\theta Q_\Omega(s, \omega) = \mathbb{E}\!\left[ \frac{\partial \log \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a) \right]$$
Effect: take better primitive actions inside options.

The gradient w.r.t. the termination-policy parameters $\vartheta$:
$$\nabla_\vartheta U(\omega, s') = \mathbb{E}\!\left[ -\frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta} \big( Q_\Omega(s', \omega) - V_\Omega(s') \big) \right], \quad \text{e.g. } V_\Omega(s') = \max_\omega Q_\Omega(s', \omega)$$
Effect: shorten options with a poor advantage.
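A minimal sketch of the termination-gradient step for a sigmoid termination policy $\beta_\omega(s') = \sigma(\vartheta_{s',\omega})$, an illustrative parametrization not specified on the slides:

```python
# Termination gradient step: ascend U, i.e. descend beta along the advantage.
import numpy as np

def termination_update(vartheta, Q_omega, s2, w, alpha=0.01):
    beta = 1.0 / (1.0 + np.exp(-vartheta[s2, w]))
    advantage = Q_omega[s2, w] - Q_omega[s2].max()     # Q_Omega - V_Omega
    # vartheta += alpha * (-dbeta/dvartheta) * advantage,
    # with dbeta/dvartheta = beta * (1 - beta) for the sigmoid
    vartheta[s2, w] -= alpha * beta * (1.0 - beta) * advantage
```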
Demonstration
Complex Environment i (Bacon et al. 2017)

Complex Environment ii (Harb et al. 2017)

But... i

But... ii (Dilokthanakul et al. 2017)
Resources
• Sutton & Barto, Reinforcement Learning: An Introduction, Second Edition Draft
• David Silver's Reinforcement Learning Course
References

Bacon, Pierre-Luc, Jean Harb, and Doina Precup (2017). "The Option-Critic architecture". In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pp. 1726–1734.

Bellman, Richard (1952). "On the theory of dynamic programming". In: Proceedings of the National Academy of Sciences 38.8, pp. 716–719. doi: 10.1073/pnas.38.8.716.

Dilokthanakul, N., C. Kaplanis, N. Pawlowski, and M. Shanahan (2017). "Feature Control as Intrinsic Motivation for Hierarchical Reinforcement Learning". In: ArXiv e-prints. arXiv: 1705.06769 [cs.LG].

Harb, Jean, Pierre-Luc Bacon, Martin Klissarov, and Doina Precup (2017). "When waiting is not an option: Learning options with a deliberation cost". arXiv: 1709.04571.

Sutton, Richard S (1984). "Temporal credit assignment in reinforcement learning". PhD thesis. AAI8410337.

Sutton, Richard S, David A McAllester, Satinder P Singh, and Yishay Mansour (2000). "Policy gradient methods for reinforcement learning with function approximation". In: Advances in Neural Information Processing Systems, pp. 1057–1063.

Sutton, Richard S, Doina Precup, and Satinder Singh (1999). "Between MDPs and Semi-MDPs: A framework for temporal abstraction in reinforcement learning". In: Artificial Intelligence 112.1-2, pp. 181–211. doi: 10.1016/S0004-3702(99)00052-1.
Appendix
Option-Critic (Bacon et al. 2017): learning algorithm

procedure TRAIN(α, N_Ω)
    s ← s_0
    choose ω ∼ π_Ω(ω | s)                              ▷ option-policy
    repeat
        choose a ∼ π_{ω,θ}(a | s)                      ▷ intra-option policy
        take action a in s, observe s' and r

        1. Option evaluation:
            g ← r                                       ▷ TD target
            if s' is not terminal then
                g ← g + γ (1 − β_{ω,ϑ}(s')) Q_Ω(s', ω) + γ β_{ω,ϑ}(s') max_{ω'} Q_Ω(s', ω')

        2. Critic improvement:
            δ_U ← g − Q_U(s, ω, a)
            Q_U(s, ω, a) ← Q_U(s, ω, a) + α_U δ_U

        3. Intra-option Q-learning:
            δ_Ω ← g − Q_Ω(s, ω)
            Q_Ω(s, ω) ← Q_Ω(s, ω) + α_Ω δ_Ω

        4. Options improvement:
            θ ← θ + α_θ (∂ log π_{ω,θ}(a | s) / ∂θ) Q_U(s, ω, a)
            ϑ ← ϑ − α_ϑ (∂β_{ω,ϑ}(s') / ∂ϑ) (Q_Ω(s', ω) − max_{ω'} Q_Ω(s', ω') + ξ)

        if terminate ∼ β_{ω,ϑ}(s') then                ▷ termination policy
            choose ω ∼ π_Ω(ω | s')
        s ← s'
    until s is terminal
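A tabular Python sketch of the procedure above, assuming a hypothetical Gym-style `env` with discrete states and actions; an $\varepsilon$-greedy policy over $Q_\Omega$ stands in for $\pi_\Omega$, and the sigmoid/softmax parametrizations are illustrative choices, not the authors' reference code:

```python
# Tabular Option-Critic training loop (illustrative sketch).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def train(env, n_states, n_actions, n_options, episodes=500, gamma=0.99,
          alpha_U=0.5, alpha_O=0.5, alpha_theta=0.01, alpha_vt=0.01,
          eps=0.1, xi=0.01):
    Q_U = np.zeros((n_states, n_options, n_actions))    # critic Q_U(s, w, a)
    Q_O = np.zeros((n_states, n_options))               # Q_Omega(s, w)
    theta = np.zeros((n_states, n_options, n_actions))  # intra-option policies
    vartheta = np.zeros((n_states, n_options))          # termination policies

    def pick_option(s):  # epsilon-greedy stand-in for pi_Omega
        if np.random.rand() < eps:
            return np.random.randint(n_options)
        return int(Q_O[s].argmax())

    for _ in range(episodes):
        s, done = env.reset(), False
        w = pick_option(s)
        while not done:
            pi_w = softmax(theta[s, w])
            a = np.random.choice(n_actions, p=pi_w)
            s2, r, done = env.step(a)

            # 1. Option evaluation: one-step TD target g
            g = r
            if not done:
                beta = sigmoid(vartheta[s2, w])
                g += gamma * ((1 - beta) * Q_O[s2, w] + beta * Q_O[s2].max())

            # 2. Critic improvement / 3. intra-option Q-learning
            Q_U[s, w, a] += alpha_U * (g - Q_U[s, w, a])
            Q_O[s, w] += alpha_O * (g - Q_O[s, w])

            # 4. Options improvement: intra-option policy gradient ...
            grad_log = -pi_w
            grad_log[a] += 1.0
            theta[s, w] += alpha_theta * grad_log * Q_U[s, w, a]
            # ... and termination gradient with margin xi
            if not done:
                beta = sigmoid(vartheta[s2, w])
                adv = Q_O[s2, w] - Q_O[s2].max() + xi
                vartheta[s2, w] -= alpha_vt * beta * (1 - beta) * adv
                if np.random.rand() < beta:   # option terminates in s'
                    w = pick_option(s2)
            s = s2
    return Q_O, theta, vartheta
```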