Hierarchical
Reinforcement Learning
   Amir massoud Farahmand
   Farahmand@SoloGen.net
Markov Decision
         Problems
• Markov Process: formulates a wide
  range of dynamical systems
• Goal: find a policy that optimizes an
  objective function
• [Stochastic] Dynamic Programming
• Planning: Known environment
• Learning: Unknown environment
MDP
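This slide's formulas are not in the extract; as a reminder of the standard formulation (my reconstruction, not the original slide content), an MDP and its Bellman optimality equation are

```latex
M = \langle S, A, P, R, \gamma \rangle, \qquad
V^{*}(s) = \max_{a \in A}\Big[ R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a)\, V^{*}(s') \Big].
```

Dynamic Programming solves this fixed point when P and R are known (planning); learning addresses the case where they are unknown.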
Reinforcement Learning
           (1)
• A very important Machine Learning
  method
• An approximate, online solution
  method for MDPs
  – Monte Carlo method
  – Stochastic Approximation
  – [Function Approximation]
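As a concrete illustration of the stochastic-approximation ingredient (my addition, not from the slide), the tabular TD(0) update replaces the Bellman expectation with a single sampled transition:

```latex
V(s_t) \leftarrow V(s_t) + \alpha_t \big[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big].
```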
Reinforcement Learning
          (2)
• Q-Learning and SARSA are among the
  most important RL algorithms (update rules
  sketched below)
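A minimal Python sketch of the two update rules (illustrative only; the tabular Q, states, and actions here are hypothetical, not the presenter's code):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Q-Learning (off-policy): bootstrap from the greedy action in s_next."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """SARSA (on-policy): bootstrap from the action actually taken in s_next."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

# Tiny 3-state, 2-action example table:
Q = np.zeros((3, 2))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
```

The only difference is the bootstrap target: the greedy maximum (Q-Learning) versus the next action actually chosen by the behavior policy (SARSA).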
Curses of DP
• Curse of Modeling
  – RL solves this problem
• Curse of Dimensionality
  – Value function approximation
  – Hierarchical methods
Hierarchical RL (1)
• Use some kind of hierarchy in order to …
  – Learn faster
  – Need fewer values to be updated (smaller
    storage requirements)
  – Incorporate the designer’s a priori knowledge
  – Increase reusability
  – Have a more meaningful structure than a mere
    Q-table
Hierarchical RL (2)
• Is there any unified meaning of
  hierarchy? NO!
• Different methods:
  –   Temporal abstraction
  –   State abstraction
  –   Behavioral decomposition
  –   …
Hierarchical RL (3)
•   Feudal Q-Learning [Dayan, Hinton]
•   Options [Sutton, Precup, Singh]
•   MaxQ [Dietterich]
•   HAM [Russell, Parr, Andre]
•   HexQ [Hengst]
•   Weakly-Coupled MDP [Bernstein, Dean & Lin, …]
•   Structure Learning in SSA [Farahmand, Nili]
•   …
Feudal Q-Learning
•   Divide each task into a few smaller sub-tasks
•   A state abstraction method
•   Different layers of managers
•   Each manager takes orders from its super-manager
    and gives orders to its sub-managers

                 Super-Manager
                       |
             Manager 1   Manager 2
                       |
          Sub-Manager 1   Sub-Manager 2
Feudal Q-Learning
• Principles of Feudal Q-Learning
   – Reward Hiding: Managers must reward sub-managers for
     doing their bidding whether or not this satisfies the commands
     of the super-managers. Sub-managers should just learn to obey
     their managers and leave it up to them to determine what it is
     best to do at the next level up.
   – Information Hiding: Managers only need to know the state
     of the system at the granularity of their own choices of tasks.
     Indeed, allowing some decision making to take place at a
     coarser grain is one of the main goals of the hierarchical
     decomposition. Information is hidden both downwards (sub-managers
     do not know the task the super-manager has set the manager) and
     upwards (a super-manager does not know what choices its manager
     has made to satisfy its command).
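A toy Python sketch of the two principles; every name here (the sub-goal test, the coarse-graining) is a hypothetical illustration, not part of Dayan and Hinton's formulation:

```python
def subgoal_reached(command, abstract_state):
    # Hypothetical check: the commanded abstract state was reached.
    return abstract_state == command

def submanager_reward(command, next_abstract_state, env_reward):
    """Reward hiding: the sub-manager is rewarded for obeying its manager,
    regardless of the external reward that higher levels care about."""
    del env_reward  # deliberately ignored by the sub-manager
    return 1.0 if subgoal_reached(command, next_abstract_state) else 0.0

def manager_view(full_state, granularity=4):
    """Information hiding: a manager observes the state only at the coarse
    granularity of its own choices of tasks."""
    return tuple(x // granularity for x in full_state)

print(submanager_reward(command=(1, 2), next_abstract_state=(1, 2), env_reward=-5.0))  # 1.0
print(manager_view((13, 7)))  # (3, 1)
```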
Feudal Q-Learning
Feudal Q-Learning
Options: Introduction
• People make decisions at
  different time scales
  – Traveling example
• It is desirable to have a method that
  supports such temporally extended
  actions across different time scales
Options: Concept
• Macro-actions
• Temporal abstraction method of Hierarchical RL
• Options are temporally extended actions, each
  consisting of a sequence of primitive
  actions
• Example:
   – Primitive actions: walking NSWE
   – Options: go to {door, corner, table, straight}
      • Options can be open-loop or closed-loop
• Semi-Markov Decision Process Theory [Puterman]
Options: Formal Definitions
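The equations of this slide are not in the extract; in the standard options framework of Sutton, Precup, and Singh, an option is the triple

```latex
o = \langle \mathcal{I}, \pi, \beta \rangle, \qquad
\mathcal{I} \subseteq S, \quad \pi : S \times A \to [0,1], \quad \beta : S \to [0,1],
```

where \mathcal{I} is the initiation set, \pi the intra-option policy, and \beta(s) the probability of terminating in state s.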
Options: Rise of SMDP!
• Theorem: MDP + Options = SMDP
Options: Value function
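A hedged reconstruction of the missing formula: for a policy \mu over options, the option-value function is

```latex
Q^{\mu}(s, o) = \mathbb{E}\Big[ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k}
  + \gamma^{k}\, V^{\mu}(s_{t+k}) \;\Big|\; s_t = s,\ o \text{ initiated at } t \Big],
```

where k is the (random) number of steps until the option terminates.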
Options:
Bellman-like optimality condition
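Again reconstructed from the standard framework rather than the slide itself: over an option set \mathcal{O},

```latex
Q^{*}_{\mathcal{O}}(s, o) = \mathbb{E}\Big[ r + \gamma^{k}
  \max_{o' \in \mathcal{O}_{s'}} Q^{*}_{\mathcal{O}}(s', o') \;\Big|\; s,\ o \Big],
```

with r the discounted reward accumulated while executing o and k its duration.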
Options: A simple example
Options: A simple example
Options: A simple example
Interrupting Options
• An option’s policy is followed until the option
  terminates.
• This condition is somewhat unnecessary
  – You may change your decision in the middle of
    executing your previous decision.
• Interruption Theorem: interrupting an option whenever
  switching has a higher value performs at least as well.
  Yes! It is better!
Interrupting Options:
     An example
Options: Other issues
• Intra-option {model, value} learning
  (update sketched below)
• Learning each option
  – Defining a sub-goal reward function
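A hedged reconstruction of the intra-option value-learning update (for Markov options, applied at every step whose executed primitive action agrees with option o's policy):

```latex
Q(s, o) \leftarrow Q(s, o) + \alpha \Big[ r + \gamma \big( (1 - \beta(s'))\, Q(s', o)
  + \beta(s') \max_{o'} Q(s', o') \big) - Q(s, o) \Big].
```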
MaxQ
• MaxQ Value Function Decomposition
• Loosely related to Feudal Q-Learning
• Decomposes the value function over a
  hierarchical task structure
MaxQ
MaxQ: Value decomposition
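The decomposition formulas are missing from the extract; in Dietterich's notation, for subtask i with child action a and completion function C,

```latex
Q(i, s, a) = V(a, s) + C(i, s, a), \qquad
V(i, s) =
\begin{cases}
\max_{a} Q(i, s, a) & i \text{ composite},\\
\sum_{s'} P(s' \mid s, i)\, R(s' \mid s, i) & i \text{ primitive},
\end{cases}
```

so the root value is the sum of the expected rewards earned inside each subtask plus the completion values along the path down the hierarchy.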
MaxQ: Existence theorem



• Recursively optimal policy.
• There may be many recursively optimal policies with different
  value functions.
• A recursively optimal policy is not necessarily an optimal policy.
• If H is a stationary macro hierarchy for MDP M, then all
  recursively optimal policies w.r.t. M and H have the same value.
MaxQ: Learning



• Theorem: If M is an MDP, H is a stationary macro hierarchy, the
  exploration policy is GLIE (Greedy in the Limit with Infinite
  Exploration), and the usual convergence conditions hold (bounded V
  and C, conditions on the learning rates α …), then with probability 1
  the algorithm MaxQ-0 will converge! (A recursive sketch follows.)
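A simplified, hedged Python sketch of the recursive MaxQ-0 algorithm (after Dietterich); the task hierarchy and the helpers is_primitive, children, terminated, and execute_primitive are hypothetical placeholders, not the presenter's code:

```python
import random

GAMMA, ALPHA, EPSILON = 0.99, 0.1, 0.1

def q_value(i, s, a, V, C):
    # Decomposed action value: Q(i, s, a) = V(a, s) + C(i, s, a)
    return V.get((a, s), 0.0) + C.get((i, s, a), 0.0)

def maxq0(i, s, env, V, C):
    """Execute subtask i from state s; return (next_state, primitive steps taken)."""
    if is_primitive(i):
        s_next, r = execute_primitive(env, i, s)
        V[(i, s)] = (1 - ALPHA) * V.get((i, s), 0.0) + ALPHA * r
        return s_next, 1

    steps = 0
    while not terminated(i, s):
        acts = children(i)
        if random.random() < EPSILON:            # GLIE-style exploration
            a = random.choice(acts)
        else:
            a = max(acts, key=lambda b: q_value(i, s, b, V, C))
        s_next, n = maxq0(a, s, env, V, C)       # recursively run the child
        best = max(q_value(i, s_next, b, V, C) for b in acts)
        # Completion-function update for subtask i
        C[(i, s, a)] = (1 - ALPHA) * C.get((i, s, a), 0.0) + ALPHA * (GAMMA ** n) * best
        steps += n
        s = s_next
    return s, steps
```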
MaxQ
• Faster learning: all-states updating
  – Similar to Kaelbling’s “all-goals
    updating”
MaxQ
MaxQ: State abstraction
• Advantages
  – Memory reduction
  – Less exploration is needed
  – Increased reusability, as a subtask does not
    depend on its ancestors in the hierarchy
• Is it possible?!
MaxQ: State abstraction
• Exact preservation of value function
• Approximate preservation
MaxQ: State abstraction
• Does it converge?
  – It has not been proved formally yet.
• What can we do if we want to use an
  abstraction that violates Theorem 3?
  – Reward function decomposition
    • Design a reward function that reinforces
      the responsible parts of the architecture.
MaxQ: Other issues
• Undesired terminal states
• Non-hierarchical execution (polling
  execution)
  – Better performance
  – Computationally intensive
Learning in Subsumption
      Architecture
• Structure learning
  – How should behaviors be arranged in the
    architecture?
• Behavior learning
  – How should a single behavior act?
• Structure/Behavior learning
SSA: Purely Parallel Case
  [Figure: purely parallel case. The sensors feed all behaviors
   in parallel: manipulate the world, build maps, explore,
   avoid obstacles, locomote.]
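A toy Python sketch of this purely parallel arrangement with a hypothetical fixed-priority arbiter: every behavior reads the sensors, and a higher behavior's output suppresses the lower ones when it is active. The behavior bodies are placeholders; only the suppression scheme is the point.

```python
def locomote(sensors):         return "move forward"
def avoid_obstacles(sensors):  return "turn away" if sensors.get("obstacle") else None
def explore(sensors):          return "head to frontier" if sensors.get("bored") else None
def build_maps(sensors):       return None   # passive: only updates an internal map
def manipulate_world(sensors): return "grasp" if sensors.get("object_in_reach") else None

# Ordered from highest to lowest priority; the first active behavior wins.
LAYERS = [manipulate_world, build_maps, explore, avoid_obstacles, locomote]

def act(sensors):
    for behavior in LAYERS:
        command = behavior(sensors)
        if command is not None:
            return command
    return "idle"

print(act({"obstacle": True}))   # -> "turn away"
```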
SSA: Structure learning
        issues
• How should we represent structure?
  – Sufficient (problem space can be
    covered)
  – Tractable (small hypothesis space)
  – Well-defined credit assignment
• How should we assign credit to the
  architecture?
SSA: Structure learning
        issues
• Purely parallel structure
  – Is it the most plausible choice
    (regarding SSA-BBS assumptions)?
• Some different representations
  – Behavior learning
  – Behavior/Layer learning
  – Order learning
SSA: Behavior learning
        issues
• Reinforcement signal decomposition:
  each behavior has its own reward function
• Reinforcement signal design: how should
  we transform our desires into a reward function?
  – Reward Shaping (see the formula after this slide)
  – Emotional Learning
  – …?

• Hierarchical Credit Assignment
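For the Reward Shaping item above, one standard policy-preserving recipe (potential-based shaping, due to Ng, Harada, and Russell; my addition, not claimed by the slides) adds a potential term to the designed reward:

```latex
r'(s, a, s') = r(s, a, s') + \gamma\, \Phi(s') - \Phi(s).
```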
SSA: Structure Learning
       example
•   Suppose we have correct behaviors and want to
    arrange them in an architecture in order to
    maximize a specific behavior
•   Subjective evaluation: We want to lift an object to a
    specific height while its slope does not become too high.
•   Objective evaluation: How should we design it?!
SSA: Structure Learning
       example
SSA: Structure Learning
       example
