Hierarchical
Reinforcement Learning
   Amir massoud Farahmand
   Farahmand@SoloGen.net
Markov Decision
         Problems
• Markov Process: formulates a wide
  range of dynamical systems
• Goal: find a policy that optimizes an
  objective function
• [Stochastic] Dynamic Programming
• Planning: Known environment
• Learning: Unknown environment
MDP
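This slide's formulas are not in the extract; as a reminder of the standard formulation (my reconstruction, not the original slide content), an MDP and its Bellman optimality equation are

```latex
M = \langle S, A, P, R, \gamma \rangle, \qquad
V^{*}(s) = \max_{a \in A}\Big[ R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a)\, V^{*}(s') \Big].
```

Dynamic Programming solves this fixed point when P and R are known (planning); learning addresses the case where they are unknown.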
Reinforcement Learning
           (1)
• A very important Machine Learning
  method
• An approximate, online solution
  method for MDPs
  – Monte Carlo method
  – Stochastic Approximation
  – [Function Approximation]
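As a concrete illustration of the stochastic-approximation ingredient (my addition, not from the slide), the tabular TD(0) update replaces the Bellman expectation with a single sampled transition:

```latex
V(s_t) \leftarrow V(s_t) + \alpha_t \big[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big].
```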
Reinforcement Learning
          (2)
• Q-Learning and SARSA are among the
  most important RL algorithms (update rules
  sketched below)
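A minimal Python sketch of the two update rules (illustrative only; the tabular Q, states, and actions here are hypothetical, not the presenter's code):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Q-Learning (off-policy): bootstrap from the greedy action in s_next."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """SARSA (on-policy): bootstrap from the action actually taken in s_next."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

# Tiny 3-state, 2-action example table:
Q = np.zeros((3, 2))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
```

The only difference is the bootstrap target: the greedy maximum (Q-Learning) versus the next action actually chosen by the behavior policy (SARSA).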
Curses of DP
• Curse of Modeling
  – RL solves this problem
• Curse of Dimensionality
  – Value function approximation
  – Hierarchical methods
Hierarchical RL (1)
• Use some kind of hierarchy in order to …
  – Learn faster
  – Need fewer values to be updated (smaller
    storage requirements)
  – Incorporate the designer’s a priori knowledge
  – Increase reusability
  – Have a more meaningful structure than a mere
    Q-table
Hierarchical RL (2)
• Is there any unified meaning of
  hierarchy? NO!
• Different methods:
  –   Temporal abstraction
  –   State abstraction
  –   Behavioral decomposition
  –   …
Hierarchical RL (3)
•   Feudal Q-Learning [Dayan, Hinton]
•   Options [Sutton, Precup, Singh]
•   MaxQ [Dietterich]
•   HAM [Russell, Parr, Andre]
•   HexQ [Hengst]
•   Weakly-Coupled MDP [Bernstein, Dean & Lin, …]
•   Structure Learning in SSA [Farahmand, Nili]
•   …
Feudal Q-Learning
•   Divide each task into a few smaller sub-tasks
•   A state abstraction method
•   Different layers of managers
•   Each manager takes orders from its super-manager
    and gives orders to its sub-managers

                 Super-Manager
                       |
             Manager 1   Manager 2
                       |
          Sub-Manager 1   Sub-Manager 2
Feudal Q-Learning
• Principles of Feudal Q-Learning
   – Reward Hiding: Managers must reward sub-managers for
     doing their bidding whether or not this satisfies the commands
     of the super-managers. Sub-managers should just learn to obey
     their managers and leave it up to them to determine what it is
     best to do at the next level up.
   – Information Hiding: Managers only need to know the state
     of the system at the granularity of their own choices of tasks.
     Indeed, allowing some decision making to take place at a
     coarser grain is one of the main goals of the hierarchical
     decomposition. Information is hidden both downwards (sub-managers
     do not know the task the super-manager has set the manager) and
     upwards (a super-manager does not know what choices its manager
     has made to satisfy its command).
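A toy Python sketch of the two principles; every name here (the sub-goal test, the coarse-graining) is a hypothetical illustration, not part of Dayan and Hinton's formulation:

```python
def subgoal_reached(command, abstract_state):
    # Hypothetical check: the commanded abstract state was reached.
    return abstract_state == command

def submanager_reward(command, next_abstract_state, env_reward):
    """Reward hiding: the sub-manager is rewarded for obeying its manager,
    regardless of the external reward that higher levels care about."""
    del env_reward  # deliberately ignored by the sub-manager
    return 1.0 if subgoal_reached(command, next_abstract_state) else 0.0

def manager_view(full_state, granularity=4):
    """Information hiding: a manager observes the state only at the coarse
    granularity of its own choices of tasks."""
    return tuple(x // granularity for x in full_state)

print(submanager_reward(command=(1, 2), next_abstract_state=(1, 2), env_reward=-5.0))  # 1.0
print(manager_view((13, 7)))  # (3, 1)
```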
Feudal Q-Learning
Feudal Q-Learning
Options: Introduction
• People make decisions at
  different time scales
  – Traveling example
• It is desirable to have a method that
  supports such temporally extended
  actions across different time scales
Options: Concept
• Macro-actions
• Temporal abstraction method of Hierarchical RL
• Options are temporally extended actions, each
  consisting of a sequence of primitive
  actions
• Example:
   – Primitive actions: walking NSWE
   – Options: go to {door, corner, table, straight}
      • Options can be open-loop or closed-loop
• Semi-Markov Decision Process Theory [Puterman]
Options: Formal Definitions
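The equations of this slide are not in the extract; in the standard options framework of Sutton, Precup, and Singh, an option is the triple

```latex
o = \langle \mathcal{I}, \pi, \beta \rangle, \qquad
\mathcal{I} \subseteq S, \quad \pi : S \times A \to [0,1], \quad \beta : S \to [0,1],
```

where \mathcal{I} is the initiation set, \pi the intra-option policy, and \beta(s) the probability of terminating in state s.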
Options: Rise of SMDP!
• Theorem: MDP + Options = SMDP
Options: Value function
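A hedged reconstruction of the missing formula: for a policy \mu over options, the option-value function is

```latex
Q^{\mu}(s, o) = \mathbb{E}\Big[ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k}
  + \gamma^{k}\, V^{\mu}(s_{t+k}) \;\Big|\; s_t = s,\ o \text{ initiated at } t \Big],
```

where k is the (random) number of steps until the option terminates.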
Options:
Bellman-like optimality condition
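Again reconstructed from the standard framework rather than the slide itself: over an option set \mathcal{O},

```latex
Q^{*}_{\mathcal{O}}(s, o) = \mathbb{E}\Big[ r + \gamma^{k}
  \max_{o' \in \mathcal{O}_{s'}} Q^{*}_{\mathcal{O}}(s', o') \;\Big|\; s,\ o \Big],
```

with r the discounted reward accumulated while executing o and k its duration.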
Options: A simple example
Options: A simple example
Options: A simple example
Interrupting Options
• An option’s policy is followed until the option
  terminates.
• This condition is somewhat unnecessary
  – You may change your decision in the middle of
    executing your previous decision.
• Interruption Theorem: interrupting an option whenever
  switching has a higher value performs at least as well.
  Yes! It is better!
Interrupting Options:
     An example
Options: Other issues
• Intra-option {model, value} learning
  (update sketched below)
• Learning each option
  – Defining a sub-goal reward function
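A hedged reconstruction of the intra-option value-learning update (for Markov options, applied at every step whose executed primitive action agrees with option o's policy):

```latex
Q(s, o) \leftarrow Q(s, o) + \alpha \Big[ r + \gamma \big( (1 - \beta(s'))\, Q(s', o)
  + \beta(s') \max_{o'} Q(s', o') \big) - Q(s, o) \Big].
```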
MaxQ
• MaxQ Value Function Decomposition
• Loosely related to Feudal Q-Learning
• Decomposes the value function over a
  hierarchical task structure
MaxQ
MaxQ: Value decomposition
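The decomposition formulas are missing from the extract; in Dietterich's notation, for subtask i with child action a and completion function C,

```latex
Q(i, s, a) = V(a, s) + C(i, s, a), \qquad
V(i, s) =
\begin{cases}
\max_{a} Q(i, s, a) & i \text{ composite},\\
\sum_{s'} P(s' \mid s, i)\, R(s' \mid s, i) & i \text{ primitive},
\end{cases}
```

so the root value is the sum of the expected rewards earned inside each subtask plus the completion values along the path down the hierarchy.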
MaxQ: Existence theorem



• Recursively optimal policy.
• There may be many recursively optimal policies with different
  value functions.
• A recursively optimal policy is not necessarily an optimal policy.
• If H is a stationary macro hierarchy for MDP M, then all
  recursively optimal policies w.r.t. M and H have the same value.
MaxQ: Learning



• Theorem: If M is an MDP, H is a stationary macro hierarchy, the
  exploration policy is GLIE (Greedy in the Limit with Infinite
  Exploration), and the usual convergence conditions hold (bounded V
  and C, conditions on the learning rates α …), then with probability 1
  the algorithm MaxQ-0 will converge! (A recursive sketch follows.)
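A simplified, hedged Python sketch of the recursive MaxQ-0 algorithm (after Dietterich); the task hierarchy and the helpers is_primitive, children, terminated, and execute_primitive are hypothetical placeholders, not the presenter's code:

```python
import random

GAMMA, ALPHA, EPSILON = 0.99, 0.1, 0.1

def q_value(i, s, a, V, C):
    # Decomposed action value: Q(i, s, a) = V(a, s) + C(i, s, a)
    return V.get((a, s), 0.0) + C.get((i, s, a), 0.0)

def maxq0(i, s, env, V, C):
    """Execute subtask i from state s; return (next_state, primitive steps taken)."""
    if is_primitive(i):
        s_next, r = execute_primitive(env, i, s)
        V[(i, s)] = (1 - ALPHA) * V.get((i, s), 0.0) + ALPHA * r
        return s_next, 1

    steps = 0
    while not terminated(i, s):
        acts = children(i)
        if random.random() < EPSILON:            # GLIE-style exploration
            a = random.choice(acts)
        else:
            a = max(acts, key=lambda b: q_value(i, s, b, V, C))
        s_next, n = maxq0(a, s, env, V, C)       # recursively run the child
        best = max(q_value(i, s_next, b, V, C) for b in acts)
        # Completion-function update for subtask i
        C[(i, s, a)] = (1 - ALPHA) * C.get((i, s, a), 0.0) + ALPHA * (GAMMA ** n) * best
        steps += n
        s = s_next
    return s, steps
```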
MaxQ
• Faster learning: all-states updating
  – Similar to Kaelbling’s “all-goals
    updating”
MaxQ
MaxQ: State abstraction
• Advantages
  – Memory reduction
  – Less exploration is needed
  – Increased reusability, as a subtask does not
    depend on its ancestors in the hierarchy
• Is it possible?!
MaxQ: State abstraction
• Exact preservation of value function
• Approximate preservation
MaxQ: State abstraction
• Does it converge?
  – It has not been proved formally yet.
• What can we do if we want to use an
  abstraction that violates Theorem 3?
  – Reward function decomposition
    • Design a reward function that reinforces
      the responsible parts of the architecture.
MaxQ: Other issues
• Undesired terminal states
• Non-hierarchical execution (polling
  execution)
  – Better performance
  – Computationally intensive
Learning in Subsumption
      Architecture
• Structure learning
  – How should behaviors be arranged in the
    architecture?
• Behavior learning
  – How should a single behavior act?
• Structure/Behavior learning
SSA: Purely Parallel Case
  [Figure: purely parallel case. The sensors feed all behaviors
   in parallel: manipulate the world, build maps, explore,
   avoid obstacles, locomote.]
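A toy Python sketch of this purely parallel arrangement with a hypothetical fixed-priority arbiter: every behavior reads the sensors, and a higher behavior's output suppresses the lower ones when it is active. The behavior bodies are placeholders; only the suppression scheme is the point.

```python
def locomote(sensors):         return "move forward"
def avoid_obstacles(sensors):  return "turn away" if sensors.get("obstacle") else None
def explore(sensors):          return "head to frontier" if sensors.get("bored") else None
def build_maps(sensors):       return None   # passive: only updates an internal map
def manipulate_world(sensors): return "grasp" if sensors.get("object_in_reach") else None

# Ordered from highest to lowest priority; the first active behavior wins.
LAYERS = [manipulate_world, build_maps, explore, avoid_obstacles, locomote]

def act(sensors):
    for behavior in LAYERS:
        command = behavior(sensors)
        if command is not None:
            return command
    return "idle"

print(act({"obstacle": True}))   # -> "turn away"
```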
SSA: Structure learning
        issues
• How should we represent structure?
  – Sufficient (problem space can be
    covered)
  – Tractable (small hypothesis space)
  – Well-defined credit assignment
• How should we assign credit to the
  architecture?
SSA: Structure learning
        issues
• Purely parallel structure
  – Is it the most plausible choice
    (regarding SSA-BBS assumptions)?
• Some different representations
  – Behavior learning
  – Behavior/Layer learning
  – Order learning
SSA: Behavior learning
        issues
• Reinforcement signal decomposition:
  each behavior has its own reward function
• Reinforcement signal design: how should
  we transform our desires into a reward function?
  – Reward Shaping (see the formula after this slide)
  – Emotional Learning
  – …?

• Hierarchical Credit Assignment
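For the Reward Shaping item above, one standard policy-preserving recipe (potential-based shaping, due to Ng, Harada, and Russell; my addition, not claimed by the slides) adds a potential term to the designed reward:

```latex
r'(s, a, s') = r(s, a, s') + \gamma\, \Phi(s') - \Phi(s).
```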
SSA: Structure Learning
       example
•   Suppose we have correct behaviors and want to
    arrange them in an architecture in order to
    maximize a specific behavior
•   Subjective evaluation: We want to lift an object to a
    specific height while its slope does not become too high.
•   Objective evaluation: How should we design it?!
SSA: Structure Learning
       example
SSA: Structure Learning
       example
