Inverse Reinforcement Learning
CS 285: Deep Reinforcement Learning, Decision Making, and Control
Sergey Levine
Today’s Lecture
1. So far: manually design a reward function to define a task
2. What if we want to learn the reward function from observing an
expert, and then use reinforcement learning?
3. Apply approximate optimality model from last week, but now
learn the reward!
• Goals:
• Understand the inverse reinforcement learning problem definition
• Understand how probabilistic models of behavior can be used to derive
inverse reinforcement learning algorithms
• Understand a few practical inverse reinforcement learning algorithms we
can use
Optimal Control as a Model of Human Behavior
(figure examples: Mombaur et al. ’09; Muybridge, c. 1870; Ziebart ’08; Li & Todorov ’06)
optimize this to explain the data
Why should we worry about learning rewards?
The imitation learning perspective
Standard imitation learning:
• copy the actions performed by the expert
• no reasoning about outcomes of actions
Human imitation learning:
• copy the intent of the expert
• might take very different actions!
Why should we worry about learning rewards?
The reinforcement learning perspective
what is the reward?
Inverse reinforcement learning
Infer reward functions from demonstrations
by itself, this is an underspecified problem
many reward functions can explain the same behavior
A bit more formally
“forward” reinforcement learning vs. inverse reinforcement learning: infer the reward parameters from expert demonstrations
Feature matching IRL
still ambiguous!
Feature matching IRL & maximum margin
Issues:
• Maximizing the margin is a bit arbitrary
• No clear model of expert suboptimality (can add slack variables…)
• Messy constrained optimization problem – not great for deep learning!
Further reading:
• Abbeel & Ng: Apprenticeship learning via inverse reinforcement learning
• Ratliff et al: Maximum margin planning
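The feature-matching and max-margin formulas on this slide live in the figures; as a hedged reconstruction (the standard formulation from Abbeel & Ng and Ratliff et al., not necessarily the exact notation used on the slide):

```latex
% Feature matching: pick reward weights \psi so the policy that is optimal
% under r_\psi matches the expert's expected feature counts
\mathbb{E}_{\pi^{r_\psi}}\!\left[\mathbf{f}(s,a)\right] \;=\; \mathbb{E}_{\pi^{*}}\!\left[\mathbf{f}(s,a)\right]

% Maximum margin (SVM-style trick, margin scaled by a dissimilarity D):
\min_{\psi}\ \tfrac{1}{2}\lVert\psi\rVert^{2}
\quad\text{s.t.}\quad
\psi^{\top}\mathbb{E}_{\pi^{*}}\!\left[\mathbf{f}(s,a)\right] \;\ge\;
\max_{\pi\in\Pi}\ \psi^{\top}\mathbb{E}_{\pi}\!\left[\mathbf{f}(s,a)\right] + D(\pi^{*},\pi)
```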
Optimal Control as a Model of Human Behavior
(figure examples: Mombaur et al. ’09; Muybridge, c. 1870; Ziebart ’08; Li & Todorov ’06)
A probabilistic graphical model of decision making
no assumption of optimal behavior!
Learning the optimality variable: p(O_t | s_t, a_t) ∝ exp(r_ψ(s_t, a_t)), with reward parameters ψ to be learned
The IRL partition function
Estimating the expectation
Estimating the expectation
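The derivation itself is in the slide figures; a hedged reconstruction of the key MaxEnt IRL expressions (following Ziebart et al. ’08):

```latex
% Log-likelihood of the demonstrations under the learned reward r_\psi
\mathcal{L}(\psi) \;=\; \frac{1}{N}\sum_{i=1}^{N} r_\psi(\tau_i) \;-\; \log Z,
\qquad
Z \;=\; \int \exp\!\big(r_\psi(\tau)\big)\, d\tau

% Gradient: expert expectation minus expectation under the soft-optimal
% trajectory distribution induced by the current reward
\nabla_\psi \mathcal{L}
\;=\; \mathbb{E}_{\tau\sim\pi^{*}}\!\big[\nabla_\psi r_\psi(\tau)\big]
\;-\; \mathbb{E}_{\tau\sim p(\tau\mid\mathcal{O}_{1:T},\,\psi)}\!\big[\nabla_\psi r_\psi(\tau)\big]

% The second term decomposes over state-action visitation frequencies \mu_t
\mathbb{E}_{\tau\sim p(\tau\mid\mathcal{O}_{1:T},\,\psi)}\!\big[\nabla_\psi r_\psi(\tau)\big]
\;=\; \sum_{t=1}^{T} \mathbb{E}_{(s_t,a_t)\sim\mu_t}\!\big[\nabla_\psi r_\psi(s_t,a_t)\big]
```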
The MaxEnt IRL algorithm
Why MaxEnt?
Ziebart et al. 2008: Maximum Entropy Inverse Reinforcement Learning
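As a concrete illustration, a minimal sketch of one tabular MaxEnt IRL update, assuming a small discrete MDP with known dynamics, a linear reward, and horizon T; all function and variable names here are made up for this sketch:

```python
import numpy as np

def maxent_irl_step(P, features, psi, demos, T, lr=0.01):
    """One tabular MaxEnt IRL gradient step on reward weights psi.

    P        -- dynamics, shape (S, A, S): P[s, a, s'] = p(s' | s, a)
    features -- state-action features, shape (S, A, F)
    psi      -- reward weights, shape (F,)
    demos    -- list of trajectories of (s, a) pairs, each of length T
    """
    S, A, _ = P.shape
    r = features @ psi                          # (S, A) reward table

    # Backward pass: soft value iteration -> time-indexed soft-optimal policies.
    V = np.zeros(S)
    policies = []
    for _ in range(T):
        Q = r + P @ V                           # soft Q-values, (S, A)
        V = np.log(np.exp(Q).sum(axis=1))       # soft max over actions
        policies.append(np.exp(Q - V[:, None])) # pi(a|s) proportional to exp(advantage)
    policies = policies[::-1]                   # policies[t] used at time t

    # Forward pass: accumulate expected features under visitation frequencies mu_t.
    d = np.zeros(S)
    for traj in demos:                          # empirical initial-state distribution
        d[traj[0][0]] += 1.0 / len(demos)
    expected_f = np.zeros_like(psi)
    for t in range(T):
        mu = d[:, None] * policies[t]           # (S, A) visitation at time t
        expected_f += (mu[:, :, None] * features).sum(axis=(0, 1))
        d = np.einsum('sa,sap->p', mu, P)       # propagate the state distribution

    # Expert feature expectations from the demonstrations.
    expert_f = np.zeros_like(psi)
    for traj in demos:
        for (s, a) in traj:
            expert_f += features[s, a] / len(demos)

    # Gradient = expert expectation - model expectation; ascend the likelihood.
    return psi + lr * (expert_f - expected_f)
```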
Break
What’s missing so far?
• MaxEnt IRL so far requires…
• Solving for the (soft) optimal policy in the inner loop
• Enumerating all state-action tuples for visitation frequency and gradient
• To apply this in practical problem settings, we need to handle…
• Large and continuous state and action spaces
• States obtained via sampling only
• Unknown dynamics
What’s missing so far?
Unknown dynamics & large state/action spaces
Assume we don’t know the dynamics, but we can sample, like in standard RL
More efficient sample-based updates
Importance sampling
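The sampled estimator is shown on the slide as a figure; a hedged reconstruction of the importance-sampled gradient (the form used in guided cost learning) is:

```latex
% Sample trajectories \tau_j from the current policy \pi and reweight them:
\nabla_\psi \mathcal{L}
\;\approx\; \frac{1}{N}\sum_{i=1}^{N}\nabla_\psi r_\psi(\tau_i)
\;-\; \frac{1}{\sum_{j} w_j}\sum_{j=1}^{M} w_j\,\nabla_\psi r_\psi(\tau_j),
\qquad
w_j \;=\; \frac{\exp\!\big(r_\psi(\tau_j)\big)}{\pi(\tau_j)}
```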
Guided cost learning algorithm (Finn et al. ICML ’16) — alternate between:
• generating policy samples from π
• updating the reward using samples & demos
• updating π w.r.t. the reward
(slides adapted from C. Finn)
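A minimal sketch of that alternation, with hypothetical helper callables (`sample_trajectories`, `reward_grad_step`, `policy_grad_step`) standing in for the actual sampling, importance-weighted reward update, and policy update; none of these names come from the original paper:

```python
def guided_cost_learning(demos, policy, reward,
                         sample_trajectories, reward_grad_step, policy_grad_step,
                         n_iters=200, n_samples=16):
    """Alternate between updating the learned reward and the sampling policy.

    demos  -- expert demonstration trajectories
    policy -- current sampling policy pi
    reward -- learned reward r_psi
    The three callables are hypothetical stand-ins for the real sub-steps.
    """
    for _ in range(n_iters):
        # 1) Generate policy samples from pi (used as "negatives").
        samples = sample_trajectories(policy, n_samples)

        # 2) Update the reward using samples & demos, with importance
        #    weights w_j = exp(r_psi(tau_j)) / pi(tau_j).
        reward = reward_grad_step(reward, demos, samples, policy)

        # 3) Update pi with respect to the current reward (one or a few
        #    steps of any max-entropy / policy-gradient RL method).
        policy = policy_grad_step(policy, reward, samples)

    return reward, policy
```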
It looks a bit like a game: the policy π generates samples that the learned reward must distinguish from the demonstrations
Generative Adversarial Networks
Goodfellow et al. ‘14
Isola et al. ‘17
Arjovsky et al. ‘17
Zhu et al. ‘17
Inverse RL as a GAN
Finn*, Christiano* et al. “A Connection Between Generative Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models.”
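Restated from the cited paper (from memory, so treat the exact form as a sketch): the discriminator is parameterized with the learned reward plugged into the optimal-discriminator shape, and optimizing it recovers the MaxEnt IRL objective:

```latex
D_\psi(\tau)
\;=\; \frac{\tfrac{1}{Z}\exp\!\big(r_\psi(\tau)\big)}
           {\tfrac{1}{Z}\exp\!\big(r_\psi(\tau)\big) \;+\; \pi(\tau)}
```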
Generalization via inverse RL
demonstration → reproduce the behavior under different conditions
What can we learn from the demonstration to enable better transfer? We need to decouple the goal from the dynamics: policy = reward + dynamics.
Fu et al. Learning Robust Rewards with Adversarial Inverse Reinforcement Learning
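Restating the AIRL discriminator from Fu et al. (from memory; treat the exact form as a sketch): the discriminator splits into a reward term and a shaping/potential term, which is what lets the reward be disentangled from the dynamics:

```latex
D_{\psi,\phi}(s,a,s')
\;=\; \frac{\exp\!\big(f_{\psi,\phi}(s,a,s')\big)}
           {\exp\!\big(f_{\psi,\phi}(s,a,s')\big) \;+\; \pi(a\mid s)},
\qquad
f_{\psi,\phi}(s,a,s') \;=\; g_\psi(s,a) \;+\; \gamma\,h_\phi(s') \;-\; h_\phi(s)
```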
Can we just use a regular discriminator?
Ho & Ermon. Generative adversarial imitation learning.
Pros & cons:
+ often simpler to set up optimization, fewer moving parts
- discriminator knows nothing at convergence
- generally cannot reoptimize the “reward”
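A minimal sketch of the “regular discriminator” alternative (GAIL-style), assuming PyTorch, simple state-action inputs, and the illustrative surrogate reward -log(1 - D); the architecture and names are assumptions for this sketch, not the exact setup from Ho & Ermon:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Binary classifier D(s, a): probability that (s, a) came from the expert."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))   # logits

def discriminator_step(disc, opt, expert_batch, policy_batch):
    """One binary cross-entropy update: expert pairs = 1, policy samples = 0."""
    bce = nn.BCEWithLogitsLoss()
    expert_logits = disc(*expert_batch)
    policy_logits = disc(*policy_batch)
    loss = (bce(expert_logits, torch.ones_like(expert_logits)) +
            bce(policy_logits, torch.zeros_like(policy_logits)))
    opt.zero_grad()
    loss.backward()
    opt.step()

def surrogate_reward(disc, obs, act):
    """Policy 'reward' read off the discriminator. Note: it carries no
    information at convergence (D -> 1/2) and generally can't be re-optimized."""
    with torch.no_grad():
        d = torch.sigmoid(disc(obs, act))
    return -torch.log(1.0 - d + 1e-8)
```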
IRL as adversarial optimization
Generative Adversarial Imitation Learning (Ho & Ermon, NIPS 2016; Hausman, Chebotar, Schaal, Sukhatme, Lim; Peng, Kanazawa, Toyer, Abbeel, Levine): robot attempt → classifier
Guided Cost Learning (Finn et al., ICML 2016): robot attempt → reward function
These are actually the same thing!
Suggested Reading on Inverse RL
Classic Papers:
Abbeel & Ng ICML ’04. Apprenticeship Learning via Inverse Reinforcement
Learning. Good introduction to inverse reinforcement learning
Ziebart et al. AAAI ’08. Maximum Entropy Inverse Reinforcement Learning.
Introduction to probabilistic method for inverse reinforcement learning
Modern Papers:
Finn et al. ICML ’16. Guided Cost Learning. Sampling-based method for
MaxEnt IRL that handles unknown dynamics and deep reward functions
Wulfmeier et al. arXiv ’16. Deep Maximum Entropy Inverse Reinforcement
Learning. MaxEnt inverse RL using deep reward functions
Ho & Ermon NIPS ’16. Generative Adversarial Imitation Learning. Inverse RL
method using generative adversarial networks
Fu, Luo, Levine ICLR ‘18. Learning Robust Rewards with Adversarial Inverse
Reinforcement Learning