Inverse Reinforcement Learning
CS 285: Deep Reinforcement Learning, Decision Making, and Control
Sergey Levine
Today’s Lecture
1. So far: manually design a reward function to define a task
2. What if we want to learn the reward function from observing an
expert, and then use reinforcement learning?
3. Apply approximate optimality model from last week, but now
learn the reward!
• Goals:
• Understand the inverse reinforcement learning problem definition
• Understand how probabilistic models of behavior can be used to derive
inverse reinforcement learning algorithms
• Understand a few practical inverse reinforcement learning algorithms we
can use
Optimal Control as a Model of Human Behavior
(figure examples: Mombaur et al. ’09; Muybridge, c. 1870; Ziebart ’08; Li & Todorov ’06)
optimize this to explain the data
Why should we worry about learning rewards?
The imitation learning perspective
Standard imitation learning:
• copy the actions performed by the expert
• no reasoning about outcomes of actions
Human imitation learning:
• copy the intent of the expert
• might take very different actions!
Why should we worry about learning rewards?
The reinforcement learning perspective
what is the reward?
Inverse reinforcement learning
Infer reward functions from demonstrations
by itself, this is an underspecified problem
many reward functions can explain the same behavior
A bit more formally
“forward” reinforcement learning vs. inverse reinforcement learning: infer the reward parameters from expert demonstrations
Feature matching IRL
still ambiguous!
Feature matching IRL & maximum margin
Issues:
• Maximizing the margin is a bit arbitrary
• No clear model of expert suboptimality (can add slack variables…)
• Messy constrained optimization problem – not great for deep learning!
Further reading:
• Abbeel & Ng: Apprenticeship learning via inverse reinforcement learning
• Ratliff et al: Maximum margin planning
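The feature-matching and max-margin formulas on this slide live in the figures; as a hedged reconstruction (the standard formulation from Abbeel & Ng and Ratliff et al., not necessarily the exact notation used on the slide):

```latex
% Feature matching: pick reward weights \psi so the policy that is optimal
% under r_\psi matches the expert's expected feature counts
\mathbb{E}_{\pi^{r_\psi}}\!\left[\mathbf{f}(s,a)\right] \;=\; \mathbb{E}_{\pi^{*}}\!\left[\mathbf{f}(s,a)\right]

% Maximum margin (SVM-style trick, margin scaled by a dissimilarity D):
\min_{\psi}\ \tfrac{1}{2}\lVert\psi\rVert^{2}
\quad\text{s.t.}\quad
\psi^{\top}\mathbb{E}_{\pi^{*}}\!\left[\mathbf{f}(s,a)\right] \;\ge\;
\max_{\pi\in\Pi}\ \psi^{\top}\mathbb{E}_{\pi}\!\left[\mathbf{f}(s,a)\right] + D(\pi^{*},\pi)
```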
Optimal Control as a Model of Human Behavior
(figure examples: Mombaur et al. ’09; Muybridge, c. 1870; Ziebart ’08; Li & Todorov ’06)
A probabilistic graphical model of decision making
no assumption of optimal behavior!
Learning the optimality variable: p(O_t | s_t, a_t) ∝ exp(r_ψ(s_t, a_t)), with reward parameters ψ to be learned
The IRL partition function
Estimating the expectation
Estimating the expectation
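The derivation itself is in the slide figures; a hedged reconstruction of the key MaxEnt IRL expressions (following Ziebart et al. ’08):

```latex
% Log-likelihood of the demonstrations under the learned reward r_\psi
\mathcal{L}(\psi) \;=\; \frac{1}{N}\sum_{i=1}^{N} r_\psi(\tau_i) \;-\; \log Z,
\qquad
Z \;=\; \int \exp\!\big(r_\psi(\tau)\big)\, d\tau

% Gradient: expert expectation minus expectation under the soft-optimal
% trajectory distribution induced by the current reward
\nabla_\psi \mathcal{L}
\;=\; \mathbb{E}_{\tau\sim\pi^{*}}\!\big[\nabla_\psi r_\psi(\tau)\big]
\;-\; \mathbb{E}_{\tau\sim p(\tau\mid\mathcal{O}_{1:T},\,\psi)}\!\big[\nabla_\psi r_\psi(\tau)\big]

% The second term decomposes over state-action visitation frequencies \mu_t
\mathbb{E}_{\tau\sim p(\tau\mid\mathcal{O}_{1:T},\,\psi)}\!\big[\nabla_\psi r_\psi(\tau)\big]
\;=\; \sum_{t=1}^{T} \mathbb{E}_{(s_t,a_t)\sim\mu_t}\!\big[\nabla_\psi r_\psi(s_t,a_t)\big]
```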
The MaxEnt IRL algorithm
Why MaxEnt?
Ziebart et al. 2008: Maximum Entropy Inverse Reinforcement Learning
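As a concrete illustration, a minimal sketch of one tabular MaxEnt IRL update, assuming a small discrete MDP with known dynamics, a linear reward, and horizon T; all function and variable names here are made up for this sketch:

```python
import numpy as np

def maxent_irl_step(P, features, psi, demos, T, lr=0.01):
    """One tabular MaxEnt IRL gradient step on reward weights psi.

    P        -- dynamics, shape (S, A, S): P[s, a, s'] = p(s' | s, a)
    features -- state-action features, shape (S, A, F)
    psi      -- reward weights, shape (F,)
    demos    -- list of trajectories of (s, a) pairs, each of length T
    """
    S, A, _ = P.shape
    r = features @ psi                          # (S, A) reward table

    # Backward pass: soft value iteration -> time-indexed soft-optimal policies.
    V = np.zeros(S)
    policies = []
    for _ in range(T):
        Q = r + P @ V                           # soft Q-values, (S, A)
        V = np.log(np.exp(Q).sum(axis=1))       # soft max over actions
        policies.append(np.exp(Q - V[:, None])) # pi(a|s) proportional to exp(advantage)
    policies = policies[::-1]                   # policies[t] used at time t

    # Forward pass: accumulate expected features under visitation frequencies mu_t.
    d = np.zeros(S)
    for traj in demos:                          # empirical initial-state distribution
        d[traj[0][0]] += 1.0 / len(demos)
    expected_f = np.zeros_like(psi)
    for t in range(T):
        mu = d[:, None] * policies[t]           # (S, A) visitation at time t
        expected_f += (mu[:, :, None] * features).sum(axis=(0, 1))
        d = np.einsum('sa,sap->p', mu, P)       # propagate the state distribution

    # Expert feature expectations from the demonstrations.
    expert_f = np.zeros_like(psi)
    for traj in demos:
        for (s, a) in traj:
            expert_f += features[s, a] / len(demos)

    # Gradient = expert expectation - model expectation; ascend the likelihood.
    return psi + lr * (expert_f - expected_f)
```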
Break
What’s missing so far?
• MaxEnt IRL so far requires…
• Solving for the (soft) optimal policy in the inner loop
• Enumerating all state-action tuples for visitation frequency and gradient
• To apply this in practical problem settings, we need to handle…
• Large and continuous state and action spaces
• States obtained via sampling only
• Unknown dynamics
What’s missing so far?
Unknown dynamics & large state/action spaces
Assume we don’t know the dynamics, but we can sample, like in standard RL
More efficient sample-based updates
Importance sampling
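The sampled estimator is shown on the slide as a figure; a hedged reconstruction of the importance-sampled gradient (the form used in guided cost learning) is:

```latex
% Sample trajectories \tau_j from the current policy \pi and reweight them:
\nabla_\psi \mathcal{L}
\;\approx\; \frac{1}{N}\sum_{i=1}^{N}\nabla_\psi r_\psi(\tau_i)
\;-\; \frac{1}{\sum_{j} w_j}\sum_{j=1}^{M} w_j\,\nabla_\psi r_\psi(\tau_j),
\qquad
w_j \;=\; \frac{\exp\!\big(r_\psi(\tau_j)\big)}{\pi(\tau_j)}
```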
Guided cost learning algorithm (Finn et al. ICML ’16) — alternate between:
• generating policy samples from π
• updating the reward using samples & demos
• updating π w.r.t. the reward
(slides adapted from C. Finn)
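A minimal sketch of that alternation, with hypothetical helper callables (`sample_trajectories`, `reward_grad_step`, `policy_grad_step`) standing in for the actual sampling, importance-weighted reward update, and policy update; none of these names come from the original paper:

```python
def guided_cost_learning(demos, policy, reward,
                         sample_trajectories, reward_grad_step, policy_grad_step,
                         n_iters=200, n_samples=16):
    """Alternate between updating the learned reward and the sampling policy.

    demos  -- expert demonstration trajectories
    policy -- current sampling policy pi
    reward -- learned reward r_psi
    The three callables are hypothetical stand-ins for the real sub-steps.
    """
    for _ in range(n_iters):
        # 1) Generate policy samples from pi (used as "negatives").
        samples = sample_trajectories(policy, n_samples)

        # 2) Update the reward using samples & demos, with importance
        #    weights w_j = exp(r_psi(tau_j)) / pi(tau_j).
        reward = reward_grad_step(reward, demos, samples, policy)

        # 3) Update pi with respect to the current reward (one or a few
        #    steps of any max-entropy / policy-gradient RL method).
        policy = policy_grad_step(policy, reward, samples)

    return reward, policy
```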
It looks a bit like a game: the policy π generates samples that the learned reward must distinguish from the demonstrations
Generative Adversarial Networks
Goodfellow et al. ‘14
Isola et al. ‘17
Arjovsky et al. ‘17
Zhu et al. ‘17
Inverse RL as a GAN
Finn*, Christiano* et al. “A Connection Between Generative Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models.”
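Restated from the cited paper (from memory, so treat the exact form as a sketch): the discriminator is parameterized with the learned reward plugged into the optimal-discriminator shape, and optimizing it recovers the MaxEnt IRL objective:

```latex
D_\psi(\tau)
\;=\; \frac{\tfrac{1}{Z}\exp\!\big(r_\psi(\tau)\big)}
           {\tfrac{1}{Z}\exp\!\big(r_\psi(\tau)\big) \;+\; \pi(\tau)}
```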
Generalization via inverse RL
demonstration → reproduce the behavior under different conditions
What can we learn from the demonstration to enable better transfer? We need to decouple the goal from the dynamics: policy = reward + dynamics.
Fu et al. Learning Robust Rewards with Adversarial Inverse Reinforcement Learning
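Restating the AIRL discriminator from Fu et al. (from memory; treat the exact form as a sketch): the discriminator splits into a reward term and a shaping/potential term, which is what lets the reward be disentangled from the dynamics:

```latex
D_{\psi,\phi}(s,a,s')
\;=\; \frac{\exp\!\big(f_{\psi,\phi}(s,a,s')\big)}
           {\exp\!\big(f_{\psi,\phi}(s,a,s')\big) \;+\; \pi(a\mid s)},
\qquad
f_{\psi,\phi}(s,a,s') \;=\; g_\psi(s,a) \;+\; \gamma\,h_\phi(s') \;-\; h_\phi(s)
```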
Can we just use a regular discriminator?
Ho & Ermon. Generative adversarial imitation learning.
Pros & cons:
+ often simpler to set up optimization, fewer moving parts
- discriminator knows nothing at convergence
- generally cannot reoptimize the “reward”
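A minimal sketch of the “regular discriminator” alternative (GAIL-style), assuming PyTorch, simple state-action inputs, and the illustrative surrogate reward -log(1 - D); the architecture and names are assumptions for this sketch, not the exact setup from Ho & Ermon:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Binary classifier D(s, a): probability that (s, a) came from the expert."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))   # logits

def discriminator_step(disc, opt, expert_batch, policy_batch):
    """One binary cross-entropy update: expert pairs = 1, policy samples = 0."""
    bce = nn.BCEWithLogitsLoss()
    expert_logits = disc(*expert_batch)
    policy_logits = disc(*policy_batch)
    loss = (bce(expert_logits, torch.ones_like(expert_logits)) +
            bce(policy_logits, torch.zeros_like(policy_logits)))
    opt.zero_grad()
    loss.backward()
    opt.step()

def surrogate_reward(disc, obs, act):
    """Policy 'reward' read off the discriminator. Note: it carries no
    information at convergence (D -> 1/2) and generally can't be re-optimized."""
    with torch.no_grad():
        d = torch.sigmoid(disc(obs, act))
    return -torch.log(1.0 - d + 1e-8)
```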
IRL as adversarial optimization
Generative Adversarial Imitation Learning (Ho & Ermon, NIPS 2016; Hausman, Chebotar, Schaal, Sukhatme, Lim; Peng, Kanazawa, Toyer, Abbeel, Levine): robot attempt → classifier
Guided Cost Learning (Finn et al., ICML 2016): robot attempt → reward function
These are actually the same thing!
Suggested Reading on Inverse RL
Classic Papers:
Abbeel & Ng ICML ’04. Apprenticeship Learning via Inverse Reinforcement
Learning. Good introduction to inverse reinforcement learning
Ziebart et al. AAAI ’08. Maximum Entropy Inverse Reinforcement Learning.
Introduction to probabilistic method for inverse reinforcement learning
Modern Papers:
Finn et al. ICML ’16. Guided Cost Learning. Sampling-based method for
MaxEnt IRL that handles unknown dynamics and deep reward functions
Wulfmeier et al. arXiv ’16. Deep Maximum Entropy Inverse Reinforcement
Learning. MaxEnt inverse RL using deep reward functions
Ho & Ermon NIPS ’16. Generative Adversarial Imitation Learning. Inverse RL
method using generative adversarial networks
Fu, Luo, Levine ICLR ‘18. Learning Robust Rewards with Adversarial Inverse
Reinforcement Learning