Learning Contact-Rich
Manipulation Skills with Guided
Policy Search
Sergey Levine, Nolan Wagener, and
Pieter Abbeel
ICRA 2015
Presenter: Sungjoon Choi
Recent Trends in Reinforcement Learning: Deep Neural Policy Learning
(based on my personal opinion, which may be somewhat misleading)
Presenter: Sungjoon Choi
Learning Contact-Rich Manipulation Skills
with Guided Policy Search
http://rll.berkeley.edu/icra2015gps/
Learning Contact-Rich Manipulation Skills
with Guided Policy Search
This paper won the ICRA 2015 Best Manipulation Paper Award!
But why? What’s so great about this paper?
Personally, I think the main contribution of this paper is a direct policy learning
method that can ‘actually train a real-world robot’ to perform manipulation tasks.
That’s it??
I guess so! By the way, ‘actually training a real-world robot’ is
harder than you might imagine!
You will see how brilliant this paper is!
Brief review of MDP and RL
[Figure: the agent-environment loop — the agent takes an action, and receives an observation and a reward from the environment.]
Brief review of MDP and RL
Key ingredients: state, action, reward, value, policy, and (dynamics) model.
Brief review of MDP and RL
Remember! The goal of MDP and RL is to find an optimal policy!
It is like saying “I will find a function which best satisfies the given conditions!”.
However, learning a function is not an easy problem. (In fact, it is impossible
unless we use some ‘prior’ knowledge!)
So, instead of learning a function itself, most works find the ‘parameters’ of a
function by restricting the solution space to a family of parametric functions,
such as linear functions.
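To make “restricting the solution space to a parametric family” concrete, here is a minimal sketch (my own illustration, not from the slides) of a linear-Gaussian policy whose only free parameters are a weight matrix, a bias, and a noise level:

import numpy as np

# A minimal sketch (not from the slides): instead of searching over all possible
# policies, we search over the parameters (W, b, sigma) of a linear-Gaussian
# policy pi(u | x) = N(W x + b, sigma^2 I).
class LinearGaussianPolicy:
    def __init__(self, state_dim, action_dim, sigma=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((action_dim, state_dim))
        self.b = np.zeros(action_dim)
        self.sigma = sigma

    def act(self, x, rng=None):
        rng = rng or np.random.default_rng()
        mean = self.W @ x + self.b
        return mean + self.sigma * rng.standard_normal(mean.shape)

# Usage: a 4-dimensional state, a 2-dimensional action.
policy = LinearGaussianPolicy(state_dim=4, action_dim=2)
u = policy.act(np.zeros(4))

Policy search then reduces to adjusting W, b, and sigma rather than searching over arbitrary functions.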
Brief review of MDP and RL
What are typical impediments in reinforcement learning?
In other words, why is it so HARD to find an optimal policy??
1. We are living in a continuous world, not a discrete grid world.
- In this continuous world, the standard (tabular) MDP machinery cannot be applied directly.
- So instead, we usually use function approximation to handle this issue.
2. However, linear functions do not work well in practice.
- And, of course, nonlinear functions are hard to optimize.
3. The (dynamic) model, which is often required, is HARD to obtain.
- Recall that value is defined as the “expected sum of rewards”, and computing that expectation requires the model.
Today’s paper tackles all three problems listed above!!
Big Picture (which might be wrong)
RL: Reinforcement Learning
IRL: Inverse Reinforcement Learning (= IOC: Inverse Optimal Control)
LfD: Learning from Demonstration
DPL: Direct Policy Learning
[Diagram: a rough map relating RL, DPL, IRL (=IOC), and LfD, with Guided Policy Search and Constrained Guided Policy Search placed within it.]

RL — Objective: find the optimal policy. Given: reward, dynamic model. Not given: policy. Algorithms: policy iteration, value iteration, TD learning, Q-learning.
IRL — Objective: find the underlying reward and the optimal policy. Given: experts’ demonstrations (often the dynamic model). Not given: reward, policy. Algorithms: MaxEnt IRL, MaxMargin planning, apprenticeship learning.
DPL — Objective: find the optimal policy. Given: experts’ demonstrations, reward. Not given: dynamic model (not always). Algorithm: guided policy search.
LfD — Objective: find the underlying reward and the optimal policy. Given: experts’ demonstrations + others. Not given: dynamic model (not always). Algorithm: GP motion controller.
MDP is powerful, but it requires heavy computation to find the value function. → LMDP [1]
Let’s use the LMDP for the inverse optimal control problem! → [2]
How can we measure the ‘probability’ of (experts’) state-action sequences? → [3]
Can we learn a ‘nonlinear’ reward function? → [4]
Can we do that with only ‘locally’ optimal examples? → [5]
Given the reward, how can we ‘effectively’ learn the optimal policy? → [6] (note that the reward is given!)
Re-formalize guided policy search. → [7]
Let’s learn both the ‘dynamic model’ and the policy!! → [8]
Image-based control with a CNN!! → [9] (the beginning of a new era: RL + deep learning)
Applied to a real-world robot, PR2!! → [10]
How can we ‘effectively’ search for the optimal policy? → [11] (latest)

[1] Emanuel Todorov. "Linearly-solvable Markov decision problems." NIPS 2006.
[2] Krishnamurthy Dvijotham and Emanuel Todorov. "Inverse optimal control with linearly-solvable MDPs." ICML 2010.
[3] Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. "Maximum Entropy Inverse Reinforcement Learning." AAAI 2008.
[4] Sergey Levine, Zoran Popovic, and Vladlen Koltun. "Nonlinear inverse reinforcement learning with Gaussian processes." NIPS 2011.
[5] Sergey Levine and Vladlen Koltun. "Continuous inverse optimal control with locally optimal examples." ICML 2012.
[6] Sergey Levine and Vladlen Koltun. "Guided policy search." ICML 2013.
[7] Sergey Levine and Vladlen Koltun. "Learning complex neural network policies with trajectory optimization." ICML 2014.
[8] Sergey Levine and Pieter Abbeel. "Learning neural network policies with guided policy search under unknown dynamics." NIPS 2014.
[9] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. "End-to-End Training of Deep Visuomotor Policies." ICRA 2015.
[10] Sergey Levine, Nolan Wagener, and Pieter Abbeel. "Learning Contact-Rich Manipulation Skills with Guided Policy Search." ICRA 2015.
[11] Bradly C. Stadie, Sergey Levine, and Pieter Abbeel. "Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models." arXiv 2015.
Learning Contact-Rich Manipulation Skills
with Guided Policy Search
The main building block is Guided Policy Search (GPS).
GPS is a two-stage algorithm consisting of a trajectory optimization stage and a
policy learning stage.
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
Levine, Sergey, and Vladlen Koltun. "Learning complex neural network policies with trajectory optimization." ICML 2014
GPS is a direct policy search algorithm that can effectively scale to
high-dimensional systems.
Levine, Sergey, and Pieter Abbeel. "Learning neural network policies with guided policy search under unknown dynamics." NIPS 2014.
Guided Policy Search
Stage 1) Trajectory optimization (iterative LQR)
Given a reward function and a dynamic model, trajectory optimization produces
guiding trajectories; each trajectory consists of (state, action) pairs.
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
Iterative LQR
The iterative linear quadratic regulator optimizes a trajectory by repeatedly
solving for the optimal policy under linear-quadratic assumptions.
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
Key assumptions (as local approximations): linear dynamics and quadratic reward.
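For reference (a hedged sketch following the standard iLQR formulation, not equations taken from the slides), the local approximations around a nominal trajectory (x̂_t, û_t) look roughly like:

Linear dynamics:  x_{t+1} ≈ x̂_{t+1} + f_{x,t} (x_t − x̂_t) + f_{u,t} (u_t − û_t)
Quadratic reward: r(x_t, u_t) ≈ second-order Taylor expansion of r around (x̂_t, û_t)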
Iterative LQR
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
Iteratively compute a trajectory, find a (locally optimal) deterministic policy
based on that trajectory, and recompute the trajectory until convergence.
But this only yields a deterministic policy. We need something stochastic!
By exploiting the concepts of linearly solvable MDPs and maximum entropy control,
one can derive the following stochastic policy.
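In the GPS paper the stochastic guiding controller is (roughly) a time-varying linear-Gaussian policy whose mean comes from the iLQR feedback law and whose covariance is the inverse of the action-value Hessian Q_uu. The sketch below is my own illustration of the iLQR backward pass under the linear-quadratic assumptions above; variable names and shapes are assumptions, not the authors' code.

import numpy as np

def ilqr_backward_pass(fx, fu, cx, cu, cxx, cuu, cux):
    """Minimal iLQR backward pass (a sketch, not the authors' implementation).

    Per-time-step local approximations around a nominal trajectory:
      fx[t] (n,n), fu[t] (n,m) : dynamics Jacobians
      cx[t] (n,),  cu[t] (m,)  : cost gradients (cost = negative reward)
      cxx[t], cuu[t], cux[t]   : cost Hessians, shapes (n,n), (m,m), (m,n)
    Returns feedback gains K[t], feedforward terms k[t], and Quu[t]; the
    maximum-entropy stochastic controller is then (roughly)
      pi(u_t | x_t) = N(nominal_u_t + k[t] + K[t] (x_t - nominal_x_t), inv(Quu[t])).
    """
    T = len(fx)
    n = fx[0].shape[0]
    Vx, Vxx = np.zeros(n), np.zeros((n, n))   # terminal cost folded into the last stage
    K, k, Quu_all = [None] * T, [None] * T, [None] * T

    for t in reversed(range(T)):
        # Quadratic expansion of the Q-function at time t.
        Qx  = cx[t]  + fx[t].T @ Vx
        Qu  = cu[t]  + fu[t].T @ Vx
        Qxx = cxx[t] + fx[t].T @ Vxx @ fx[t]
        Quu = cuu[t] + fu[t].T @ Vxx @ fu[t]
        Qux = cux[t] + fu[t].T @ Vxx @ fx[t]

        Quu_inv = np.linalg.inv(Quu)
        K[t], k[t] = -Quu_inv @ Qux, -Quu_inv @ Qu   # feedback / feedforward terms
        Quu_all[t] = Quu

        # Value-function expansion passed back to the previous time step.
        Vx  = Qx  + K[t].T @ Quu @ k[t] + K[t].T @ Qu + Qux.T @ k[t]
        Vxx = Qxx + K[t].T @ Quu @ K[t] + K[t].T @ Qux + Qux.T @ K[t]

    return K, k, Quu_all

A forward pass then rolls the controller out through the dynamics, the approximations are recomputed around the new trajectory, and the two passes alternate until convergence, matching the iteration described above.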
Guided Policy Search
Stage 2) Policy learning
From the collected (state, action) pairs, train neural network controllers
using importance sampled policy search.
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
Importance Sampled Policy Search
The neural network policy over a partial trajectory ζ_{1:t} factorizes as
π_θ(ζ_{1:t}) = ∏_{k=1}^{t} N(u_k; μ^π(x_k), σ²),
where μ^π(x_k) is the network’s output for state x_k.
Importance sampled policy search finds the parameters θ that maximize an
importance-sampled objective Φ(θ); the per-time-step normalizer of the
importance weights is
Z_t(θ) = ∑_{i=1}^{m} π_θ(ζ^i_{1:t}) / q(ζ^i_{1:t}).
The pieces of the objective play the following roles:
- the reward (cost) collected along each sampled trajectory,
- the neural policy π_θ evaluated on the samples (data fitting),
- the sampling distribution q, an average of the guiding distributions and/or the previous policy (to compensate for off-policy samples),
- a regularizer that lowers the variance of the estimator (and encourages exploration).
Since the gradient of Φ(θ) is analytic, the neural network can be trained with
ordinary back-propagation.
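As a concrete illustration of the importance-weighting machinery, here is a small numpy sketch that computes Z_t(θ) and an importance-weighted return estimate from sample log-probabilities. It follows the formulas above, but the exact regularizer and normalization in the GPS paper differ in details, and all names here are my own.

import numpy as np

def importance_weighted_objective(logp_policy, logp_sampler, rewards, w_reg=1e-2):
    """Importance-weighted return estimate (a sketch of the idea above).

    logp_policy[i, t]  : log pi_theta(zeta^i_{1:t}), cumulative log-prob of sample i
                         up to time t under the current neural policy
    logp_sampler[i, t] : log q(zeta^i_{1:t}), cumulative log-prob under the sampling
                         distribution (guiding distributions / previous policy)
    rewards[i, t]      : reward collected by sample i at time t
    """
    log_w = logp_policy - logp_sampler                  # log importance weights
    # Z_t(theta) = sum_i pi_theta(zeta^i_{1:t}) / q(zeta^i_{1:t}), per time step,
    # computed in log space for numerical stability.
    log_Z = np.logaddexp.reduce(log_w, axis=0)          # shape (T,)
    weights = np.exp(log_w - log_Z)                     # normalized weights, sum_i = 1

    # Importance-weighted expected reward summed over time, plus a log-Z term
    # that discourages the estimate from relying on very few samples.
    return np.sum(weights * rewards) + w_reg * np.sum(log_Z)

# Usage with toy data: m = 4 sample trajectories, T = 5 time steps.
rng = np.random.default_rng(0)
lp_pi = -rng.random((4, 5)).cumsum(axis=1)
lp_q = -rng.random((4, 5)).cumsum(axis=1)
r = rng.random((4, 5))
print(importance_weighted_objective(lp_pi, lp_q, r))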
Constrained Guided Policy Search
Levine, Sergey, and Pieter Abbeel. "Learning neural network policies with guided policy search under unknown dynamics." NIPS 2014.
What if we don’t know the dynamics of the robot?
We can use real-world trajectories to locally approximate the dynamics model.
Constrained Guided Policy Search
Levine, Sergey, and Pieter Abbeel. "Learning neural network policies with guided policy search under unknown dynamics." NIPS 2014.
However, since this is only a local approximation, a large deviation from the
previous trajectories might lead to disastrous optimization results.
So, impose a constraint on the KL-divergence between the old and new trajectory
distributions!
A Gaussian mixture model is further used as a prior, reducing the number of
samples needed to fit the local dynamics model.
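To give a flavor of the “locally approximate the dynamics from data” step, here is a minimal least-squares sketch (my own, ignoring the GMM prior and the KL-constrained update from the paper): it fits time-varying linear dynamics x_{t+1} ≈ A_t x_t + B_t u_t + c_t from a batch of real trajectories.

import numpy as np

def fit_local_linear_dynamics(X, U):
    """Fit x_{t+1} ~ A_t x_t + B_t u_t + c_t by least squares, per time step.

    X : array of shape (num_rollouts, T+1, state_dim)  -- visited states
    U : array of shape (num_rollouts, T,   action_dim) -- applied actions
    Returns lists A[t], B[t], c[t]. (A sketch only: the paper additionally uses a
    GMM prior over (x_t, u_t, x_{t+1}) tuples to make the fit sample-efficient.)
    """
    N, T, m = U.shape
    n = X.shape[2]
    A, B, c = [], [], []
    for t in range(T):
        # Design matrix [x_t, u_t, 1] and targets x_{t+1}, stacked over rollouts.
        Z = np.hstack([X[:, t, :], U[:, t, :], np.ones((N, 1))])   # (N, n+m+1)
        Y = X[:, t + 1, :]                                         # (N, n)
        W, *_ = np.linalg.lstsq(Z, Y, rcond=None)                  # (n+m+1, n)
        A.append(W[:n, :].T)          # (n, n)
        B.append(W[n:n + m, :].T)     # (n, m)
        c.append(W[-1, :])            # (n,)
    return A, B, c

# Usage with toy data: 5 rollouts, horizon 10, 4-dim state, 2-dim action.
rng = np.random.default_rng(0)
X, U = rng.standard_normal((5, 11, 4)), rng.standard_normal((5, 10, 2))
A, B, c = fit_local_linear_dynamics(X, U)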
Learning Contact-Rich Manipulation Skills
with Guided Policy Search
This paper uses constrained guided policy search to learn contact-rich
manipulation skills.
Learning Contact-Rich Manipulation Skills
with Guided Policy Search
Tasks: (a) stacking large lego blocks on a fixed base, (b) onto a free-standing
block, (c) held in both grippers; (d) threading wooden rings onto a tight-fitting
peg; (e) assembling a toy airplane by inserting the wheels into a slot;
(f) inserting a shoe tree into a shoe; (g, h) screwing caps onto pill bottles and
(i) onto a water bottle.
Learning Contact-Rich Manipulation Skills
with Guided Policy Search
The neural network policy (agent) outputs 7 joint torques. Its input state consists of:
1. Current joint angles and velocities
2. Cartesian velocities of two or three points on the manipulated object
3. Vectors from the target positions of these points to their current positions
4. Torques applied at the previous time step
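As a rough illustration of that state layout (dimensions are my guesses for illustration, not taken from the paper), the input vector could be assembled like this:

import numpy as np

# A rough sketch of the state layout above; the 7 joints come from the PR2 arm,
# while the number of object points (here 3) and the ordering are assumptions.
def assemble_state(joint_angles, joint_velocities, point_velocities,
                   target_deltas, previous_torques):
    """Concatenate the components listed above into a single state vector."""
    return np.concatenate([
        joint_angles,              # (7,)   current joint angles
        joint_velocities,          # (7,)   current joint velocities
        point_velocities.ravel(),  # (3, 3) Cartesian velocities of object points
        target_deltas.ravel(),     # (3, 3) vectors from target to current positions
        previous_torques,          # (7,)   torques applied at the previous step
    ])

state = assemble_state(np.zeros(7), np.zeros(7), np.zeros((3, 3)),
                       np.zeros((3, 3)), np.zeros(7))
print(state.shape)  # (39,) under these assumed dimensions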
Conclusion
Constrained guided policy search is used to train a real-world PR2 robot to
perform several contact-rich tasks.
The policy is modeled with a neural network.
Prior knowledge about the dynamics is NOT required.
Iterative LQR is used to define a guiding distribution, which serves as the
proposal distribution in importance sampled policy search.
Thank you!
Any questions?
