Combining Model-Based and
Model-Free Updates for
Trajectory-Centric Reinforcement
Learning (ICML2017)
Reader: Kusano Hitoshi (Kyoto University)
July 9, 2017
1 / 38
Index
1 Introduction
2 Preliminary
3 Integrating Model-Based Updates into PI2
4 Experimental Evaluation
5 Future work
2 / 38
About the authors
Yevgen Chebotar∗¹, Karol Hausman∗¹, Marvin Zhang∗²,
Gaurav Sukhatme¹, Stefan Schaal¹, Sergey Levine²
∗ denotes equal contribution
¹ University of Southern California
² University of California, Berkeley
All authors marked with ∗ are Ph.D. students.
3 / 38
1 Introduction
2 Preliminary
3 Integrating Model-Based Updates into PI2
4 Experimental Evaluation
5 Future work
4 / 38
2 requirements for RL
• A goal of reinforcement learning (RL):
learn motor skills through trial and error
• In RL, there are 2 requirements:
1 Data efficiency: a lower sampling cost is better
2 Compatibility with unknown dynamical systems
Figure: Example of environment used to evaluate their method
5 / 38
Trade-off between 2 types of RL and hybrid
There are 2 types of RL:
• Model-based: the dynamics are known or estimated
• Model-free: the dynamics are not modeled
Usually there is a trade-off between them:

                  Model-Based      Model-Free
Data efficiency   More efficient   Less efficient
Compatibility     Lower            Higher

In this research, they try to integrate the two and achieve
both data efficiency and compatibility with unknown dynamics
6 / 38
Novelty of this research
• Prior works that combine the two achieve only modest gains in
efficiency and performance [3, 2]
Novelty of this research
• Proposes an integrated algorithm that combines model-based
efficiency with model-free generality, in the context of a
specific policy representation: time-varying linear-Gaussian
(TVLG) controllers
• Can solve complex continuous control tasks that are
infeasible for either approach alone
• Remains orders of magnitude more efficient than standard
model-free RL
• Can be extended to arbitrary policy classes via Guided Policy
Search [4]
7 / 38
1 Introduction
2 Preliminary
3 Integrating Model-Based Updates into PI2
4 Experimental Evaluation
5 Future work
8 / 38
Optimize the expected cost of the policy
under TVLG controllers
• Goal of policy search: optimize θ, the parameters of
p(ut|xt), where ut: action and xt: state at time t
• Cost: J(θ) = Ep[c(τ)] = ∫ c(τ) p(τ) dτ, where
  • τ = (x1, u1, ..., xT, uT): trajectory
  • c(τ) = ∑_{t=1}^T c(xt, ut): trajectory cost
  • p(τ) = p(x1) ∏_{t=1}^T p(xt+1|xt, ut) p(ut|xt): policy trajectory
    distribution
• Assume time-varying linear-Gaussian (TVLG) controllers:
p(ut|xt) = N(Kt xt + kt, Σt) (a rollout sketch follows this slide)
• Model-Based: LQR-FLM
• Model-Free: PI2
9 / 38
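To make the TVLG representation concrete, here is a minimal rollout sketch (not the authors' code; `dynamics_step` and `cost` are hypothetical stand-ins for the real system, and `K`, `k`, `Sigma` are assumed to be lists of per-time-step parameters):

```python
import numpy as np

def rollout_tvlg(K, k, Sigma, x0, dynamics_step, cost, T):
    """Sample one trajectory from p(u_t|x_t) = N(K_t x_t + k_t, Sigma_t)."""
    x = x0
    xs, us, total_cost = [], [], 0.0
    for t in range(T):
        mean = K[t] @ x + k[t]                             # time-varying linear mean
        u = np.random.multivariate_normal(mean, Sigma[t])  # Gaussian exploration
        total_cost += cost(x, u)                           # accumulate c(x_t, u_t)
        xs.append(x)
        us.append(u)
        x = dynamics_step(x, u)                            # true (unknown) dynamics
    return np.array(xs), np.array(us), total_cost
```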
Model-based optimization of TVLG
controller, LQR-FLM
• LQR-FLM is a model-based RL algorithm based
on prior works [4, 5]
• Use samples to fit a TVLG dynamics model
p(xt+1|xt, ut) = N(fx,txt + fu,tut, Ft) and assume a
twice-differentiable cost function.
• A KL-divergence constraint is used to deal with the unknown
dynamics [4]
• LQR-FLM is efficient, though it depends heavily on being able
to model the system dynamics accurately (a sketch of the
model-fitting step follows this slide)
Optimize
min_{p^(i)} E_{p^(i)}[Q(xt, ut)]   s.t.   E_{p^(i)}[DKL(p^(i) ∥ p^(i−1))] ≤ ϵt   (1)
10 / 38
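As a rough illustration of the model-fitting step that LQR-FLM relies on, the sketch below fits a TVLG dynamics model p(xt+1|xt, ut) = N(fx,t xt + fu,t ut + fc,t, Ft) by per-time-step least squares on sampled transitions. This is a simplified stand-in under assumed array shapes; the fitting used in the paper is more sophisticated.

```python
import numpy as np

def fit_tvlg_dynamics(X, U):
    """X: (N, T+1, dx) states, U: (N, T, du) actions from N sampled rollouts."""
    N, T, du = U.shape
    dx = X.shape[2]
    fx, fu, fc, F = [], [], [], []
    for t in range(T):
        XU = np.concatenate([X[:, t], U[:, t], np.ones((N, 1))], axis=1)  # regressors
        Y = X[:, t + 1]                                                   # next states
        W, *_ = np.linalg.lstsq(XU, Y, rcond=None)                        # least squares
        fx.append(W[:dx].T)            # f_{x,t}
        fu.append(W[dx:dx + du].T)     # f_{u,t}
        fc.append(W[-1])               # constant offset
        resid = Y - XU @ W
        F.append(resid.T @ resid / N)  # residual covariance F_t
    return fx, fu, fc, F
```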
Model-free optimization via PI2
• PI2 is a model-free algorithm based on stochastic optimal
control [6]
• Let S(xi,t, ui,t) = c(xi,t, ui,t) + ∑_{j=t+1}^T c(xi,j, ui,j), the
cost-to-go of trajectory i ∈ {1, ..., N}
• Probability
P(xi,t, ui,t) = exp(−(1/ηt) S(xi,t, ui,t)) / ∫ exp(−(1/ηt) S(xi,t, ui,t)) dui,t
Intuition: trajectories with lower costs receive higher
probabilities
• After computing the new P, we update the policy distribution
by reweighting each sampled control ui,t by P(xi,t, ui,t) and
update the policy parameters by maximum-likelihood
estimation [1] (a sketch of this reweighting follows this slide)
11 / 38
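A minimal sketch of the PI2 reweighting intuition described above (assuming a fixed, user-chosen temperature `eta`; not the authors' implementation):

```python
import numpy as np

def pi2_weights(S_t, eta):
    """S_t: (N,) cost-to-go of N sampled trajectories at one time step."""
    z = -(S_t - S_t.min()) / eta     # subtract the minimum for numerical stability
    w = np.exp(z)
    return w / w.sum()               # normalized probabilities P(x_i,t, u_i,t)

def pi2_update_controls(U_t, S_t, eta):
    """Reweighted (maximum-likelihood) estimate of the new mean control."""
    w = pi2_weights(S_t, eta)
    return (w[:, None] * U_t).sum(axis=0)   # U_t: (N, du) sampled controls
```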
Model-free optimization via PI2
Theorem 1:
The PI2 update corresponds to
min_{p^(i)} E_{p^(i)}[S(xt, ut)]   s.t.   E_{p^(i)}[DKL(p^(i) ∥ p^(i−1))] ≤ ϵ,   (2)
where ϵ is the maximum KL-divergence between the new policy
p^(i)(ut|xt) and the old policy p^(i−1)(ut|xt).
PI2 has been used to solve robotic tasks such as door opening,
and achieves better final performance than LQR-FLM.
Minimizing (2) yields the relationship
p^(i)(ut|xt) ∝ p^(i−1)(ut|xt) E_{p^(i−1)}[exp(−(1/η) S(xt, ut))]   (3)
12 / 38
1 Introduction
2 Preliminary
3 Integrating Model-Based Updates into PI2
4 Experimental Evaluation
5 Future work
13 / 38
Overview
• Both PI2 and LQR-FLM can be used to learn TVLG policies, and
each has strengths and weaknesses
• Divide the PI2 update into 2 parts: one part using a
model-based cost approximation, and another part using the
residual cost error left after the first
• Employ LQR-FLM as the model-based cost approximation and
integrate the two
• Demonstrate that the method retains the strengths of PI2 and
LQR-FLM while compensating for their weaknesses
14 / 38
Two-Stage PI2 update
• Divide PI2 into 2 parts
• Let
  • ĉ(xt, ut): approximate cost
  • c(xt, ut): real cost
  • c̃(xt, ut): residual cost
  • Ŝt: approximate cost-to-go
  • St: real cost-to-go
  • S̃t: residual cost-to-go
  • where Ŝt = Ŝt(xt, ut), St = St(xt, ut), S̃t = S̃t(xt, ut),
    c̃(xt, ut) = c(xt, ut) − ĉ(xt, ut), S̃t = St − Ŝt
p^(i)(ut|xt) ∝ p^(i−1)(ut|xt) E_{p^(i−1)}[exp(−(1/η)(Ŝt + S̃t))]
           ∝ p̂(ut|xt) E_{p^(i−1)}[exp(−(1/η) S̃t)]   (4)
where p̂(ut|xt) = p^(i−1)(ut|xt) E_{p^(i−1)}[exp(−(1/η) Ŝt)]
(a sketch of the residual cost-to-go computation follows this slide)
15 / 38
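As a sketch of the cost split (not the authors' code; `cost` and `approx_cost` are hypothetical callables, with `approx_cost` standing in for the model-based approximation ĉ), the residual cost-to-go S̃t fed to the PI2 stage can be computed per trajectory like this:

```python
import numpy as np

def residual_cost_to_go(xs, us, cost, approx_cost):
    """xs: (T, dx), us: (T, du) from one sampled trajectory."""
    c = np.array([cost(x, u) for x, u in zip(xs, us)])            # real costs c_t
    c_hat = np.array([approx_cost(x, u) for x, u in zip(xs, us)]) # approx costs c_hat_t
    S = np.cumsum(c[::-1])[::-1]          # real cost-to-go S_t (sum from t to T)
    S_hat = np.cumsum(c_hat[::-1])[::-1]  # approximate cost-to-go S_hat_t
    return S - S_hat                      # residual S_tilde_t handled by PI2
```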
Two-Stage PI2 update: Summary
Whole Update Procedure
1 Update using the approximate costs ĉ(xt, ut) and samples from
the old policy p^(i−1)(ut|xt) to get p̂(ut|xt)
2 Update p^(i)(ut|xt) using the residual costs c̃(xt, ut) and
samples from p̂(ut|xt)
p̂(ut|xt) = p^(i−1)(ut|xt) E_{p^(i−1)}[exp(−(1/η) Ŝt)]   (5)
p^(i)(ut|xt) ∝ p̂(ut|xt) E_{p^(i−1)}[exp(−(1/η) S̃t)]   (6)
16 / 38
Model-Based Substitution with LQR-FLM
We can use the correspondence between Eq. (2) and Eq. (3) to
rewrite Eq. (5) as an optimization problem.
Rewritten Eq. (5)
min_{p̂} E_{p̂}[Ŝ(xt, ut)]   s.t.   E_{p^(i−1)}[DKL(p̂ ∥ p^(i−1))] ≤ ϵ   (7)
Thus, p̂(ut|xt) can be obtained using any algorithm that can
solve this optimization problem.
They employ LQR-FLM for this update.
17 / 38
Optimizing Cost Residuals with PI2
Assume the same structure for both controllers.
Controller for p^(i−1)(ut|xt)
ui,t = Kt xi,t + kt + √Σt ξi,t   (8)
where Kt, kt, Σt: parameters of p^(i−1)
Controller for p̂(ut|xt)
ûi,t = K̂t xi,t + k̂t + √Σ̂t ξi,t   (9)
where K̂t, k̂t, Σ̂t: parameters of p̂,
and ξi,t: the noise corresponding to sample i at time t
(a sketch of how this shared noise is reused follows this slide)
18 / 38
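One way to read Eqs. (8)–(9) is that, because both controllers share the same noise ξi,t, samples "from" p̂ can be constructed from the old policy's samples without extra environment interaction. The sketch below illustrates this (a Cholesky factor is used as the matrix square root; names and shapes are assumptions, not the authors' code):

```python
import numpy as np

def reuse_noise(u, x, K, k, Sigma, K_hat, k_hat, Sigma_hat):
    """Map a sampled control u of p^(i-1) at state x to the matching sample of p_hat."""
    L = np.linalg.cholesky(Sigma)             # one choice of sqrt(Sigma_t)
    xi = np.linalg.solve(L, u - K @ x - k)    # recover the noise xi_{i,t}
    L_hat = np.linalg.cholesky(Sigma_hat)     # sqrt(Sigma_hat_t)
    return K_hat @ x + k_hat + L_hat @ xi     # u_hat_{i,t} driven by the same noise
```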
Summary of PILQR algorithm
19 / 38
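The slide title refers to the paper's algorithm summary. As a rough pseudocode-style sketch of one PILQR iteration (all arguments are hypothetical callables, and the paper's cost-splitting and step-size adjustment details are omitted):

```python
def pilqr_iteration(p_prev, collect_samples, fit_dynamics, approx_cost,
                    lqr_flm_update, residual_cost_to_go, pi2_update):
    """One PILQR iteration: model-based step on c_hat, PI2 step on the residual."""
    samples = collect_samples(p_prev)                 # roll out p^(i-1)
    dynamics = fit_dynamics(samples)                  # local TVLG dynamics model
    c_hat = approx_cost(samples, dynamics)            # model-based cost approximation
    p_hat = lqr_flm_update(p_prev, dynamics, c_hat)   # Eq. (7): KL-constrained LQR step
    S_res = residual_cost_to_go(samples, c_hat)       # S_tilde = S - S_hat
    return pi2_update(p_hat, samples, S_res)          # Eq. (6): reweight by exp(-S_tilde/eta)
```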
Training Parametric Policies with GPS
PILQR offers an approach to trajectory optimization of TVLG
policies only.
We can extend PILQR to train parametric policies by employing
mirror descent guided policy search (MDGPS).
Procedure
1 PILQR learns simple TVLG policies, the local policies
p(ut|xt), for each condition
2 The optimized local controllers are used to train the global
policy πθ of MDGPS in a supervised manner (a sketch follows
this slide)
Hence, the global policy generalizes across the conditions of
the multiple local policies.
20 / 38
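A minimal sketch of the supervised MDGPS step: the global policy is fit to state-action pairs produced by the optimized local TVLG policies. Here a linear least-squares regressor stands in for the neural network policy used in the paper, and `local_rollouts` is an assumed data layout.

```python
import numpy as np

def mdgps_supervised_step(local_rollouts):
    """local_rollouts: list of (states (T, dx), actions (T, du)), one per condition."""
    X = np.concatenate([s for s, _ in local_rollouts])      # all visited states
    U = np.concatenate([u for _, u in local_rollouts])      # local-policy actions
    Xb = np.concatenate([X, np.ones((len(X), 1))], axis=1)  # add a bias feature
    W, *_ = np.linalg.lstsq(Xb, U, rcond=None)              # regress actions on states
    return lambda x: np.append(x, 1.0) @ W                  # stand-in global policy pi_theta
```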
1 Introduction
2 Preliminary
3 Integrating Model-Based Updates into PI2
4 Experimental Evaluation
5 Future work
21 / 38
Aim of Experiment
To answer the questions below, they conducted both simulated
comparisons and real-robot experiments.
Key questions:
• How does our method compare to other
trajectory-centric and deep RL algorithms in terms
of final performance and sample efficiency?
• Can we utilize linear-Gaussian policies trained
using PILQR to obtain robust neural network
policies using MDGPS?
• Is our proposed algorithm capable of learning
complex manipulation skills on a real robotic
platform?
22 / 38
Simulation Experiments
Evaluate on 3 simulated robotic manipulation tasks
1 Pushing a block: Involves controlling a 4 DoF arm to push a white block
to a red goal area. The cost function is a weighted combination of the
distance from the gripper to the block and from the block to the goal.
2 Opening a door in 3D: Requires opening a door with a 6 DoF 3D arm.
The cost function is a weighted combination of the distance of the end
effector to the door handle and the angle of the door.
3 Reacher: Requires moving the end of a 2 DoF arm to a target position.
The cost function is the distance from the end effector to the target.
Figure: Simulated robotic manipulation tasks
23 / 38
Pushing a block
Comparison of PILQR to LQR-FLM and PI2
on the
most difficult condition for the gripper pusher task.
Figure: Average final distance
Fast convergence and better performance
24 / 38
Pushing a block
PILQR solves all four conditions with 400 total episodes per
condition, showing it is able to learn a diverse set of
successful behaviors including flicking, guiding, and hitting
the block.
25 / 38
Door opening task
PILQR succeeds at opening the door from each of the four
initial robot positions, while LQR-FLM fails to open the door.
26 / 38
Neural network policies on the reacher task
Results for MDGPS with each local policy method, as
well as two prior deep RL methods TRPO and DDPG
Figure: Final distance from the reacher end effector to the target
averaged across 300 random test conditions per iteration
The task itself is quite simple, but PILQR-MDGPS converges 25
and 150 times faster than DDPG and TRPO respectively, while
prior hybrid methods are only up to about 5 times faster
27 / 38
Neural network policies on the door opening
Results for MDGPS with each local policy method, as
well as two prior deep RL methods TRPO and DDPG
Figure: Minimum angle in radians of the door hinge averaged
across 100 random test conditions per iteration
TRPO requires 20 times more samples; the other methods fail.
28 / 38
Task Description
They conducted 2 kinds of robot experiments.
Hockey
Requires using a stick to hit a puck into a goal 1.4 m
away. The cost function consists of two parts: the
distance between the current position of the stick and a
target pose that is close to the puck, and the distance
between the position of the puck and the goal.
Power plug plugging
In this task, the robot must plug a power plug into an
outlet. The cost function is the distance between the
plug and a target location inside the outlet.
29 / 38
Environment
Figure: A PR2 robot is used for the real-robot experiments.
Left: the hockey task, Right: the power plug task
In the hockey task, the puck and the goal are tracked using a
motion capture system.
30 / 38
Hockey task 1
Learn a policy that is able to hit the puck into the goal
for a single position of the goal and the puck
Figure: Single condition comparison of the hockey task
31 / 38
Hockey task 2
Learn a neural network policy using the MDGPS-PILQR algorithm
that can hit the puck into different goal locations. They
collected 30 rollouts for 3 different positions and trained the
neural network on them.
Figure: Right: Experimental setup for hockey task, left: success
rate for each position.
32 / 38
Power plug plugging
This task requires very fine manipulation. While LQR-FLM
succeeded only 60% of the time at convergence, their method
converged to a policy that plugged in the power plug on every
rollout.
33 / 38
1 Introduction
2 Preliminary
3 Integrating Model-Based Updates into PI2
4 Experimental Evaluation
5 Future work
34 / 38
Future work
• The current algorithm requires resetting the environment into
a consistent initial state. Recent work proposes a clustering
method for lifting this restriction by sampling trajectories
from random initial states; integrating this technique would
further improve the method's generality.
• The method requires a continuous action space. Extensions to
discrete or hybrid action spaces would require some kind of
continuous relaxation.
35 / 38
References I
Yevgen Chebotar, Mrinal Kalakrishnan, Ali Yahya, Adrian Li,
Stefan Schaal, and Sergey Levine.
Path integral guided policy search.
In Proceedings of the IEEE International Conference on
Robotics and Automation (ICRA), April 2017.
Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey
Levine.
Continuous deep Q-learning with model-based acceleration.
arXiv preprint arXiv:1603.00748, 2016.
36 / 38
References II
Nicolas Heess, Gregory Wayne, David Silver, Tim Lillicrap,
Tom Erez, and Yuval Tassa.
Learning continuous control policies by stochastic value
gradients.
In Advances in Neural Information Processing Systems, pages
2944–2952, 2015.
Sergey Levine and Pieter Abbeel.
Learning neural network policies with guided policy search
under unknown dynamics.
In Advances in Neural Information Processing Systems, pages
1071–1079, 2014.
37 / 38
References III
Yuval Tassa, Tom Erez, and Emanuel Todorov.
Synthesis and stabilization of complex behaviors through
online trajectory optimization.
In IROS, pages 4906–4913. IEEE, 2012.
Evangelos Theodorou, Jonas Buchli, and Stefan Schaal.
A generalized path integral control approach to reinforcement
learning.
Journal of Machine Learning Research, 11(Nov):3137–3181,
2010.
38 / 38