Combining Model-Based and
Model-Free Updates for
Trajectory-Centric Reinforcement
Learning (ICML2017)
Reader: Kusano Hitoshi (Kyoto University)
July 9, 2017
1 / 38
Index
1 Introduction
2 Preliminary
3 Integrating Model-Based Updates into PI2
4 Experimental Evaluation
5 Future work
2 / 38
About the authors
Yevgen Chebotar∗¹, Karol Hausman∗¹, Marvin Zhang∗²,
Gaurav Sukhatme¹, Stefan Schaal¹, Sergey Levine²
∗ denotes equal contribution
¹ University of Southern California
² University of California, Berkeley
All authors marked with ∗ are Ph.D. students.
3 / 38
1 Introduction
2 Preliminary
3 Integrating Model-Based Updates into PI2
4 Experimental Evaluation
5 Future work
4 / 38
2 requirements for RL
• A goal of reinforcement learning (RL):
learn motor skills through trial and error
• In RL, there are 2 requirements:
1 Data efficiency: a lower sampling cost is better
2 Compatibility with unknown dynamical systems
Figure: Example of environment used to evaluate their method
5 / 38
Trade-off between 2 types of RL and hybrid
There are 2 types of RL:
• Model-based: the dynamics are known or estimated
• Model-free: the dynamics are not modeled
Usually there is a trade-off between them:

                  Model-Based      Model-Free
Data efficiency   More efficient   Less efficient
Compatibility     Lower            Higher

In this research, they try to integrate the two and achieve
both data efficiency and compatibility with unknown dynamics
6 / 38
Novelty of this research
• Prior works that combine the two achieve only modest gains in
efficiency and performance [3, 2]
Novelty of this research
• Proposes an integrated algorithm that combines model-based
efficiency with model-free generality, in the context of a
specific policy representation: time-varying linear-Gaussian
(TVLG) controllers
• Can solve complex continuous control tasks that are
infeasible for either approach alone
• Remains orders of magnitude more efficient than standard
model-free RL
• Can be extended to arbitrary policy classes via Guided Policy
Search [4]
7 / 38
1 Introduction
2 Preliminary
3 Integrating Model-Based Updates into PI2
4 Experimental Evaluation
5 Future work
8 / 38
Optimize the expected cost of the policy
under TVLG controllers
• Goal of policy search: optimize θ, the parameters of
p(ut|xt), where ut: action and xt: state at time t
• Cost: J(θ) = Ep[c(τ)] = ∫ c(τ) p(τ) dτ, where
  • τ = (x1, u1, ..., xT, uT): trajectory
  • c(τ) = ∑_{t=1}^T c(xt, ut): trajectory cost
  • p(τ) = p(x1) ∏_{t=1}^T p(xt+1|xt, ut) p(ut|xt): policy trajectory
    distribution
• Assume time-varying linear-Gaussian (TVLG) controllers:
p(ut|xt) = N(Kt xt + kt, Σt) (a rollout sketch follows this slide)
• Model-Based: LQR-FLM
• Model-Free: PI2
9 / 38
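To make the TVLG representation concrete, here is a minimal rollout sketch (not the authors' code; `dynamics_step` and `cost` are hypothetical stand-ins for the real system, and `K`, `k`, `Sigma` are assumed to be lists of per-time-step parameters):

```python
import numpy as np

def rollout_tvlg(K, k, Sigma, x0, dynamics_step, cost, T):
    """Sample one trajectory from p(u_t|x_t) = N(K_t x_t + k_t, Sigma_t)."""
    x = x0
    xs, us, total_cost = [], [], 0.0
    for t in range(T):
        mean = K[t] @ x + k[t]                             # time-varying linear mean
        u = np.random.multivariate_normal(mean, Sigma[t])  # Gaussian exploration
        total_cost += cost(x, u)                           # accumulate c(x_t, u_t)
        xs.append(x)
        us.append(u)
        x = dynamics_step(x, u)                            # true (unknown) dynamics
    return np.array(xs), np.array(us), total_cost
```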
Model-based optimization of TVLG
controller, LQR-FLM
• LQR-FLM is a model-based RL algorithm based
on prior works [4, 5]
• Use samples to fit a TVLG dynamics model
p(xt+1|xt, ut) = N(fx,txt + fu,tut, Ft) and assume a
twice-differentiable cost function.
• A KL-divergence constraint is used to deal with the unknown
dynamics [4]
• LQR-FLM is efficient, though it depends heavily on being able
to model the system dynamics accurately (a sketch of the
model-fitting step follows this slide)
Optimize
min_{p^(i)} E_{p^(i)}[Q(xt, ut)]   s.t.   E_{p^(i)}[DKL(p^(i) ∥ p^(i−1))] ≤ ϵt   (1)
10 / 38
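As a rough illustration of the model-fitting step that LQR-FLM relies on, the sketch below fits a TVLG dynamics model p(xt+1|xt, ut) = N(fx,t xt + fu,t ut + fc,t, Ft) by per-time-step least squares on sampled transitions. This is a simplified stand-in under assumed array shapes; the fitting used in the paper is more sophisticated.

```python
import numpy as np

def fit_tvlg_dynamics(X, U):
    """X: (N, T+1, dx) states, U: (N, T, du) actions from N sampled rollouts."""
    N, T, du = U.shape
    dx = X.shape[2]
    fx, fu, fc, F = [], [], [], []
    for t in range(T):
        XU = np.concatenate([X[:, t], U[:, t], np.ones((N, 1))], axis=1)  # regressors
        Y = X[:, t + 1]                                                   # next states
        W, *_ = np.linalg.lstsq(XU, Y, rcond=None)                        # least squares
        fx.append(W[:dx].T)            # f_{x,t}
        fu.append(W[dx:dx + du].T)     # f_{u,t}
        fc.append(W[-1])               # constant offset
        resid = Y - XU @ W
        F.append(resid.T @ resid / N)  # residual covariance F_t
    return fx, fu, fc, F
```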
Model-free optimization via PI2
• PI2 is a model-free algorithm based on stochastic optimal
control [6]
• Let S(xi,t, ui,t) = c(xi,t, ui,t) + ∑_{j=t+1}^T c(xi,j, ui,j), the
cost-to-go of trajectory i ∈ {1, ..., N}
• Probability
P(xi,t, ui,t) = exp(−(1/ηt) S(xi,t, ui,t)) / ∫ exp(−(1/ηt) S(xi,t, ui,t)) dui,t
Intuition: trajectories with lower costs receive higher
probabilities
• After computing the new P, we update the policy distribution
by reweighting each sampled control ui,t by P(xi,t, ui,t) and
update the policy parameters by maximum-likelihood
estimation [1] (a sketch of this reweighting follows this slide)
11 / 38
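A minimal sketch of the PI2 reweighting intuition described above (assuming a fixed, user-chosen temperature `eta`; not the authors' implementation):

```python
import numpy as np

def pi2_weights(S_t, eta):
    """S_t: (N,) cost-to-go of N sampled trajectories at one time step."""
    z = -(S_t - S_t.min()) / eta     # subtract the minimum for numerical stability
    w = np.exp(z)
    return w / w.sum()               # normalized probabilities P(x_i,t, u_i,t)

def pi2_update_controls(U_t, S_t, eta):
    """Reweighted (maximum-likelihood) estimate of the new mean control."""
    w = pi2_weights(S_t, eta)
    return (w[:, None] * U_t).sum(axis=0)   # U_t: (N, du) sampled controls
```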
Model-free optimization via PI2
Theorem 1:
The PI2 update corresponds to
min_{p^(i)} E_{p^(i)}[S(xt, ut)]   s.t.   E_{p^(i)}[DKL(p^(i) ∥ p^(i−1))] ≤ ϵ,   (2)
where ϵ is the maximum KL-divergence between the new policy
p^(i)(ut|xt) and the old policy p^(i−1)(ut|xt).
PI2 has been used to solve robotic tasks such as door opening,
and achieves better final performance than LQR-FLM.
Minimizing (2) yields the relationship
p^(i)(ut|xt) ∝ p^(i−1)(ut|xt) E_{p^(i−1)}[exp(−(1/η) S(xt, ut))]   (3)
12 / 38
1 Introduction
2 Preliminary
3 Integrating Model-Based Updates into PI2
4 Experimental Evaluation
5 Future work
13 / 38
Overview
• Both PI2 and LQR-FLM can be used to learn TVLG policies, and
each has strengths and weaknesses
• Divide the PI2 update into 2 parts: one part using a
model-based cost approximation, and another part using the
residual cost error left after the first
• Employ LQR-FLM as the model-based cost approximation and
integrate the two
• Demonstrate that the method retains the strengths of PI2 and
LQR-FLM while compensating for their weaknesses
14 / 38
Two-Stage PI2 update
• Divide PI2 into 2 parts
• Let
  • ĉ(xt, ut): approximate cost
  • c(xt, ut): real cost
  • c̃(xt, ut): residual cost
  • Ŝt: approximate cost-to-go
  • St: real cost-to-go
  • S̃t: residual cost-to-go
  • where Ŝt = Ŝt(xt, ut), St = St(xt, ut), S̃t = S̃t(xt, ut),
    c̃(xt, ut) = c(xt, ut) − ĉ(xt, ut), S̃t = St − Ŝt
p^(i)(ut|xt) ∝ p^(i−1)(ut|xt) E_{p^(i−1)}[exp(−(1/η)(Ŝt + S̃t))]
           ∝ p̂(ut|xt) E_{p^(i−1)}[exp(−(1/η) S̃t)]   (4)
where p̂(ut|xt) = p^(i−1)(ut|xt) E_{p^(i−1)}[exp(−(1/η) Ŝt)]
(a sketch of the residual cost-to-go computation follows this slide)
15 / 38
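As a sketch of the cost split (not the authors' code; `cost` and `approx_cost` are hypothetical callables, with `approx_cost` standing in for the model-based approximation ĉ), the residual cost-to-go S̃t fed to the PI2 stage can be computed per trajectory like this:

```python
import numpy as np

def residual_cost_to_go(xs, us, cost, approx_cost):
    """xs: (T, dx), us: (T, du) from one sampled trajectory."""
    c = np.array([cost(x, u) for x, u in zip(xs, us)])            # real costs c_t
    c_hat = np.array([approx_cost(x, u) for x, u in zip(xs, us)]) # approx costs c_hat_t
    S = np.cumsum(c[::-1])[::-1]          # real cost-to-go S_t (sum from t to T)
    S_hat = np.cumsum(c_hat[::-1])[::-1]  # approximate cost-to-go S_hat_t
    return S - S_hat                      # residual S_tilde_t handled by PI2
```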
Two-Stage PI2 update: Summary
Whole Update Procedure
1 Update using the approximate costs ĉ(xt, ut) and samples from
the old policy p^(i−1)(ut|xt) to get p̂(ut|xt)
2 Update p^(i)(ut|xt) using the residual costs c̃(xt, ut) and
samples from p̂(ut|xt)
p̂(ut|xt) = p^(i−1)(ut|xt) E_{p^(i−1)}[exp(−(1/η) Ŝt)]   (5)
p^(i)(ut|xt) ∝ p̂(ut|xt) E_{p^(i−1)}[exp(−(1/η) S̃t)]   (6)
16 / 38
Model-Based Substitution with LQR-FLM
We can use the correspondence between Eq. (2) and Eq. (3) to
rewrite Eq. (5) as an optimization problem.
Rewritten Eq. (5)
min_{p̂} E_{p̂}[Ŝ(xt, ut)]   s.t.   E_{p^(i−1)}[DKL(p̂ ∥ p^(i−1))] ≤ ϵ   (7)
Thus, p̂(ut|xt) can be obtained using any algorithm that can
solve this optimization problem.
They employ LQR-FLM for this update.
17 / 38
Optimizing Cost Residuals with PI2
Assume the same structure for both controllers.
Controller for p^(i−1)(ut|xt)
ui,t = Kt xi,t + kt + √Σt ξi,t   (8)
where Kt, kt, Σt: parameters of p^(i−1)
Controller for p̂(ut|xt)
ûi,t = K̂t xi,t + k̂t + √Σ̂t ξi,t   (9)
where K̂t, k̂t, Σ̂t: parameters of p̂,
and ξi,t: the noise corresponding to sample i at time t
(a sketch of how this shared noise is reused follows this slide)
18 / 38
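One way to read Eqs. (8)–(9) is that, because both controllers share the same noise ξi,t, samples "from" p̂ can be constructed from the old policy's samples without extra environment interaction. The sketch below illustrates this (a Cholesky factor is used as the matrix square root; names and shapes are assumptions, not the authors' code):

```python
import numpy as np

def reuse_noise(u, x, K, k, Sigma, K_hat, k_hat, Sigma_hat):
    """Map a sampled control u of p^(i-1) at state x to the matching sample of p_hat."""
    L = np.linalg.cholesky(Sigma)             # one choice of sqrt(Sigma_t)
    xi = np.linalg.solve(L, u - K @ x - k)    # recover the noise xi_{i,t}
    L_hat = np.linalg.cholesky(Sigma_hat)     # sqrt(Sigma_hat_t)
    return K_hat @ x + k_hat + L_hat @ xi     # u_hat_{i,t} driven by the same noise
```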
Summary of PILQR algorithm
19 / 38
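The slide title refers to the paper's algorithm summary. As a rough pseudocode-style sketch of one PILQR iteration (all arguments are hypothetical callables, and the paper's cost-splitting and step-size adjustment details are omitted):

```python
def pilqr_iteration(p_prev, collect_samples, fit_dynamics, approx_cost,
                    lqr_flm_update, residual_cost_to_go, pi2_update):
    """One PILQR iteration: model-based step on c_hat, PI2 step on the residual."""
    samples = collect_samples(p_prev)                 # roll out p^(i-1)
    dynamics = fit_dynamics(samples)                  # local TVLG dynamics model
    c_hat = approx_cost(samples, dynamics)            # model-based cost approximation
    p_hat = lqr_flm_update(p_prev, dynamics, c_hat)   # Eq. (7): KL-constrained LQR step
    S_res = residual_cost_to_go(samples, c_hat)       # S_tilde = S - S_hat
    return pi2_update(p_hat, samples, S_res)          # Eq. (6): reweight by exp(-S_tilde/eta)
```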
Training Parametric Policies with GPS
PILQR offers an approach to trajectory optimization of TVLG
policies only.
We can extend PILQR to train parametric policies by employing
mirror descent guided policy search (MDGPS).
Procedure
1 PILQR learns simple TVLG policies, the local policies
p(ut|xt), for each condition
2 The optimized local controllers are used to train the global
policy πθ of MDGPS in a supervised manner (a sketch follows
this slide)
Hence, the global policy generalizes across the conditions of
the multiple local policies.
20 / 38
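A minimal sketch of the supervised MDGPS step: the global policy is fit to state-action pairs produced by the optimized local TVLG policies. Here a linear least-squares regressor stands in for the neural network policy used in the paper, and `local_rollouts` is an assumed data layout.

```python
import numpy as np

def mdgps_supervised_step(local_rollouts):
    """local_rollouts: list of (states (T, dx), actions (T, du)), one per condition."""
    X = np.concatenate([s for s, _ in local_rollouts])      # all visited states
    U = np.concatenate([u for _, u in local_rollouts])      # local-policy actions
    Xb = np.concatenate([X, np.ones((len(X), 1))], axis=1)  # add a bias feature
    W, *_ = np.linalg.lstsq(Xb, U, rcond=None)              # regress actions on states
    return lambda x: np.append(x, 1.0) @ W                  # stand-in global policy pi_theta
```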
1 Introduction
2 Preliminary
3 Integrating Model-Based Updates into PI2
4 Experimental Evaluation
5 Future work
21 / 38
Aim of Experiment
To answer the questions below, they conducted both simulated
comparisons and real-robot experiments.
Key questions:
• How does our method compare to other
trajectory-centric and deep RL algorithms in terms
of final performance and sample efficiency?
• Can we utilize linear-Gaussian policies trained
using PILQR to obtain robust neural network
policies using MDGPS?
• Is our proposed algorithm capable of learning
complex manipulation skills on a real robotic
platform?
22 / 38
Simulation Experiments
Evaluate on 3 simulated robotic manipulation tasks
1 Pushing a block: Involves controlling a 4 DoF arm to push a white block
to a red goal area. The cost function is a weighted combination of the
distance from the gripper to the block and from the block to the goal.
2 Opening a door in 3D: Requires opening a door with a 6 DoF 3D arm.
The cost function is a weighted combination of the distance of the end
effector to the door handle and the angle of the door.
3 Reacher: Requires moving the end of a 2 DoF arm to a target position.
The cost function is the distance from the end effector to the target.
Figure: Simulated robotic manipulation tasks
23 / 38
Pushing a block
Comparison of PILQR to LQR-FLM and PI2
on the
most difficult condition for the gripper pusher task.
Figure: Average final distance
Fast convergence and better performance
24 / 38
Pushing a block
PILQR solves all four conditions with 400 total episodes per
condition, showing it is able to learn a diverse set of
successful behaviors including flicking, guiding, and hitting
the block.
25 / 38
Door opening task
PILQR succeeds at opening the door from each of the four
initial robot positions, while LQR-FLM fails to open the door.
26 / 38
Neural network policies on the reacher task
Results for MDGPS with each local policy method, as
well as two prior deep RL methods TRPO and DDPG
Figure: Final distance from the reacher end effector to the target
averaged across 300 random test conditions per iteration
The task itself is quite simple, but PILQR-MDGPS converges 25
and 150 times faster than DDPG and TRPO respectively, while
prior hybrid methods are only up to about 5 times faster
27 / 38
Neural network policies on the door opening
Results for MDGPS with each local policy method, as
well as two prior deep RL methods TRPO and DDPG
Figure: Minimum angle in radians of the door hinge averaged
across 100 random test conditions per iteration
TRPO requires 20 times more samples; the other methods fail.
28 / 38
Task Description
They conducted 2 kinds of robot experiments.
Hockey
Requires using a stick to hit a puck into a goal 1.4 m
away. The cost function consists of two parts: the
distance between the current position of the stick and a
target pose that is close to the puck, and the distance
between the position of the puck and the goal.
Power plug plugging
In this task, the robot must plug a power plug into an
outlet. The cost function is the distance between the
plug and a target location inside the outlet.
29 / 38
Environment
Figure: A PR2 robot is used for the real-robot experiments.
Left: the hockey task, Right: the power plug task
In the hockey task, the puck and the goal are tracked using a
motion capture system.
30 / 38
Hockey task 1
Learn a policy that is able to hit the puck into the goal
for a single position of the goal and the puck
Figure: Single condition comparison of the hockey task
31 / 38
Hockey task 2
Learn a neural network policy using the MDGPS-PILQR algorithm
that can hit the puck into different goal locations. They
collected 30 rollouts for 3 different positions and trained the
neural network on them.
Figure: Right: Experimental setup for hockey task, left: success
rate for each position.
32 / 38
Power plug plugging
This task requires very fine manipulation. While LQR-FLM
succeeded only 60% of the time at convergence, their method
converged to a policy that plugged in the power plug on every
rollout.
33 / 38
1 Introduction
2 Preliminary
3 Integrating Model-Based Updates into PI2
4 Experimental Evaluation
5 Future work
34 / 38
Future work
• The current algorithm requires resetting the environment into
a consistent initial state. Recent work proposes a clustering
method for lifting this restriction by sampling trajectories
from random initial states; integrating this technique would
further improve the method's generality.
• The method requires a continuous action space. Extensions to
discrete or hybrid action spaces would require some kind of
continuous relaxation.
35 / 38
References I
Yevgen Chebotar, Mrinal Kalakrishnan, Ali Yahya, Adrian Li,
Stefan Schaal, and Sergey Levine.
Path integral guided policy search.
In Proceedings of the IEEE International Conference on
Robotics and Automation (ICRA), April 2017.
Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey
Levine.
Continuous deep Q-learning with model-based acceleration.
arXiv preprint arXiv:1603.00748, 2016.
36 / 38
References II
Nicolas Heess, Gregory Wayne, David Silver, Tim Lillicrap,
Tom Erez, and Yuval Tassa.
Learning continuous control policies by stochastic value
gradients.
In Advances in Neural Information Processing Systems, pages
2944–2952, 2015.
Sergey Levine and Pieter Abbeel.
Learning neural network policies with guided policy search
under unknown dynamics.
In Advances in Neural Information Processing Systems, pages
1071–1079, 2014.
37 / 38
References III
Yuval Tassa, Tom Erez, and Emanuel Todorov.
Synthesis and stabilization of complex behaviors through
online trajectory optimization.
In IROS, pages 4906–4913. IEEE, 2012.
Evangelos Theodorou, Jonas Buchli, and Stefan Schaal.
A generalized path integral control approach to reinforcement
learning.
Journal of Machine Learning Research, 11(Nov):3137–3181,
2010.
38 / 38