Deep Reinforcement Learning Through
Policy Optimization
John Schulman, OpenAI
October 20, 2016
Introduction and Overview
What is Reinforcement Learning?
Branch of machine learning concerned with taking
sequences of actions
Usually described in terms of agent interacting with a
previously unknown environment, trying to maximize
cumulative reward
[Diagram: agent-environment loop. The agent sends an action to the environment and receives an observation and a reward.]
What Is Deep Reinforcement Learning?
Reinforcement learning using neural networks to approximate
functions
Policies (select next action)
Value functions (measure goodness of states or
state-action pairs)
Models (predict next states and rewards)
Motor Control and Robotics
Robotics:
Observations: camera images, joint angles
Actions: joint torques
Rewards: stay balanced, navigate to target locations,
serve and protect humans
Business Operations
Inventory Management
Observations: current inventory levels
Actions: number of units of each item to purchase
Rewards: profit
In Other ML Problems
Hard Attention1
Observation: current image window
Action: where to look
Reward: classification
Sequential/structured prediction, e.g., machine
translation2
Observations: words in source language
Actions: emit word in target language
Rewards: sentence-level metric, e.g. BLEU score
1 V. Mnih et al. “Recurrent models of visual attention”. In: Advances in Neural Information Processing Systems. 2014, pp. 2204–2212.
2 H. Daumé III, J. Langford, and D. Marcu. “Search-based structured prediction”. In: Machine Learning 75.3 (2009), pp. 297–325; S. Ross, G. J. Gordon, and D. Bagnell. “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning.” In: AISTATS. vol. 1. 2. 2011, p. 6; M. Ranzato et al. “Sequence level training with recurrent neural networks”. In: arXiv preprint arXiv:1511.06732 (2015).
How Does RL Relate to Other ML Problems?
Supervised learning:
Environment samples input-output pair (xt, yt) ∼ ρ
Agent predicts ŷt = f(xt)
Agent receives loss ℓ(yt, ŷt)
Environment asks agent a question, and then tells her the
right answer
How Does RL Relate to Other ML Problems?
Contextual bandits:
Environment samples input xt ∼ ρ
Agent takes action ŷt = f(xt)
Agent receives cost ct ∼ P(ct | xt, ŷt), where P is an
unknown probability distribution
Environment asks agent a question, and gives her a noisy
score on her answer
Application: personalized recommendations
How Does RL Relate to Other ML Problems?
Reinforcement learning:
Environment samples input xt ∼ P(xt | xt−1, yt−1)
Input depends on your previous actions!
Agent takes action ŷt = f(xt)
Agent receives cost ct ∼ P(ct | xt, ŷt), where P is a
probability distribution unknown to the agent.
How Does RL Relate to Other Machine Learning
Problems?
Summary of differences between RL and supervised learning:
You don’t have full access to the function you’re trying to
optimize—must query it through interaction.
Interacting with a stateful world: inputs xt depend on your
previous actions
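To make the three interaction protocols concrete, here is a toy sketch in Python (numpy only). Everything in it is hypothetical: the "question" x, the right answer y = 2x, the noise scale, and the fixed predictor f are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: x is a scalar "question"; the best answer happens to be y = 2 * x.
def sample_x():
    return rng.normal()

def loss(y_true, y_pred):
    return (y_true - y_pred) ** 2

# --- Supervised learning: the environment reveals the right answer y_t.
def supervised_round(f):
    x = sample_x()
    y_true = 2.0 * x
    return loss(y_true, f(x))              # full feedback: y_true is known

# --- Contextual bandit: only a noisy score for the chosen answer.
def bandit_round(f):
    x = sample_x()
    cost = loss(2.0 * x, f(x)) + rng.normal(scale=0.1)
    return cost                            # partial feedback: y_true is never revealed

# --- Reinforcement learning: same partial feedback, but the next input depends on the action.
def rl_episode(f, T=10):
    x, total_cost = sample_x(), 0.0
    for _ in range(T):
        y_pred = f(x)
        total_cost += loss(2.0 * x, y_pred) + rng.normal(scale=0.1)
        x = 0.5 * x + y_pred               # the state evolves as a function of the action
    return total_cost

f = lambda x: 1.5 * x                      # some fixed predictor / policy
print(supervised_round(f), bandit_round(f), rl_episode(f))
```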
Should I Use Deep RL On My Practical Problem?
Might be overkill
Other methods worth investigating first
Derivative-free optimization (simulated annealing, cross
entropy method, SPSA)
Is it a contextual bandit problem?
Non-deep RL methods developed by Operations
Research community3
3 W. B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. Vol. 703. John Wiley & Sons, 2007.
Recent Success Stories in Deep RL
ATARI using deep Q-learning4, policy gradients5, DAGGER6
Superhuman Go using supervised learning + policy
gradients + Monte Carlo tree search + value functions7
Robotic manipulation using guided policy search8
Robotic locomotion using policy gradients9
3D games using policy gradients10
4 V. Mnih et al. “Playing Atari with Deep Reinforcement Learning”. In: arXiv preprint arXiv:1312.5602 (2013).
5 J. Schulman et al. “Trust Region Policy Optimization”. In: arXiv preprint arXiv:1502.05477 (2015).
6 X. Guo et al. “Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning”. In: Advances in Neural Information Processing Systems. 2014, pp. 3338–3346.
7 D. Silver et al. “Mastering the game of Go with deep neural networks and tree search”. In: Nature 529.7587 (2016), pp. 484–489.
8 S. Levine et al. “End-to-end training of deep visuomotor policies”. In: arXiv preprint arXiv:1504.00702 (2015).
9 J. Schulman et al. “High-dimensional continuous control using generalized advantage estimation”. In: arXiv preprint arXiv:1506.02438 (2015).
10 V. Mnih et al. “Asynchronous Methods for Deep Reinforcement Learning”. In: arXiv preprint arXiv:1602.01783 (2016).
Markov Decision Processes
Definition
Markov Decision Process (MDP) defined by (S, A, P),
where
S: state space
A: action space
P(r, s′ | s, a): transition + reward probability distribution
Extra objects defined depending on problem setting
µ: Initial state distribution
Optimization problem: maximize expected cumulative
reward
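As a concrete illustration of the objects (S, A, P), here is a tiny tabular MDP sketch in numpy; the number of states and actions, the transition probabilities, and the rewards are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny tabular MDP (hypothetical numbers): 2 states, 2 actions.
# P[s, a, s'] = probability of landing in s' after taking action a in state s.
# R[s, a]     = reward for taking action a in state s (deterministic here for simplicity).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[0.0, 1.0],
              [0.5, 2.0]])
mu = np.array([1.0, 0.0])                 # initial state distribution

def step(s, a):
    """Sample (r, s') ~ P(r, s' | s, a); the reward is a deterministic function of (s, a)."""
    s_next = rng.choice(2, p=P[s, a])
    return R[s, a], s_next

s = rng.choice(2, p=mu)                   # s0 ~ mu
r, s = step(s, a=1)                       # one environment transition
print(r, s)
```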
Episodic Setting
In each episode, the initial state is sampled from µ, and
the agent acts until the terminal state is reached. For
example:
Taxi robot reaches its destination (termination = good)
Waiter robot finishes a shift (fixed time)
Walking robot falls over (termination = bad)
Goal: maximize expected reward per episode
Policies
Deterministic policies: a = π(s)
Stochastic policies: a ∼ π(a | s)
Episodic Setting
s0 ∼ µ(s0)
a0 ∼ π(a0 | s0)
s1, r0 ∼ P(s1, r0 | s0, a0)
a1 ∼ π(a1 | s1)
s2, r1 ∼ P(s2, r1 | s1, a1)
. . .
aT−1 ∼ π(aT−1 | sT−1)
sT , rT−1 ∼ P(sT , rT−1 | sT−1, aT−1)
Objective:
maximize η(π), where
η(π) = E[r0 + r1 + · · · + rT−1 | π]
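A minimal rollout sketch of this sampling process, assuming a generic env_reset / env_step / policy interface (hypothetical names, not a real library API), with η(π) estimated by averaging episode returns. The example environment and policy at the bottom are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(env_reset, env_step, policy, T_max=1000):
    """Sample one episode: s0 ~ mu, a_t ~ pi(. | s_t), (s_{t+1}, r_t) ~ P(. | s_t, a_t)."""
    s = env_reset()                          # s0 ~ mu
    rewards = []
    for _ in range(T_max):
        a = policy(s)                        # a_t ~ pi(a | s_t)
        s, r, done = env_step(s, a)          # s_{t+1}, r_t ~ P(. | s_t, a_t)
        rewards.append(r)
        if done:
            break
    return sum(rewards)                      # episode return r_0 + ... + r_{T-1}

def estimate_eta(env_reset, env_step, policy, n_episodes=100):
    """Monte-Carlo estimate of eta(pi) = E[r_0 + ... + r_{T-1} | pi]."""
    return np.mean([rollout(env_reset, env_step, policy) for _ in range(n_episodes)])

# Tiny example environment (hypothetical): noisy walk that terminates when |s| > 3.
env_reset = lambda: 0.0
def env_step(s, a):
    s_next = s + a + rng.normal(scale=0.1)
    return s_next, -abs(s_next), abs(s_next) > 3.0   # reward favors staying near the origin
policy = lambda s: -0.5 * s + rng.normal(scale=0.1)  # a stochastic policy a ~ pi(a | s)

print(estimate_eta(env_reset, env_step, policy))
```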
Episodic Setting
[Diagram: episode rollout. Starting from s0 ∼ µ, the agent's policy π produces actions a0, . . . , aT−1, and the environment dynamics P produce states s1, . . . , sT and rewards r0, . . . , rT−1.]
Objective:
maximize η(π), where
η(π) = E[r0 + r1 + · · · + rT−1 | π]
Parameterized Policies
A family of policies indexed by parameter vector θ ∈ Rd
Deterministic: a = π(s, θ)
Stochastic: π(a | s, θ)
Analogous to classification or regression with input s,
output a.
Discrete action space: network outputs vector of
probabilities
Continuous action space: network outputs mean and
diagonal covariance of Gaussian
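A rough sketch of the two parameterizations, using a single linear layer as a stand-in for a deeper network; the feature vector, shapes, and parameter layout are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Discrete action space: the network outputs a vector of action probabilities.
def discrete_policy(s, theta):
    """theta = (W, b); a ~ Categorical(softmax(W s + b))."""
    W, b = theta
    probs = softmax(W @ s + b)
    return rng.choice(len(probs), p=probs)

# Continuous action space: the network outputs the mean and diagonal log-std of a Gaussian.
def gaussian_policy(s, theta):
    """theta = (W, b, log_std); a ~ N(W s + b, diag(exp(log_std))^2)."""
    W, b, log_std = theta
    mean = W @ s + b
    return mean + np.exp(log_std) * rng.normal(size=mean.shape)

# Example usage with a single linear layer (a stand-in for a deeper network):
s = np.array([0.5, -1.0, 2.0])
theta_d = (rng.normal(size=(4, 3)), np.zeros(4))               # 4 discrete actions
theta_c = (rng.normal(size=(2, 3)), np.zeros(2), np.zeros(2))  # 2-D continuous action
print(discrete_policy(s, theta_d), gaussian_policy(s, theta_c))
```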
Black-Box Optimization Methods
Derivative Free Optimization Approach
Objective:
maximize E[R | π(·, θ)]
View the mapping from policy parameters θ to episode return
R as a black box
Ignore all information other than R collected during the
episode
Cross-Entropy Method
Evolutionary algorithm
Works embarrassingly well
I. Szita and A. Lörincz. “Learning Tetris using the noisy cross-entropy method”. In: Neural Computation 18.12 (2006), pp. 2936–2941
V. Gabillon, M. Ghavamzadeh, and B. Scherrer. “Approximate Dynamic Programming Finally Performs Well in the Game of Tetris”. In: Advances in Neural Information Processing Systems. 2013
Cross-Entropy Method
Evolutionary algorithm
Works embarrassingly well
A similar algorithm, Covariance Matrix Adaptation, has
become standard in graphics.
Cross-Entropy Method
Initialize µ ∈ Rd, σ ∈ Rd
for iteration = 1, 2, . . . do
    Collect n samples θi ∼ N(µ, diag(σ))
    Perform a noisy evaluation of each θi, obtaining a return Ri
    Select the top p% of samples (e.g. p = 20), which we’ll
    call the elite set
    Fit a Gaussian distribution, with diagonal covariance,
    to the elite set, obtaining a new µ, σ.
end for
Return the final µ.
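A minimal numpy implementation of the pseudocode above. The quadratic test objective, noise scale, and all hyperparameter values are arbitrary choices for illustration, not part of the original algorithm specification.

```python
import numpy as np

def cross_entropy_method(evaluate, dim, n_samples=50, elite_frac=0.2,
                         n_iters=100, init_std=1.0, seed=0):
    """Minimal cross-entropy method over parameters theta in R^dim.

    `evaluate(theta)` should return a (possibly noisy) episode return R.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.full(dim, init_std)
    n_elite = max(1, int(round(elite_frac * n_samples)))
    for _ in range(n_iters):
        thetas = mu + sigma * rng.normal(size=(n_samples, dim))  # theta_i ~ N(mu, diag(sigma^2))
        returns = np.array([evaluate(th) for th in thetas])       # noisy evaluations R_i
        elite = thetas[np.argsort(returns)[-n_elite:]]            # top p% of samples ("elite set")
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3  # refit diagonal Gaussian
    return mu

# Hypothetical sanity check: maximize a noisy quadratic with optimum at theta* = (1, -2, 3).
target = np.array([1.0, -2.0, 3.0])
f = lambda th: -np.sum((th - target) ** 2) + np.random.normal(scale=0.1)
print(cross_entropy_method(f, dim=3))
```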
Cross-Entropy Method
Analysis: a very similar algorithm is a
minorization-maximization (MM) algorithm, guaranteed
to monotonically increase expected reward
Recall that the Monte-Carlo EM algorithm collects samples,
reweights them, and then maximizes their logprob
We can derive an MM algorithm where at each iteration you
maximize ∑i Ri log p(θi)
Policy Gradient Methods
Policy Gradient Methods: Overview
Problem:
maximize E[R | πθ]
Intuitions: collect a bunch of trajectories, and ...
1. Make the good trajectories more probable
2. Make the good actions more probable
3. Push the actions towards good actions (DPG11, SVG12)
11 D. Silver et al. “Deterministic policy gradient algorithms”. In: ICML. 2014.
12 N. Heess et al. “Learning continuous control policies by stochastic value gradients”. In: Advances in Neural Information Processing Systems. 2015, pp. 2926–2934.
Score Function Gradient Estimator
Consider an expectation Ex∼p(x | θ)[f(x)]. We want to compute the
gradient with respect to θ:
∇θ Ex [f(x)] = ∇θ ∫ dx p(x | θ) f(x)
             = ∫ dx ∇θ p(x | θ) f(x)
             = ∫ dx p(x | θ) (∇θ p(x | θ) / p(x | θ)) f(x)
             = ∫ dx p(x | θ) ∇θ log p(x | θ) f(x)
             = Ex [f(x) ∇θ log p(x | θ)].
The last expression gives us an unbiased gradient estimator. Just
sample xi ∼ p(x | θ), and compute ĝi = f(xi) ∇θ log p(xi | θ).
Need to be able to compute and differentiate the density p(x | θ)
with respect to θ
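A quick numerical sanity check of the estimator, for the illustrative case p(x | θ) = N(θ, 1) and f(x) = x², where E[f(x)] = θ² + 1 and the true gradient is 2θ; the score is ∇θ log p(x | θ) = x − θ.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 1.5
x = rng.normal(loc=theta, scale=1.0, size=200_000)  # x_i ~ p(x | theta) = N(theta, 1)

f = x ** 2                                          # f(x_i)
score = x - theta                                   # d/dtheta log N(x_i | theta, 1)
g_hat = f * score                                   # per-sample estimates g_i = f(x_i) * score_i

print(g_hat.mean(), 2 * theta)                      # sample-mean estimate vs. analytic gradient
```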
Score Function Gradient Estimator: Intuition
ĝi = f(xi) ∇θ log p(xi | θ)
Let’s say that f(x) measures how good the sample x is.
Moving in the direction ĝi pushes up the logprob of the
sample, in proportion to how good it is
Valid even if f(x) is discontinuous or unknown, or the
sample space (containing x) is a discrete set
Score Function Gradient Estimator for Policies
Now the random variable x is a whole trajectory
τ = (s0, a0, r0, s1, a1, r1, . . . , sT−1, aT−1, rT−1, sT )
∇θ Eτ [R(τ)] = Eτ [∇θ log p(τ | θ) R(τ)]
Just need to write out p(τ | θ):
p(τ | θ) = µ(s0) ∏_{t=0}^{T−1} [π(at | st, θ) P(st+1, rt | st, at)]
log p(τ | θ) = log µ(s0) + ∑_{t=0}^{T−1} [log π(at | st, θ) + log P(st+1, rt | st, at)]
∇θ log p(τ | θ) = ∇θ ∑_{t=0}^{T−1} log π(at | st, θ)
∇θ Eτ [R] = Eτ [R ∇θ ∑_{t=0}^{T−1} log π(at | st, θ)]
Interpretation: use good trajectories (high R) as supervised
examples in classification / regression
Policy Gradient: Use Temporal Structure
Previous slide:
∇θ Eτ [R] = Eτ [(∑_{t=0}^{T−1} rt)(∑_{t=0}^{T−1} ∇θ log π(at | st, θ))]
We can repeat the same argument to derive the gradient
estimator for a single reward term rt′:
∇θ E[rt′] = E[rt′ ∑_{t=0}^{t′} ∇θ log π(at | st, θ)]
Summing this formula over t′, we obtain
∇θ E[R] = E[∑_{t′=0}^{T−1} rt′ ∑_{t=0}^{t′} ∇θ log π(at | st, θ)]
        = E[∑_{t=0}^{T−1} ∇θ log π(at | st, θ) ∑_{t′=t}^{T−1} rt′]
Policy Gradient: Introduce Baseline
Further reduce variance by introducing a baseline b(s):
∇θ Eτ [R] = Eτ [∑_{t=0}^{T−1} ∇θ log π(at | st, θ) (∑_{t′=t}^{T−1} rt′ − b(st))]
For any choice of b, the gradient estimator is unbiased.
A near-optimal choice is the expected return,
b(st) ≈ E[rt + rt+1 + rt+2 + · · · + rT−1]
Interpretation: increase the logprob of action at proportionally
to how much the returns ∑_{t′=t}^{T−1} rt′ are better than expected
Discounts for Variance Reduction
Introduce a discount factor γ, which ignores delayed effects
between actions and rewards:
∇θ Eτ [R] ≈ Eτ [∑_{t=0}^{T−1} ∇θ log π(at | st, θ) (∑_{t′=t}^{T−1} γ^{t′−t} rt′ − b(st))]
Now we want
b(st) ≈ E[rt + γ rt+1 + γ² rt+2 + · · · + γ^{T−1−t} rT−1]
Write the gradient estimator more generally as
∇θ Eτ [R] ≈ Eτ [∑_{t=0}^{T−1} ∇θ log π(at | st, θ) Ât]
where Ât is the advantage estimate
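A small helper sketch for the two quantities above, discounted returns and advantage estimates; the reward and baseline numbers in the usage example are arbitrary, and the baseline values are assumed to be given rather than fitted.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """R_t = sum over t' >= t of gamma^(t'-t) * r_{t'}, for one trajectory."""
    R, running = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        R[t] = running
    return R

def advantages(rewards, baselines, gamma):
    """A_hat_t = R_t - b(s_t), with b(s_t) supplied as an array of baseline values."""
    return discounted_returns(rewards, gamma) - np.asarray(baselines)

# Example: rewards from one episode and a (hypothetical) baseline prediction per state.
r = [0.0, 0.0, 1.0, -0.5]
b = [0.2, 0.3, 0.4, 0.1]
print(advantages(r, b, gamma=0.99))
```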
“Vanilla” Policy Gradient Algorithm
Initialize policy parameter θ, baseline b
for iteration = 1, 2, . . . do
    Collect a set of trajectories by executing the current policy
    At each timestep in each trajectory, compute
        the return Rt = ∑_{t′=t}^{T−1} γ^{t′−t} rt′, and
        the advantage estimate Ât = Rt − b(st).
    Re-fit the baseline, by minimizing ‖b(st) − Rt‖², summed over
    all trajectories and timesteps.
    Update the policy, using a policy gradient estimate ĝ,
    which is a sum of terms ∇θ log π(at | st, θ) Ât
end for
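A compact, simplified sketch of the loop above, using a log-linear softmax policy on a made-up one-dimensional environment; as a simplification, the baseline is the per-timestep average return across the batch rather than a fitted b(s). Environment, features, horizon, and hyperparameters are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical toy environment: scalar position s, actions {0: step left, 1: step right},
# reward 1 for every timestep spent to the right of the origin, fixed horizon T = 20.
def run_episode(W, T=20):
    s, states, actions, rewards = 0.0, [], [], []
    for _ in range(T):
        feats = np.array([s, 1.0])               # simple state features (position, bias)
        a = rng.choice(2, p=softmax(W @ feats))  # a ~ pi(a | s, theta) with theta = W
        states.append(feats)
        actions.append(a)
        s += 1.0 if a == 1 else -1.0
        rewards.append(1.0 if s > 0 else 0.0)
    return states, actions, rewards

def discounted_returns(rewards, gamma=0.99):
    R, running = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        R[t] = running
    return R

W, lr = np.zeros((2, 2)), 0.05
for iteration in range(200):
    trajs = [run_episode(W) for _ in range(10)]                     # collect a batch of trajectories
    all_R = np.array([discounted_returns(r) for _, _, r in trajs])  # returns R_t, shape (n_traj, T)
    baseline = all_R.mean(axis=0)                                   # simple per-timestep baseline
    grad = np.zeros_like(W)
    for (states, actions, _), R in zip(trajs, all_R):
        A_hat = R - baseline                                        # advantage estimates A_hat_t
        for feats, a, adv in zip(states, actions, A_hat):
            probs = softmax(W @ feats)
            grad += np.outer(np.eye(2)[a] - probs, feats) * adv     # grad_W log pi(a|s_t) * A_hat_t
    W += lr * grad / len(trajs)                                     # gradient ascent on expected return

avg_return = np.mean([sum(run_episode(W)[2]) for _ in range(20)])
print("average return after training:", avg_return)                # ideally approaches 20 ("always right")
```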
Extension: Step Sizes and Trust Regions
Why are step sizes a big deal in RL?
Supervised learning
Step too far → next update will fix it
Reinforcement learning
Step too far → bad policy
Next batch: collected under bad policy
Can’t recover, collapse in performance!
Extension: Step Sizes and Trust Regions
Trust Region Policy Optimization: limit the KL divergence
between the action distributions of the pre-update and
post-update policies13
Es [DKL(πold(· | s) ‖ π(· | s))] ≤ δ
Closely related to previous natural policy gradient
methods14
13 J. Schulman et al. “Trust Region Policy Optimization”. In: arXiv preprint arXiv:1502.05477 (2015).
14 S. Kakade. “A Natural Policy Gradient.” In: NIPS. vol. 14. 2001, pp. 1531–1538; J. A. Bagnell and J. Schneider. “Covariant policy search”. In: IJCAI. 2003; J. Peters and S. Schaal. “Natural actor-critic”. In: Neurocomputing 71.7 (2008), pp. 1180–1190.
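A small sketch of checking the trust-region constraint above for discrete action distributions, with the expectation over states approximated by a batch average; the probability tables and δ below are made up, and a small epsilon is added for numerical safety.

```python
import numpy as np

def kl_categorical(p, q, eps=1e-12):
    """Row-wise D_KL(p || q) for categorical action distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

# Hypothetical action probabilities of the pre-update and post-update policy
# at two sampled states, and a hypothetical trust-region size delta.
pi_old = np.array([[0.70, 0.20, 0.10],
                   [0.30, 0.40, 0.30]])
pi_new = np.array([[0.60, 0.25, 0.15],
                   [0.35, 0.35, 0.30]])
delta = 0.01

mean_kl = kl_categorical(pi_old, pi_new).mean()   # approximates E_s[D_KL(pi_old || pi_new)]
print(mean_kl, mean_kl <= delta)
```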
Extension: Further Variance Reduction
Use value functions for more variance reduction (at the
cost of bias): actor-critic methods15
Reparameterization trick: instead of increasing the
probability of the good actions, push the actions towards
(hopefully) better actions16
15 J. Schulman et al. “High-dimensional continuous control using generalized advantage estimation”. In: arXiv preprint arXiv:1506.02438 (2015); V. Mnih et al. “Asynchronous Methods for Deep Reinforcement Learning”. In: arXiv preprint arXiv:1602.01783 (2016).
16 D. Silver et al. “Deterministic policy gradient algorithms”. In: ICML. 2014; N. Heess et al. “Learning continuous control policies by stochastic value gradients”. In: Advances in Neural Information Processing Systems. 2015, pp. 2926–2934.
Demo
Fin
Thank you. Questions?