Deep Reinforcement Learning Through
Policy Optimization
John Schulman, OpenAI
October 20, 2016
Introduction and Overview
What is Reinforcement Learning?
Branch of machine learning concerned with taking
sequences of actions
Usually described in terms of agent interacting with a
previously unknown environment, trying to maximize
cumulative reward
[Diagram: agent-environment loop. The agent sends an action to the environment and receives an observation and a reward.]
What Is Deep Reinforcement Learning?
Reinforcement learning using neural networks to approximate
functions
Policies (select next action)
Value functions (measure goodness of states or
state-action pairs)
Models (predict next states and rewards)
Motor Control and Robotics
Robotics:
Observations: camera images, joint angles
Actions: joint torques
Rewards: stay balanced, navigate to target locations,
serve and protect humans
Business Operations
Inventory Management
Observations: current inventory levels
Actions: number of units of each item to purchase
Rewards: profit
In Other ML Problems
Hard Attention1
Observation: current image window
Action: where to look
Reward: classification
Sequential/structured prediction, e.g., machine
translation2
Observations: words in source language
Actions: emit word in target language
Rewards: sentence-level metric, e.g. BLEU score
1 V. Mnih et al. “Recurrent models of visual attention”. In: Advances in Neural Information Processing Systems. 2014, pp. 2204–2212.
2 H. Daumé III, J. Langford, and D. Marcu. “Search-based structured prediction”. In: Machine Learning 75.3 (2009), pp. 297–325; S. Ross, G. J. Gordon, and D. Bagnell. “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning.” In: AISTATS. vol. 1. 2. 2011, p. 6; M. Ranzato et al. “Sequence level training with recurrent neural networks”. In: arXiv preprint arXiv:1511.06732 (2015).
How Does RL Relate to Other ML Problems?
Supervised learning:
Environment samples input-output pair (xt, yt) ∼ ρ
Agent predicts ŷt = f(xt)
Agent receives loss ℓ(yt, ŷt)
Environment asks agent a question, and then tells her the
right answer
How Does RL Relate to Other ML Problems?
Contextual bandits:
Environment samples input xt ∼ ρ
Agent takes action ŷt = f(xt)
Agent receives cost ct ∼ P(ct | xt, ŷt), where P is an
unknown probability distribution
Environment asks agent a question, and gives her a noisy
score on her answer
Application: personalized recommendations
How Does RL Relate to Other ML Problems?
Reinforcement learning:
Environment samples input xt ∼ P(xt | xt−1, yt−1)
Input depends on your previous actions!
Agent takes action ŷt = f(xt)
Agent receives cost ct ∼ P(ct | xt, ŷt), where P is a
probability distribution unknown to the agent.
How Does RL Relate to Other Machine Learning
Problems?
Summary of differences between RL and supervised learning:
You don’t have full access to the function you’re trying to
optimize—must query it through interaction.
Interacting with a stateful world: inputs xt depend on your
previous actions
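To make the three interaction protocols concrete, here is a toy sketch in Python (numpy only). Everything in it is hypothetical: the "question" x, the right answer y = 2x, the noise scale, and the fixed predictor f are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: x is a scalar "question"; the best answer happens to be y = 2 * x.
def sample_x():
    return rng.normal()

def loss(y_true, y_pred):
    return (y_true - y_pred) ** 2

# --- Supervised learning: the environment reveals the right answer y_t.
def supervised_round(f):
    x = sample_x()
    y_true = 2.0 * x
    return loss(y_true, f(x))              # full feedback: y_true is known

# --- Contextual bandit: only a noisy score for the chosen answer.
def bandit_round(f):
    x = sample_x()
    cost = loss(2.0 * x, f(x)) + rng.normal(scale=0.1)
    return cost                            # partial feedback: y_true is never revealed

# --- Reinforcement learning: same partial feedback, but the next input depends on the action.
def rl_episode(f, T=10):
    x, total_cost = sample_x(), 0.0
    for _ in range(T):
        y_pred = f(x)
        total_cost += loss(2.0 * x, y_pred) + rng.normal(scale=0.1)
        x = 0.5 * x + y_pred               # the state evolves as a function of the action
    return total_cost

f = lambda x: 1.5 * x                      # some fixed predictor / policy
print(supervised_round(f), bandit_round(f), rl_episode(f))
```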
Should I Use Deep RL On My Practical Problem?
Might be overkill
Other methods worth investigating first
Derivative-free optimization (simulated annealing, cross
entropy method, SPSA)
Is it a contextual bandit problem?
Non-deep RL methods developed by Operations
Research community3
3 W. B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. Vol. 703. John Wiley & Sons, 2007.
Recent Success Stories in Deep RL
ATARI using deep Q-learning4, policy gradients5, DAGGER6
Superhuman Go using supervised learning + policy
gradients + Monte Carlo tree search + value functions7
Robotic manipulation using guided policy search8
Robotic locomotion using policy gradients9
3D games using policy gradients10
4 V. Mnih et al. “Playing Atari with Deep Reinforcement Learning”. In: arXiv preprint arXiv:1312.5602 (2013).
5 J. Schulman et al. “Trust Region Policy Optimization”. In: arXiv preprint arXiv:1502.05477 (2015).
6 X. Guo et al. “Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning”. In: Advances in Neural Information Processing Systems. 2014, pp. 3338–3346.
7 D. Silver et al. “Mastering the game of Go with deep neural networks and tree search”. In: Nature 529.7587 (2016), pp. 484–489.
8 S. Levine et al. “End-to-end training of deep visuomotor policies”. In: arXiv preprint arXiv:1504.00702 (2015).
9 J. Schulman et al. “High-dimensional continuous control using generalized advantage estimation”. In: arXiv preprint arXiv:1506.02438 (2015).
10 V. Mnih et al. “Asynchronous Methods for Deep Reinforcement Learning”. In: arXiv preprint arXiv:1602.01783 (2016).
Markov Decision Processes
Definition
Markov Decision Process (MDP) defined by (S, A, P),
where
S: state space
A: action space
P(r, s′ | s, a): transition + reward probability distribution
Extra objects defined depending on problem setting
µ: Initial state distribution
Optimization problem: maximize expected cumulative
reward
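As a concrete illustration of the objects (S, A, P), here is a tiny tabular MDP sketch in numpy; the number of states and actions, the transition probabilities, and the rewards are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny tabular MDP (hypothetical numbers): 2 states, 2 actions.
# P[s, a, s'] = probability of landing in s' after taking action a in state s.
# R[s, a]     = reward for taking action a in state s (deterministic here for simplicity).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[0.0, 1.0],
              [0.5, 2.0]])
mu = np.array([1.0, 0.0])                 # initial state distribution

def step(s, a):
    """Sample (r, s') ~ P(r, s' | s, a); the reward is a deterministic function of (s, a)."""
    s_next = rng.choice(2, p=P[s, a])
    return R[s, a], s_next

s = rng.choice(2, p=mu)                   # s0 ~ mu
r, s = step(s, a=1)                       # one environment transition
print(r, s)
```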
Episodic Setting
In each episode, the initial state is sampled from µ, and
the agent acts until the terminal state is reached. For
example:
Taxi robot reaches its destination (termination = good)
Waiter robot finishes a shift (fixed time)
Walking robot falls over (termination = bad)
Goal: maximize expected reward per episode
Policies
Deterministic policies: a = π(s)
Stochastic policies: a ∼ π(a | s)
Episodic Setting
s0 ∼ µ(s0)
a0 ∼ π(a0 | s0)
s1, r0 ∼ P(s1, r0 | s0, a0)
a1 ∼ π(a1 | s1)
s2, r1 ∼ P(s2, r1 | s1, a1)
. . .
aT−1 ∼ π(aT−1 | sT−1)
sT , rT−1 ∼ P(sT , rT−1 | sT−1, aT−1)
Objective:
maximize η(π), where
η(π) = E[r0 + r1 + · · · + rT−1 | π]
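A minimal rollout sketch of this sampling process, assuming a generic env_reset / env_step / policy interface (hypothetical names, not a real library API), with η(π) estimated by averaging episode returns. The example environment and policy at the bottom are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(env_reset, env_step, policy, T_max=1000):
    """Sample one episode: s0 ~ mu, a_t ~ pi(. | s_t), (s_{t+1}, r_t) ~ P(. | s_t, a_t)."""
    s = env_reset()                          # s0 ~ mu
    rewards = []
    for _ in range(T_max):
        a = policy(s)                        # a_t ~ pi(a | s_t)
        s, r, done = env_step(s, a)          # s_{t+1}, r_t ~ P(. | s_t, a_t)
        rewards.append(r)
        if done:
            break
    return sum(rewards)                      # episode return r_0 + ... + r_{T-1}

def estimate_eta(env_reset, env_step, policy, n_episodes=100):
    """Monte-Carlo estimate of eta(pi) = E[r_0 + ... + r_{T-1} | pi]."""
    return np.mean([rollout(env_reset, env_step, policy) for _ in range(n_episodes)])

# Tiny example environment (hypothetical): noisy walk that terminates when |s| > 3.
env_reset = lambda: 0.0
def env_step(s, a):
    s_next = s + a + rng.normal(scale=0.1)
    return s_next, -abs(s_next), abs(s_next) > 3.0   # reward favors staying near the origin
policy = lambda s: -0.5 * s + rng.normal(scale=0.1)  # a stochastic policy a ~ pi(a | s)

print(estimate_eta(env_reset, env_step, policy))
```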
Episodic Setting
[Diagram: episode rollout. Starting from s0 ∼ µ, the agent's policy π produces actions a0, . . . , aT−1, and the environment dynamics P produce states s1, . . . , sT and rewards r0, . . . , rT−1.]
Objective:
maximize η(π), where
η(π) = E[r0 + r1 + · · · + rT−1 | π]
Parameterized Policies
A family of policies indexed by parameter vector θ ∈ Rd
Deterministic: a = π(s, θ)
Stochastic: π(a | s, θ)
Analogous to classification or regression with input s,
output a.
Discrete action space: network outputs vector of
probabilities
Continuous action space: network outputs mean and
diagonal covariance of Gaussian
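A rough sketch of the two parameterizations, using a single linear layer as a stand-in for a deeper network; the feature vector, shapes, and parameter layout are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Discrete action space: the network outputs a vector of action probabilities.
def discrete_policy(s, theta):
    """theta = (W, b); a ~ Categorical(softmax(W s + b))."""
    W, b = theta
    probs = softmax(W @ s + b)
    return rng.choice(len(probs), p=probs)

# Continuous action space: the network outputs the mean and diagonal log-std of a Gaussian.
def gaussian_policy(s, theta):
    """theta = (W, b, log_std); a ~ N(W s + b, diag(exp(log_std))^2)."""
    W, b, log_std = theta
    mean = W @ s + b
    return mean + np.exp(log_std) * rng.normal(size=mean.shape)

# Example usage with a single linear layer (a stand-in for a deeper network):
s = np.array([0.5, -1.0, 2.0])
theta_d = (rng.normal(size=(4, 3)), np.zeros(4))               # 4 discrete actions
theta_c = (rng.normal(size=(2, 3)), np.zeros(2), np.zeros(2))  # 2-D continuous action
print(discrete_policy(s, theta_d), gaussian_policy(s, theta_c))
```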
Black-Box Optimization Methods
Derivative Free Optimization Approach
Objective:
maximize E[R | π(·, θ)]
View the mapping from policy parameters θ to episode return
R as a black box
Ignore all information other than R collected during the
episode
Cross-Entropy Method
Evolutionary algorithm
Works embarrassingly well
I. Szita and A. Lörincz. “Learning Tetris using the noisy cross-entropy method”. In: Neural Computation 18.12 (2006), pp. 2936–2941
V. Gabillon, M. Ghavamzadeh, and B. Scherrer. “Approximate Dynamic Programming Finally Performs Well in the Game of Tetris”. In: Advances in Neural Information Processing Systems. 2013
Cross-Entropy Method
Evolutionary algorithm
Works embarrassingly well
A similar algorithm, Covariance Matrix Adaptation, has
become standard in graphics.
Cross-Entropy Method
Initialize µ ∈ Rd, σ ∈ Rd
for iteration = 1, 2, . . . do
    Collect n samples θi ∼ N(µ, diag(σ))
    Perform a noisy evaluation of each θi, obtaining a return Ri
    Select the top p% of samples (e.g. p = 20), which we’ll
    call the elite set
    Fit a Gaussian distribution, with diagonal covariance,
    to the elite set, obtaining a new µ, σ.
end for
Return the final µ.
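A minimal numpy implementation of the pseudocode above. The quadratic test objective, noise scale, and all hyperparameter values are arbitrary choices for illustration, not part of the original algorithm specification.

```python
import numpy as np

def cross_entropy_method(evaluate, dim, n_samples=50, elite_frac=0.2,
                         n_iters=100, init_std=1.0, seed=0):
    """Minimal cross-entropy method over parameters theta in R^dim.

    `evaluate(theta)` should return a (possibly noisy) episode return R.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.full(dim, init_std)
    n_elite = max(1, int(round(elite_frac * n_samples)))
    for _ in range(n_iters):
        thetas = mu + sigma * rng.normal(size=(n_samples, dim))  # theta_i ~ N(mu, diag(sigma^2))
        returns = np.array([evaluate(th) for th in thetas])       # noisy evaluations R_i
        elite = thetas[np.argsort(returns)[-n_elite:]]            # top p% of samples ("elite set")
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3  # refit diagonal Gaussian
    return mu

# Hypothetical sanity check: maximize a noisy quadratic with optimum at theta* = (1, -2, 3).
target = np.array([1.0, -2.0, 3.0])
f = lambda th: -np.sum((th - target) ** 2) + np.random.normal(scale=0.1)
print(cross_entropy_method(f, dim=3))
```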
Cross-Entropy Method
Analysis: a very similar algorithm is a
minorization-maximization (MM) algorithm, guaranteed
to monotonically increase expected reward
Recall that the Monte-Carlo EM algorithm collects samples,
reweights them, and then maximizes their logprob
We can derive an MM algorithm where at each iteration you
maximize ∑i Ri log p(θi)
Policy Gradient Methods
Policy Gradient Methods: Overview
Problem:
maximize E[R | πθ]
Intuitions: collect a bunch of trajectories, and ...
1. Make the good trajectories more probable
2. Make the good actions more probable
3. Push the actions towards good actions (DPG11, SVG12)
11 D. Silver et al. “Deterministic policy gradient algorithms”. In: ICML. 2014.
12 N. Heess et al. “Learning continuous control policies by stochastic value gradients”. In: Advances in Neural Information Processing Systems. 2015, pp. 2926–2934.
Score Function Gradient Estimator
Consider an expectation Ex∼p(x | θ)[f(x)]. We want to compute the
gradient with respect to θ:
∇θ Ex [f(x)] = ∇θ ∫ dx p(x | θ) f(x)
             = ∫ dx ∇θ p(x | θ) f(x)
             = ∫ dx p(x | θ) (∇θ p(x | θ) / p(x | θ)) f(x)
             = ∫ dx p(x | θ) ∇θ log p(x | θ) f(x)
             = Ex [f(x) ∇θ log p(x | θ)].
The last expression gives us an unbiased gradient estimator. Just
sample xi ∼ p(x | θ), and compute ĝi = f(xi) ∇θ log p(xi | θ).
Need to be able to compute and differentiate the density p(x | θ)
with respect to θ
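A quick numerical sanity check of the estimator, for the illustrative case p(x | θ) = N(θ, 1) and f(x) = x², where E[f(x)] = θ² + 1 and the true gradient is 2θ; the score is ∇θ log p(x | θ) = x − θ.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 1.5
x = rng.normal(loc=theta, scale=1.0, size=200_000)  # x_i ~ p(x | theta) = N(theta, 1)

f = x ** 2                                          # f(x_i)
score = x - theta                                   # d/dtheta log N(x_i | theta, 1)
g_hat = f * score                                   # per-sample estimates g_i = f(x_i) * score_i

print(g_hat.mean(), 2 * theta)                      # sample-mean estimate vs. analytic gradient
```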
Score Function Gradient Estimator: Intuition
ĝi = f(xi) ∇θ log p(xi | θ)
Let’s say that f(x) measures how good the sample x is.
Moving in the direction ĝi pushes up the logprob of the
sample, in proportion to how good it is
Valid even if f(x) is discontinuous or unknown, or the
sample space (containing x) is a discrete set
Score Function Gradient Estimator for Policies
Now the random variable x is a whole trajectory
τ = (s0, a0, r0, s1, a1, r1, . . . , sT−1, aT−1, rT−1, sT )
∇θ Eτ [R(τ)] = Eτ [∇θ log p(τ | θ) R(τ)]
Just need to write out p(τ | θ):
p(τ | θ) = µ(s0) ∏_{t=0}^{T−1} [π(at | st, θ) P(st+1, rt | st, at)]
log p(τ | θ) = log µ(s0) + ∑_{t=0}^{T−1} [log π(at | st, θ) + log P(st+1, rt | st, at)]
∇θ log p(τ | θ) = ∇θ ∑_{t=0}^{T−1} log π(at | st, θ)
∇θ Eτ [R] = Eτ [R ∇θ ∑_{t=0}^{T−1} log π(at | st, θ)]
Interpretation: use good trajectories (high R) as supervised
examples in classification / regression
Policy Gradient: Use Temporal Structure
Previous slide:
∇θ Eτ [R] = Eτ [(∑_{t=0}^{T−1} rt)(∑_{t=0}^{T−1} ∇θ log π(at | st, θ))]
We can repeat the same argument to derive the gradient
estimator for a single reward term rt′:
∇θ E[rt′] = E[rt′ ∑_{t=0}^{t′} ∇θ log π(at | st, θ)]
Summing this formula over t′, we obtain
∇θ E[R] = E[∑_{t′=0}^{T−1} rt′ ∑_{t=0}^{t′} ∇θ log π(at | st, θ)]
        = E[∑_{t=0}^{T−1} ∇θ log π(at | st, θ) ∑_{t′=t}^{T−1} rt′]
Policy Gradient: Introduce Baseline
Further reduce variance by introducing a baseline b(s):
∇θ Eτ [R] = Eτ [∑_{t=0}^{T−1} ∇θ log π(at | st, θ) (∑_{t′=t}^{T−1} rt′ − b(st))]
For any choice of b, the gradient estimator is unbiased.
A near-optimal choice is the expected return,
b(st) ≈ E[rt + rt+1 + rt+2 + · · · + rT−1]
Interpretation: increase the logprob of action at proportionally
to how much the returns ∑_{t′=t}^{T−1} rt′ are better than expected
Discounts for Variance Reduction
Introduce a discount factor γ, which ignores delayed effects
between actions and rewards:
∇θ Eτ [R] ≈ Eτ [∑_{t=0}^{T−1} ∇θ log π(at | st, θ) (∑_{t′=t}^{T−1} γ^{t′−t} rt′ − b(st))]
Now we want
b(st) ≈ E[rt + γ rt+1 + γ² rt+2 + · · · + γ^{T−1−t} rT−1]
Write the gradient estimator more generally as
∇θ Eτ [R] ≈ Eτ [∑_{t=0}^{T−1} ∇θ log π(at | st, θ) Ât]
where Ât is the advantage estimate
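A small helper sketch for the two quantities above, discounted returns and advantage estimates; the reward and baseline numbers in the usage example are arbitrary, and the baseline values are assumed to be given rather than fitted.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """R_t = sum over t' >= t of gamma^(t'-t) * r_{t'}, for one trajectory."""
    R, running = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        R[t] = running
    return R

def advantages(rewards, baselines, gamma):
    """A_hat_t = R_t - b(s_t), with b(s_t) supplied as an array of baseline values."""
    return discounted_returns(rewards, gamma) - np.asarray(baselines)

# Example: rewards from one episode and a (hypothetical) baseline prediction per state.
r = [0.0, 0.0, 1.0, -0.5]
b = [0.2, 0.3, 0.4, 0.1]
print(advantages(r, b, gamma=0.99))
```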
“Vanilla” Policy Gradient Algorithm
Initialize policy parameter θ, baseline b
for iteration = 1, 2, . . . do
    Collect a set of trajectories by executing the current policy
    At each timestep in each trajectory, compute
        the return Rt = ∑_{t′=t}^{T−1} γ^{t′−t} rt′, and
        the advantage estimate Ât = Rt − b(st).
    Re-fit the baseline, by minimizing ‖b(st) − Rt‖², summed over
    all trajectories and timesteps.
    Update the policy, using a policy gradient estimate ĝ,
    which is a sum of terms ∇θ log π(at | st, θ) Ât
end for
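A compact, simplified sketch of the loop above, using a log-linear softmax policy on a made-up one-dimensional environment; as a simplification, the baseline is the per-timestep average return across the batch rather than a fitted b(s). Environment, features, horizon, and hyperparameters are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical toy environment: scalar position s, actions {0: step left, 1: step right},
# reward 1 for every timestep spent to the right of the origin, fixed horizon T = 20.
def run_episode(W, T=20):
    s, states, actions, rewards = 0.0, [], [], []
    for _ in range(T):
        feats = np.array([s, 1.0])               # simple state features (position, bias)
        a = rng.choice(2, p=softmax(W @ feats))  # a ~ pi(a | s, theta) with theta = W
        states.append(feats)
        actions.append(a)
        s += 1.0 if a == 1 else -1.0
        rewards.append(1.0 if s > 0 else 0.0)
    return states, actions, rewards

def discounted_returns(rewards, gamma=0.99):
    R, running = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        R[t] = running
    return R

W, lr = np.zeros((2, 2)), 0.05
for iteration in range(200):
    trajs = [run_episode(W) for _ in range(10)]                     # collect a batch of trajectories
    all_R = np.array([discounted_returns(r) for _, _, r in trajs])  # returns R_t, shape (n_traj, T)
    baseline = all_R.mean(axis=0)                                   # simple per-timestep baseline
    grad = np.zeros_like(W)
    for (states, actions, _), R in zip(trajs, all_R):
        A_hat = R - baseline                                        # advantage estimates A_hat_t
        for feats, a, adv in zip(states, actions, A_hat):
            probs = softmax(W @ feats)
            grad += np.outer(np.eye(2)[a] - probs, feats) * adv     # grad_W log pi(a|s_t) * A_hat_t
    W += lr * grad / len(trajs)                                     # gradient ascent on expected return

avg_return = np.mean([sum(run_episode(W)[2]) for _ in range(20)])
print("average return after training:", avg_return)                # ideally approaches 20 ("always right")
```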
Extension: Step Sizes and Trust Regions
Why are step sizes a big deal in RL?
Supervised learning
Step too far → next update will fix it
Reinforcement learning
Step too far → bad policy
Next batch: collected under bad policy
Can’t recover, collapse in performance!
Extension: Step Sizes and Trust Regions
Trust Region Policy Optimization: limit the KL divergence
between the action distributions of the pre-update and
post-update policies13
Es [DKL(πold(· | s) ‖ π(· | s))] ≤ δ
Closely related to previous natural policy gradient
methods14
13 J. Schulman et al. “Trust Region Policy Optimization”. In: arXiv preprint arXiv:1502.05477 (2015).
14 S. Kakade. “A Natural Policy Gradient.” In: NIPS. vol. 14. 2001, pp. 1531–1538; J. A. Bagnell and J. Schneider. “Covariant policy search”. In: IJCAI. 2003; J. Peters and S. Schaal. “Natural actor-critic”. In: Neurocomputing 71.7 (2008), pp. 1180–1190.
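A small sketch of checking the trust-region constraint above for discrete action distributions, with the expectation over states approximated by a batch average; the probability tables and δ below are made up, and a small epsilon is added for numerical safety.

```python
import numpy as np

def kl_categorical(p, q, eps=1e-12):
    """Row-wise D_KL(p || q) for categorical action distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

# Hypothetical action probabilities of the pre-update and post-update policy
# at two sampled states, and a hypothetical trust-region size delta.
pi_old = np.array([[0.70, 0.20, 0.10],
                   [0.30, 0.40, 0.30]])
pi_new = np.array([[0.60, 0.25, 0.15],
                   [0.35, 0.35, 0.30]])
delta = 0.01

mean_kl = kl_categorical(pi_old, pi_new).mean()   # approximates E_s[D_KL(pi_old || pi_new)]
print(mean_kl, mean_kl <= delta)
```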
Extension: Further Variance Reduction
Use value functions for more variance reduction (at the
cost of bias): actor-critic methods15
Reparameterization trick: instead of increasing the
probability of the good actions, push the actions towards
(hopefully) better actions16
15 J. Schulman et al. “High-dimensional continuous control using generalized advantage estimation”. In: arXiv preprint arXiv:1506.02438 (2015); V. Mnih et al. “Asynchronous Methods for Deep Reinforcement Learning”. In: arXiv preprint arXiv:1602.01783 (2016).
16 D. Silver et al. “Deterministic policy gradient algorithms”. In: ICML. 2014; N. Heess et al. “Learning continuous control policies by stochastic value gradients”. In: Advances in Neural Information Processing Systems. 2015, pp. 2926–2934.
Demo
Fin
Thank you. Questions?