Lecture	2:	Sampling-based	Approximations
And
Function	Fitting
Yan	(Rocky)	Duan
Berkeley	AI	Research	Lab
Many	slides	made	with	John	Schulman,	Xi	(Peter)	Chen	and	Pieter	Abbeel
n Optimal	Control
=	
given	an	MDP	(S, A, P, R, γ, H)
find	the	optimal	policy	π*
Quick	One-Slide	Recap
n Exact	Methods:
n Value	Iteration
n Policy	Iteration
Limitations:	
• Update	equations	require	access	to	dynamics	
model
• Iteration	over	/	Storage	for	all	states	and	actions:	
requires	small,	discrete	state-action	space
->	sampling-based	approximations
->	Q/V	function	fitting
n Q	Value	Iteration
n Value	Iteration?
n Policy	Iteration
n Policy	Evaluation
n Policy	Improvement?
Sampling-Based	Approximation
Recap	Q-Values
Q*(s, a) = expected utility starting in s, taking action a, and (thereafter)
acting optimally
Bellman Equation:
Q^*(s,a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \right]
Q-Value Iteration:
Q_{k+1}(s,a) \leftarrow \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \max_{a'} Q_k(s',a') \right]
n Q-value	iteration:
n Rewrite	as	expectation:	
n (Tabular)	Q-Learning:	replace	expectation	by	samples
n For a state-action pair (s, a), receive:
n Consider	your	old	estimate:
n Consider	your	new	sample	estimate:
n Incorporate	the	new	estimate	into	a	running	average:
(Tabular)	Q-Learning
Q_{k+1}(s,a) \leftarrow \mathbb{E}_{s' \sim P(s'|s,a)} \left[ R(s,a,s') + \gamma \max_{a'} Q_k(s',a') \right]

Sample: s' \sim P(s'|s,a)
Old estimate: Q_k(s,a)
New sample estimate: \mathrm{target}(s') = R(s,a,s') + \gamma \max_{a'} Q_k(s',a')
Running average: Q_{k+1}(s,a) \leftarrow (1-\alpha)\, Q_k(s,a) + \alpha \, \mathrm{target}(s')
(Tabular)	Q-Learning
Algorithm:
Start with Q_0(s, a) for all s, a.
Get initial state s
For k = 1, 2, … till convergence
  Sample action a, get next state s'
  If s' is terminal:
    target = R(s, a, s')
    Sample new initial state s'
  else:
    target = R(s, a, s') + \gamma \max_{a'} Q_k(s', a')
  Q_{k+1}(s, a) \leftarrow (1 - \alpha)\, Q_k(s, a) + \alpha \, [target]
  s \leftarrow s'
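To make the loop concrete, here is a minimal Python sketch of tabular Q-learning. The environment interface (reset() and step(a) returning (s', r, done)) and all names are illustrative assumptions, not something defined on these slides.

import numpy as np
import random

def tabular_q_learning(env, n_states, n_actions, gamma=0.99, alpha=0.5,
                       epsilon=0.1, num_steps=100_000):
    Q = np.zeros((n_states, n_actions))    # Q_0(s, a) = 0 for all s, a
    s = env.reset()                        # get initial state s
    for _ in range(num_steps):
        # epsilon-greedy action selection (see the next slide)
        if random.random() < epsilon:
            a = random.randrange(n_actions)
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)      # sample next state s'
        if done:
            target = r                     # terminal: no bootstrap term
        else:
            target = r + gamma * np.max(Q[s_next])
        # incorporate the sample into a running average
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
        s = env.reset() if done else s_next
    return Q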
n Choose random actions?
n Choose the action that maximizes Q_k(s, a) (i.e., act greedily)?
n ε-Greedy: choose a random action with prob. ε, otherwise choose the action greedily
How to sample actions?
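As a small illustration, an ε-greedy action sampler over a tabular Q (a sketch; names are illustrative):

import random
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1):
    # With probability epsilon explore; otherwise exploit the current Q estimates.
    n_actions = Q.shape[1]
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return int(np.argmax(Q[s]))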
n Amazing	result:	Q-learning	converges	to	optimal	policy	--
even	if	you’re	acting	suboptimally!
n This	is	called	off-policy	learning
n Caveats:
n You	have	to	explore	enough
n You	have	to	eventually	make	the	learning	rate
small	enough
n …	but	not	decrease	it	too	quickly
Q-Learning	Properties
n Technical	requirements.	
n All	states	and	actions	are	visited	infinitely	often
n Basically,	in	the	limit,	it	doesn’t	matter	how	you	select	actions	(!)
n Learning	rate	schedule	such	that	for	all	state	and	action	
pairs	(s,a):
Q-Learning	Properties
For details, see Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the convergence of stochastic iterative
dynamic programming algorithms. Neural Computation, 6(6), November 1994.
\sum_{t=0}^{\infty} \alpha_t(s, a) = \infty \qquad \sum_{t=0}^{\infty} \alpha_t^2(s, a) < \infty
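As a concrete example (not from the slides), the schedule α_t(s,a) = 1/N_t(s,a), where N_t(s,a) counts how often (s,a) has been visited so far, satisfies both conditions: the harmonic series Σ_k 1/k diverges while Σ_k 1/k² converges.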
Q-Learning	Demo:	Gridworld
• States:	11	cells
• Actions:	{up,	down,	left,	right}
• Deterministic	transition	function
• Learning	rate:	0.5
• Discount:	1
• Reward:	+1	for	getting	diamond,	-1	for	falling	into	trap
Q-Learning	Demo:	Crawler
• States:	discretized	value	of	2d	state:	(arm	angle,	hand	angle)
• Actions:	Cartesian	product	of	{arm	up,	arm	down}	and	{hand	up,	hand	down}
• Reward:	speed	in	the	forward	direction
Sampling-Based	Approximation
n Q Value Iteration → (Tabular) Q-learning
n Value	Iteration?
n Policy	Iteration
n Policy	Evaluation
n Policy	Improvement?
n Value	Iteration
n unclear how to draw samples through the max…
Value	Iteration	w/	Samples?
V^*_{i+1}(s) \leftarrow \max_a \mathbb{E}_{s' \sim P(s'|s,a)} \left[ R(s,a,s') + \gamma V^*_i(s') \right]
n Q Value Iteration → (Tabular) Q-learning
n Value	Iteration?
n Policy	Iteration
n Policy	Evaluation
n Policy	Improvement?
Sampling-Based	Approximation
Recap:	Policy	Iteration
One	iteration	of	policy	iteration:
n Policy evaluation for current policy \pi_k:
n Iterate until convergence
V^{\pi_k}_{i+1}(s) \leftarrow \mathbb{E}_{s' \sim P(s'|s,\pi_k(s))} \left[ R(s, \pi_k(s), s') + \gamma V^{\pi_k}_i(s') \right]
Can be approximated by samples
This is called Temporal Difference (TD) Learning
n Policy improvement: find the best action according to one-step look-ahead
\pi_{k+1}(s) \leftarrow \arg\max_a \mathbb{E}_{s' \sim P(s'|s,a)} \left[ R(s, a, s') + \gamma V^{\pi_k}(s') \right]
Unclear what to do with the max (for now)
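A minimal Python sketch of sample-based policy evaluation (tabular TD(0)), under the same illustrative environment interface assumed in the Q-learning sketch above:

import numpy as np

def td0_policy_evaluation(env, policy, n_states, gamma=0.99, alpha=0.1,
                          num_steps=100_000):
    # policy maps state -> action (e.g. an array or dict); V estimates V^pi
    V = np.zeros(n_states)
    s = env.reset()
    for _ in range(num_steps):
        a = policy[s]
        s_next, r, done = env.step(a)
        target = r if done else r + gamma * V[s_next]
        # temporal-difference update: move V(s) toward the sampled one-step target
        V[s] = (1 - alpha) * V[s] + alpha * target
        s = env.reset() if done else s_next
    return V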
n Q Value Iteration → (Tabular) Q-learning
n Value	Iteration?
n Policy	Iteration
n Policy Evaluation → (Tabular) TD-learning
n Policy	Improvement	(for	now)
Sampling-Based	Approximation
n Optimal	Control
=	
given	an	MDP	(S, A, P, R, γ, H)
find	the	optimal	policy	π*
Quick	One-Slide	Recap
n Exact	Methods:
n Value	Iteration
n Policy	Iteration
Limitations:	
• Update	equations	require	access	to	dynamics	
model
• Iteration	over	/	Storage	for	all	states	and	actions:	
requires	small,	discrete	state-action	space
->	sampling-based	approximations
->	Q/V	function	fitting
n Discrete environments
Can tabular methods scale?
Tetris: ~10^60 states
Atari: ~10^308 states (RAM), ~10^16992 states (pixels)
Gridworld: ~10^1 states
n Continuous environments (by crude discretization)
Crawler: ~10^2 states
Hopper: ~10^10 states
Humanoid: ~10^100 states
Can tabular methods scale?
Generalizing	Across	States
n Basic	Q-Learning	keeps	a	table	of	all	q-values
n In	realistic	situations,	we	cannot	possibly	learn	
about	every	single	state!
n Too	many	states	to	visit	them	all	in	training
n Too	many	states	to	hold	the	q-tables	in	memory
n Instead,	we	want	to	generalize:
n Learn	about	some	small	number	of	training	states	from	
experience
n Generalize	that	experience	to	new,	similar	situations
n This	is	a	fundamental	idea	in	machine	learning,	and	
we’ll	see	it	over	and	over	again
n Instead	of	a	table,	we	have	a	parametrized	Q	function:
n Can	be	a	linear	function	in	features:	
n Or	a	complicated	neural	net
n Learning	rule:
n Remember:	
n Update:
Approximate	Q-Learning
Q_\theta(s, a)
Q_\theta(s, a) = \theta_0 f_0(s, a) + \theta_1 f_1(s, a) + \cdots + \theta_n f_n(s, a)
\mathrm{target}(s') = R(s, a, s') + \gamma \max_{a'} Q_{\theta_k}(s', a')
\theta_{k+1} \leftarrow \theta_k - \alpha \nabla_\theta \left[ \tfrac{1}{2} \left( Q_\theta(s, a) - \mathrm{target}(s') \right)^2 \right] \Big|_{\theta = \theta_k}
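A sketch of one such gradient step for a linear Q-function; the feature map f(s, a) and the other names are illustrative assumptions:

import numpy as np

def approx_q_step(theta, f, s, a, r, s_next, actions, gamma=0.99, alpha=0.01,
                  done=False):
    # Q_theta(s, a) = theta . f(s, a), with f(s, a) a feature vector
    q_sa = theta @ f(s, a)
    if done:
        target = r
    else:
        target = r + gamma * max(theta @ f(s_next, a2) for a2 in actions)
    # gradient of 1/2 (Q_theta(s,a) - target)^2 w.r.t. theta is (Q_theta(s,a) - target) * f(s,a)
    grad = (q_sa - target) * f(s, a)
    return theta - alpha * grad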
Connection	to	Tabular	Q-Learning
n Suppose \theta \in \mathbb{R}^{|S| \times |A|} and Q_\theta(s, a) \equiv \theta_{sa}
n Plug into update:
\nabla_{\theta_{sa}} \left[ \tfrac{1}{2} \left( Q_\theta(s, a) - \mathrm{target}(s') \right)^2 \right] = \nabla_{\theta_{sa}} \left[ \tfrac{1}{2} \left( \theta_{sa} - \mathrm{target}(s') \right)^2 \right] = \theta_{sa} - \mathrm{target}(s')
\theta_{sa} \leftarrow \theta_{sa} - \alpha \left( \theta_{sa} - \mathrm{target}(s') \right) = (1 - \alpha)\, \theta_{sa} + \alpha \left[ \mathrm{target}(s') \right]
n Compare with the Tabular Q-Learning update:
Q_{k+1}(s, a) \leftarrow (1 - \alpha)\, Q_k(s, a) + \alpha \left[ \mathrm{target}(s') \right]
n state: naïve board configuration + shape of the falling piece: ~10^60 states!
n action:	rotation	and	translation	applied	to	the	falling	piece
n 22	features	aka	basis	functions	
n Ten	basis	functions,	0,	.	.	.	,	9,	mapping	the	state	to	the	height	h[k]	of	each	column.
n Nine	basis	functions,	10,	.	.	.	,	18,	each	mapping	the	state	to	the	absolute	difference	
between	heights	of	successive	columns:	|h[k+1]	−	h[k]|,	k	=	1,	.	.	.	,	9.
n One	basis	function,	19,	that	maps	state	to	the	maximum	column	height:	maxk h[k]
n One	basis	function,	20,	that	maps	state	to	the	number	of	‘holes’	in	the	board.
n One	basis	function,	21,	that	is	equal	to	1	in	every	state.
[Bertsekas &	Ioffe,	1996	(TD);	Bertsekas &	Tsitsiklis 1996	(TD);	Kakade 2002	(policy	gradient);	Farias &	Van	Roy,	2006	(approximate	LP)]
\hat{V}(s) = \sum_{i=0}^{21} \theta_i \phi_i(s) = \theta^\top \phi(s)
Engineered	Approximation	Example:	Tetris
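A sketch of this 22-dimensional feature vector, assuming the board has already been summarized into its 10 column heights and a hole count (helper names are illustrative):

import numpy as np

def tetris_features(column_heights, num_holes):
    # column_heights: heights h[0..9] of the 10 columns; num_holes: holes in the board
    h = list(column_heights)
    feats = []
    feats += h                                           # features 0..9: column heights
    feats += [abs(h[k + 1] - h[k]) for k in range(9)]    # features 10..18: |h[k+1] - h[k]|
    feats += [max(h)]                                    # feature 19: maximum column height
    feats += [num_holes]                                 # feature 20: number of holes
    feats += [1.0]                                       # feature 21: constant
    return np.array(feats)

# Value estimate: V_hat(s) = theta . phi(s), e.g. theta @ tetris_features(heights, holes)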
Deep	Reinforcement	Learning
Pong Enduro Beamrider Q*bert
• From	pixels	to	actions
• Same	algorithm	(with	effective	tricks)
• CNN	function	approximator,	w/	3M	free	parameters
n We have now covered enough material for Lab 1.
n Will	be	released	on	Piazza	by	this	afternoon.
n Covers	value	iteration,	policy	iteration,	and	tabular	Q-learning.
Lab	1
n The	bad:	it	is	not	guaranteed	to	converge…
n Even	if	the	function	approximation	is	expressive	enough	to	
represent	the	true	Q	function
Convergence	of	Approximate	Q-Learning
Two states x1 and x2, all rewards r = 0. Function approximator: V_\theta = [1\;\;2]\,\theta, i.e. V_\theta(x1) = \theta and V_\theta(x2) = 2\theta.
Simple	Example**
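A tiny numerical sketch of the divergence; it assumes that both states transition to x2 with reward 0 and that each iteration does a Bellman backup followed by a least-squares fit of θ (the exact transition structure on the slide may differ):

import numpy as np

phi = np.array([1.0, 2.0])    # features: V_theta(x1) = theta, V_theta(x2) = 2*theta
gamma, theta = 0.9, 1.0       # any nonzero initialization of theta

for k in range(10):
    # Bellman backup: both backed-up values equal gamma * V_theta(x2) = 2*gamma*theta
    targets = gamma * 2 * theta * np.ones(2)
    # least-squares fit of theta to the backed-up values: theta <- 1.2 * gamma * theta
    theta = (phi @ targets) / (phi @ phi)
    print(k, theta)
# The true values are 0, yet |theta| grows without bound whenever gamma > 5/6.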
n Definition. An operator G is a non-expansion with respect to a norm || . || if ||GQ − GQ'|| ≤ ||Q − Q'|| for all Q, Q'.
n Fact. If the operator F is a γ-contraction with respect to a norm || . || and the operator G is a non-expansion with respect to the same norm, then the sequential application of the operators G and F is a γ-contraction, i.e., ||GFQ − GFQ'|| ≤ γ ||Q − Q'||.
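The composition fact follows in one line, applying the non-expansion of G first and the contraction of F second: ||GFQ − GFQ'|| ≤ ||FQ − FQ'|| ≤ γ ||Q − Q'||.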
n Corollary. If	the	supervised	learning	step	is	a	non-expansion,	then	iteration	in	
value	iteration	with	function	approximation	is	a	γ-contraction,	and	in	this	case	
we	have	a	convergence	guarantee.
Composing	Operators**
n Examples:	
n nearest	neighbor	(aka	state	aggregation)
n linear	interpolation	over	triangles	
(tetrahedrons,	…)
Averager Function	Approximators Are	Non-Expansions**
Averager Function	Approximators Are	Non-Expansions**
Example	taken	from	Gordon,	1995
Linear Regression ☹ (not a non-expansion) **
n I.e.,	if	we	pick	a	non-expansion	function	approximator which	can	approximate	
J*	well,	then	we	obtain	a	good	value	function	estimate.
n To	apply	to	discretization:	use	continuity	assumptions	to	show	that	J*	can	be	
approximated	well	by	chosen	discretization	scheme.
Guarantees	for	Fixed	Point**