Zap Q-Learning
Reinforcement Learning: Hidden Theory, and New Super-Fast Algorithms
Center for Systems and Control (CSC@USC)
and Ming Hsieh Institute for Electrical Engineering
February 21, 2018
Adithya M. Devraj Sean P. Meyn
Department of Electrical and Computer Engineering — University of Florida
Zap Q-Learning
Outline
1 Stochastic Approximation
2 Fastest Stochastic Approximation
3 Reinforcement Learning
4 Zap Q-Learning
5 Conclusions & Future Work
6 References
Stochastic Approximation
E[f(θ, W)] |θ=θ∗ = 0
Stochastic Approximation Basic Algorithm
What is Stochastic Approximation?
A simple goal: Find the solution θ∗ to
¯f(θ∗) := E[f(θ, W)] |θ=θ∗ = 0
What makes this hard?
1 The function f and the distribution of the random vector W may not
be known
– we may only know something about the structure of the problem
2 Even if everything is known, computation of the expectation may be
expensive. For root finding, we may need to compute the expectation
for many values of θ
3 Motivates stochastic approximation: θ(n + 1) = θ(n) + αnf(θ(n), W(n))
The recursive algorithms we come up with are often slow, and their
variance may be infinite: typical in Q-learning [Devraj & M 2017]
1 / 31
Stochastic Approximation ODE Method
Algorithm and Convergence Analysis
Algorithm:
θ(n + 1) = θ(n) + αnf(θ(n), W(n))
Goal: ¯f(θ∗) := E[f(θ, W)] |θ=θ∗ = 0
Interpretation: θ∗ ≡ stationary point of the ODE  d/dt ϑ(t) = ¯f(ϑ(t))
Analysis: Stability of the ODE ⊕ (see Borkar’s monograph) =⇒ lim_{n→∞} θ(n) = θ∗
2 / 31
Stochastic Approximation SA Example
Stochastic Approximation Example
Example: Monte-Carlo
Monte-Carlo Estimation
Estimate the mean η = E[c(X)], where X is a random variable: η = ∫ c(x) fX(x) dx
SA interpretation: Find θ∗ solving 0 = E[f(θ, X)] = E[c(X) − θ]
Algorithm: θ(n) = (1/n) Σ_{i=1}^{n} c(X(i))
=⇒ (n + 1)θ(n + 1) = Σ_{i=1}^{n+1} c(X(i)) = nθ(n) + c(X(n + 1))
=⇒ (n + 1)θ(n + 1) = (n + 1)θ(n) + [c(X(n + 1)) − θ(n)]
SA Recursion: θ(n + 1) = θ(n) + αnf(θ(n), X(n + 1)),  with Σ αn = ∞, Σ α²n < ∞
3 / 31
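A minimal numerical sketch of this recursion (an editorial illustration, not from the slides; plain Python with NumPy, and c(x) = x with X uniform on [0, 2] as arbitrary choices):

import numpy as np

# SA form of the Monte-Carlo average: theta(n+1) = theta(n) + alpha_n * (c(X(n+1)) - theta(n))
rng = np.random.default_rng(0)
theta = 0.0
for n in range(10_000):
    X = rng.uniform(0.0, 2.0)      # illustrative choice of the random variable X
    alpha = 1.0 / (n + 1)          # satisfies  sum alpha_n = infinity,  sum alpha_n^2 < infinity
    theta += alpha * (X - theta)   # f(theta, X) = c(X) - theta  with  c(x) = x
print(theta)                       # close to E[c(X)] = 1.0

With αn = 1/(n + 1) the recursion reproduces the running sample average exactly.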
(Figure: SA estimates θ(k) versus iteration k)
Fastest Stochastic Approximation
Fastest Stochastic Approximation Algorithm Performance
Performance Criteria
Two standard approaches to evaluate performance, ˜θ(n) := θ(n) − θ∗:
1 Finite-n bound: P{ ‖˜θ(n)‖ ≥ ε } ≤ exp(−I(ε, n)),  I(ε, n) = O(nε²)
2 Asymptotic covariance: Σ = lim_{n→∞} nE[˜θ(n)˜θ(n)ᵀ],  √n ˜θ(n) ≈ N(0, Σ)
4 / 31
Fastest Stochastic Approximation Algorithm Performance
Asymptotic Covariance
Σ = lim_{n→∞} Σn = lim_{n→∞} nE[˜θ(n)˜θ(n)ᵀ],  √n ˜θ(n) ≈ N(0, Σ)
SA recursion for covariance:
Σn+1 ≈ Σn + (1/n)[(A + ½I)Σn + Σn(A + ½I)ᵀ + Σ∆],  A = d/dθ ¯f(θ∗)
Conclusions
1 If Re λ(A) ≥ −½ for some eigenvalue, then Σ is (typically) infinite
2 If Re λ(A) < −½ for all eigenvalues, then Σ = lim_{n→∞} Σn is the unique solution to the Lyapunov equation:
0 = (A + ½I)Σ + Σ(A + ½I)ᵀ + Σ∆
5 / 31
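As a concrete check of the Lyapunov equation above, here is a small sketch (editorial, not from the slides; Python/NumPy) that solves 0 = (A + ½I)Σ + Σ(A + ½I)ᵀ + Σ∆ by vectorization, assuming every eigenvalue of A has real part < −1/2:

import numpy as np

def asymptotic_covariance(A, Sigma_Delta):
    # Solve (A + I/2) Sigma + Sigma (A + I/2)^T + Sigma_Delta = 0 for Sigma.
    d = A.shape[0]
    M = A + 0.5 * np.eye(d)
    # vec(M X + X M^T) = (kron(I, M) + kron(M, I)) vec(X)
    L = np.kron(np.eye(d), M) + np.kron(M, np.eye(d))
    return np.linalg.solve(L, -Sigma_Delta.reshape(-1)).reshape(d, d)

# Scalar sanity check: A = -g gives Sigma = Sigma_Delta / (2g - 1); here g = 2.
print(asymptotic_covariance(np.array([[-2.0]]), np.array([[1.0]])))   # [[1/3]]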
Fastest Stochastic Approximation Stochastic Newton Raphson
Optimal Asymptotic Covariance
Introduce a d × d matrix gain sequence {Gn}:
θ(n + 1) = θ(n) + (1/(n + 1)) Gn f(θ(n), X(n))
Assume it converges, and linearize:
˜θ(n + 1) ≈ ˜θ(n) + (1/(n + 1)) G [A˜θ(n) + ∆(n + 1)],  A = d/dθ ¯f(θ∗).
If G = G∗ := −A−1 then
Resembles Monte-Carlo estimate
Resembles Newton-Raphson
It is optimal: Σ∗ = G∗Σ∆G∗ᵀ ≤ ΣG for any other G
Polyak-Ruppert averaging is also optimal, but the first two bullets are missing.
6 / 31
Fastest Stochastic Approximation Stochastic Newton Raphson
Optimal Variance
Example: return to Monte-Carlo
θ(n + 1) = θ(n) + (g/(n + 1)) [−θ(n) + X(n + 1)]
∆(n) = X(n) − E[X(n)]
Normalization for analysis:
˜θ(n + 1) = ˜θ(n) + (g/(n + 1)) [−˜θ(n) + ∆(n + 1)]
Example: X(n) = W²(n), W ∼ N(0, 1)
Asymptotic variance as a function of g:  Σ = (σ²∆ / 2) · g² / (g − 1/2)
(Figure: SA estimates of E[W²], W ∼ N(0, 1), for gains g = 0.1, 0.5, 1, 10, 20)
7 / 31
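A small simulation sketch of this scalar example (editorial, not from the slides; Python/NumPy). For g > 1/2 the theory above gives Σ = σ²∆ g²/(2g − 1) with σ²∆ = Var(W²) = 2, minimized at the Newton-Raphson gain g = 1:

import numpy as np

# SA estimates of E[W^2], W ~ N(0,1):  theta(n+1) = theta(n) + (g/(n+1)) * (W(n+1)^2 - theta(n))
rng = np.random.default_rng(1)
N, trials = 20_000, 500
for g in [0.1, 0.5, 1.0, 10.0]:
    theta = np.zeros(trials)                 # one SA run per trial, vectorized across trials
    for n in range(N):
        X = rng.standard_normal(trials) ** 2
        theta += (g / (n + 1)) * (X - theta)
    # For g > 1/2, n * Var(theta(n)) should be near 2 g^2 / (2g - 1);
    # for g <= 1/2 the asymptotic variance is infinite and convergence is slow.
    print(g, N * theta.var())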
Fastest Stochastic Approximation Stochastic Newton Raphson
Optimal Asymptotic Covariance and Zap-SNR
Zap-SNR (designed to emulate deterministic Newton-Raphson)
θ(n + 1) = θ(n) + αn(−Ân)−1 f(θ(n), X(n))
Ân = Ân−1 + γn(An − Ân−1),  An = d/dθ f(θ(n), X(n))
Ân ≈ A(θn) := d/dθ ¯f(θn) requires high gain:  γn/αn → ∞ as n → ∞
Always: αn = 1/n. Numerics that follow: γn = (1/n)ρ, ρ ∈ (0.5, 1)
ODE for Zap-SNR
d/dt xt = (−A(xt))−1 ¯f(xt),  A(x) = d/dx ¯f(x)
Not necessarily stable (just like deterministic Newton-Raphson)
General conditions for convergence are open
8 / 31
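A minimal sketch of the two-time-scale recursion (editorial, not from the slides; Python/NumPy) on an arbitrary scalar problem f(θ, W) = W − θ³ with W ∼ N(2, 1), so ¯f(θ) = 2 − θ³ and θ∗ = 2^(1/3):

import numpy as np

rng = np.random.default_rng(2)
theta, A_hat = 1.0, -1.0                  # A_hat is the (here scalar) estimate of A(theta)
N, rho = 100_000, 0.85
for n in range(1, N + 1):
    W = 2.0 + rng.standard_normal()
    f_sample = W - theta ** 3             # noisy observation of fbar(theta)
    A_sample = -3.0 * theta ** 2          # sample derivative d/dtheta f(theta, W)
    gamma, alpha = n ** (-rho), 1.0 / n   # gamma_n / alpha_n -> infinity: high-gain estimate
    A_hat += gamma * (A_sample - A_hat)   # fast time scale: A_hat tracks A(theta_n)
    theta += alpha * f_sample / (-A_hat)  # slow time scale: Newton-Raphson-like step
print(theta, 2 ** (1 / 3))                # theta is close to theta* = 1.2599...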
Reinforcement Learning
and Stochastic Approximation
Reinforcement Learning RL & SA
SA and RL Design
Functional equations in Stochastic Control
Always of the form
0 = E[F(h∗, Φ(n + 1)) | Φ0 . . . Φ(n)],  h∗ = ?
Φ(n) = (state, action)
Galerkin relaxation:
0 = E[F(hθ∗, Φ(n + 1)) ζn],  θ∗ = ?
Necessary Ingredients:
Parameterized family {hθ : θ ∈ Rd}
Adapted, d-dimensional stochastic process {ζn}
Examples are TD- and Q-Learning
These algorithms are thus special cases of stochastic approximation (as we all know)
9 / 31
Reinforcement Learning MDP Theory
Stochastic Optimal Control
MDP Model
X is a stationary controlled Markov chain, with input U
For all states x and sets A,
P{X(n + 1) ∈ A | X(n) = x, U(n) = u, and prior history} = Pu(x, A)
c: X × U → R is a cost function
β < 1 a discount factor
Value function:
h∗(x) = min_U Σ_{n=0}^{∞} βn E[c(X(n), U(n)) | X(0) = x]
Bellman equation:
h∗(x) = min_u {c(x, u) + βE[h∗(X(n + 1)) | X(n) = x, U(n) = u]}
10 / 31
Reinforcement Learning Q-Learning
Q-function
Trick to swap expectation and minimum
Bellman equation:
h∗(x) = min_u {c(x, u) + βE[h∗(X(n + 1)) | X(n) = x, U(n) = u]}
Q-function:
Q∗(x, u) := c(x, u) + βE[h∗(X(n + 1)) | X(n) = x, U(n) = u]
h∗(x) = min_u Q∗(x, u)
Another Bellman equation:
Q∗(x, u) = c(x, u) + βE[Q∗(X(n + 1)) | X(n) = x, U(n) = u],  where Q∗(x) := min_u Q∗(x, u)
11 / 31
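When the model is known, this last fixed-point equation can be solved by successive approximation; a small sketch (editorial, not from the slides; Python/NumPy, with a randomly generated MDP standing in for a real model):

import numpy as np

# Q-value iteration:  (T q)(x,u) = c(x,u) + beta * sum_x' P_u(x,x') * min_u' q(x',u')
# T is a beta-contraction, so the iterates converge to Q* from any initialization.
rng = np.random.default_rng(3)
n_x, n_u, beta = 6, 2, 0.99
P = rng.random((n_u, n_x, n_x)); P /= P.sum(axis=2, keepdims=True)  # P[u, x, x']
c = rng.random((n_x, n_u))                                          # cost c(x, u)
Q = np.zeros((n_x, n_u))
for _ in range(2000):
    Q = c + beta * np.einsum('uxy,y->xu', P, Q.min(axis=1))
print(Q.min(axis=1))   # value function h*(x) = min_u Q*(x, u)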
Reinforcement Learning Q-Learning
Q-Learning and Galerkin Relaxation
Dynamic programming
Find function Q∗ that solves
E[c(X(n), U(n)) + βQ∗(X(n + 1)) − Q∗(X(n), U(n)) | Fn] = 0
That is,
0 = E[F(Q∗, Φ(n + 1)) | Φ0 . . . Φ(n)],  with Φ(n + 1) = (X(n + 1), X(n), U(n)).
Q-Learning
Find θ∗ that solves
E[(c(X(n), U(n)) + βQθ∗(X(n + 1)) − Qθ∗(X(n), U(n))) ζn] = 0
The family {Qθ} and eligibility vectors {ζn} are part of algorithm design.
12 / 31
Reinforcement Learning Q-Learning
Watkins’ Q-learning
Big Question: Can we Zap Q-Learning?
Find θ∗ that solves
E[(c(X(n), U(n)) + βQθ∗(X(n + 1)) − Qθ∗(X(n), U(n))) ζn] = 0
Watkins’ algorithm is Stochastic Approximation
The family {Qθ} and eligibility vectors {ζn} in this design:
Linearly parameterized family of functions: Qθ(x, u) = θT ψ(x, u)
ζn ≡ ψ(Xn, Un) and ψn(x, u) = 1{x = xn, u = un} (complete basis)
Asymptotic covariance is typically infinite
13 / 31
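A minimal sketch of the tabular algorithm (editorial, not from the slides; Python/NumPy, with a randomly generated MDP and uniformly randomized actions; the scalar gain g in αn = g/n is the design choice whose covariance consequences the slides discuss — the stochastic-shortest-path example later uses g = 1500):

import numpy as np

rng = np.random.default_rng(4)
n_x, n_u, beta, g = 6, 2, 0.99, 1.0
P = rng.random((n_u, n_x, n_x)); P /= P.sum(axis=2, keepdims=True)  # illustrative MDP
c = rng.random((n_x, n_u))
Q = np.zeros((n_x, n_u))                       # complete basis: one parameter per (x, u)
x = 0
for n in range(1, 200_000):
    u = rng.integers(n_u)                      # randomized exploration policy
    x_next = rng.choice(n_x, p=P[u, x])
    td = c[x, u] + beta * Q[x_next].min() - Q[x, u]   # temporal-difference term
    Q[x, u] += (g / n) * td                    # Watkins' update = SA step with zeta_n = psi(x, u)
    x = x_next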
Zap Q-Learning
(Figure: Bellman error vs. iteration n — Watkins, Speedy Q-learning, and Polyak-Ruppert averaging compared with Zap Q-learning)
Zap Q-Learning
Asymptotic Covariance of Watkins’ Q-Learning
Improvements are needed!
(Figure: six-state network example)
(Figure: histogram of the parameter estimate θn(15) after n = 10^6 iterations; the true value θ∗(15) ≈ 486.6)
Example from Devraj & M 2017
14 / 31
Zap Q-Learning Watkins’ algorithm
Zap Q-learning
Zap Q-Learning ≡ Zap-SNR for Q-Learning
0 = ¯f(θ) = E[f(θ, W(n))] := E[ζn (c(X(n), U(n)) + βQθ(X(n + 1)) − Qθ(X(n), U(n)))]
A(θ) = d/dθ ¯f(θ);  At points of differentiability:
A(θ) = E[ζn (βψ(X(n + 1), φθ(X(n + 1))) − ψ(X(n), U(n)))T]
φθ(X(n + 1)) := arg min_u Qθ(X(n + 1), u)
Algorithm:
θ(n + 1) = θ(n) + αn(−Ân)−1 f(θ(n), Φ(n));  Ân = Ân−1 + γn(An − Ân−1);
An+1 := d/dθ f(θn, Φ(n)) = ζn (βψ(X(n + 1), φθn(X(n + 1))) − ψ(X(n), U(n)))T
15 / 31
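A minimal sketch of the matrix-gain recursion in the complete-basis (tabular) case (editorial, not from the slides; Python/NumPy with a randomly generated MDP; the pseudo-inverse in place of (−Ân)−1 is an extra safeguard for the early, low-rank iterates):

import numpy as np

rng = np.random.default_rng(5)
n_x, n_u, beta, rho = 6, 2, 0.99, 0.85
P = rng.random((n_u, n_x, n_x)); P /= P.sum(axis=2, keepdims=True)  # illustrative MDP
cost = rng.random((n_x, n_u))
d = n_x * n_u
theta = np.zeros(d)                            # Q_theta(x,u) = theta[x*n_u + u] (complete basis)
A_hat = -np.eye(d)                             # matrix estimate of A(theta)
x = 0
for n in range(1, 100_000):
    u = rng.integers(n_u)
    x_next = rng.choice(n_x, p=P[u, x])
    Q = theta.reshape(n_x, n_u)
    u_next = Q[x_next].argmin()                # phi_theta(x') = argmin_u Q_theta(x', u)
    zeta = np.zeros(d); zeta[x * n_u + u] = 1.0                 # psi(X(n), U(n))
    psi_next = np.zeros(d); psi_next[x_next * n_u + u_next] = 1.0
    f = zeta * (cost[x, u] + beta * Q[x_next, u_next] - Q[x, u])
    A_sample = np.outer(zeta, beta * psi_next - zeta)           # sample of A(theta), per the slide
    gamma, alpha = n ** (-rho), 1.0 / n
    A_hat += gamma * (A_sample - A_hat)        # fast time scale: matrix gain estimate
    theta += alpha * (np.linalg.pinv(-A_hat) @ f)               # slow time scale: Zap step
    x = x_next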
Zap Q-Learning Watkins’ algorithm
Zap Q-learning
Zap Q-Learning ≡ Zap-SNR for Q-Learning
ODE Analysis: change of variables q = Q∗(ς)
Functional Q∗ maps cost functions to Q-functions:
q(x, u) = ς(x, u) + β Σ_{x'} Pu(x, x') min_{u'} q(x', u')
ODE for Zap-Q
qt = Q∗(ςt),  d/dt ςt = −ςt + c  ⇒ convergence, optimal covariance, ...
16 / 31
Zap Q-Learning Watkins’ algorithm
Zap Q-Learning
Example: Stochastic Shortest Path
(Figure: six-state network example)
Convergence with Zap gain γn = n−0.85
Watkins’ algorithm has infinite asymptotic covariance with αn = 1/n
Optimal scalar gain is approximately αn = 1500/n
(Figure: Convergence of Zap-Q learning — Bellman error vs. n for Watkins with g = 1500, Speedy Q-learning, Polyak-Ruppert averaging, and Zap with γn = αn)
Discount factor: β = 0.99
17 / 31
Zap Q-Learning Watkins’ algorithm
Zap Q-Learning
Optimize Walk to Cafe
(Figure: six-state network example)
Convergence with Zap gain γn = n−0.85
(Figure: theoretical vs. experimental pdfs of Wn = √n ˜θn for entries #10 and #18, at n = 10^4 and n = 10^6; empirical over 1000 trials)
CLT gives good prediction of finite-n performance
Discount factor: β = 0.99
18 / 31
Zap Q-Learning Watkins’ algorithm
Zap Q-Learning
Optimize Walk to Cafe
Local Convergence: θ(0) initialized in a neighborhood of θ∗
(Figure: Bellman error vs. n, with histograms at n = 10^6 — Watkins with scalar gains g = 500, 1500, 5000, Speedy Q-learning, Polyak-Ruppert averaging, and Zap-Q with γn ≡ αn and γn ≡ αn^0.85)
2σ confidence intervals for the Q-learning algorithms
20 / 31
Zap Q-Learning Optimal stopping
Zap Q-Learning
Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance
State space: R^100. Parameterized Q-function: Qθ with θ ∈ R^10
(Figure: eigenvalues λi(A) — every eigenvalue is real and satisfies λ > −1/2, so the asymptotic covariance is infinite)
Authors observed slow convergence; proposed a matrix gain sequence {Gn} (see refs for details)
(Figure: eigenvalues of A and GA for the finance example)
Favorite choice of gain in [23] barely meets the criterion Re(λ(GA)) < −1/2
21 / 31
Zap Q-Learning Optimal stopping
Zap Q-Learning
Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance
State space: R^100. Parameterized Q-function: Qθ with θ ∈ R^10
(Figure: theoretical vs. experimental pdfs of Wn = √n ˜θn for Zap-Q and G-Q, entries #1 and #7, at n = 2 × 10^6; empirical over 1000 trials)
22 / 31
Zap Q-Learning Optimal stopping
Zap Q-Learning
Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance
State space: R^100. Parameterized Q-function: Qθ with θ ∈ R^10
Histograms of the average reward obtained using the different algorithms:
(Figure: histograms of average reward at n = 2 × 10^4, 2 × 10^5, and 2 × 10^6, for G-Q(0) with g = 100 and g = 200, and Zap-Q with ρ = 0.8, 0.85, and 1.0)
23 / 31
Conclusions & Future Work
Conclusions
Reinforcement Learning is not just cursed by dimension, but also by variance
We need better design tools to improve performance
The asymptotic covariance is an awesome design tool. It is also predictive of finite-n performance.
Example: g∗ = 1500 was chosen based on asymptotic covariance
Future work:
Q-learning with function-approximation
Obtain conditions for a stable algorithm in a general setting
Optimal stopping time problems
Adaptive optimization of algorithm parameters
Finite-time analysis
24 / 31
Conclusions & Future Work
Thank you!
25 / 31
References
(Cover images: Control Techniques for Complex Networks, Sean Meyn; Markov Chains and Stochastic Stability, S. P. Meyn and R. L. Tweedie)
References
26 / 31
References
This lecture
A. M. Devraj and S. P. Meyn, Zap Q-learning. Advances in Neural
Information Processing Systems (NIPS). Dec. 2017.
A. M. Devraj and S. P. Meyn, Fastest convergence for Q-learning. Available
on ArXiv. Jul. 2017.
27 / 31
References
Selected References I
[1] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv , July 2017.
[2] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic
approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag,
Berlin, 1990. Translated from the French by Stephen S. Wilson.
[3] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan
Book Agency and Cambridge University Press (jointly), Delhi, India and Cambridge, UK,
2008.
[4] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic
approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000.
[5] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge
University Press, Cambridge, second edition, 2009. Published in the Cambridge
Mathematical Library.
[6] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007.
See last chapter on simulation and average-cost TD learning
28 / 31
References
Selected References II
[7] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure.
The Annals of Statistics, 13(1):236–245, 1985.
[8] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro processes.
Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research
and Industrial Engineering, Ithaca, NY, 1988.
[9] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i
telemekhanika (in Russian). translated in Automat. Remote Control, 51 (1991), pages
98–107, 1990.
[10] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging.
SIAM J. Control Optim., 30(4):838–855, 1992.
[11] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic
approximation. Ann. Appl. Probab., 14(2):796–819, 2004.
[12] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation
algorithms for machine learning. In Advances in Neural Information Processing Systems
24, pages 451–459. Curran Associates, Inc., 2011.
29 / 31
References
Selected References III
[13] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial
Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
[14] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College,
Cambridge, Cambridge, UK, 1989.
[15] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
[16] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn.,
3(1):9–44, 1988.
[17] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function
approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.
[18] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th
Internat. Conf. on Neural Info. Proc. Systems, pages 1064–1070. MIT Press, 1997.
[19] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In
Advances in Neural Information Processing Systems, 2011.
[20] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning
Research, 5(Dec):1–25, 2003.
30 / 31
References
Selected References IV
[21] D. Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana. Feature selection for
neuro-dynamic programming. In F. Lewis, editor, Reinforcement Learning and
Approximate Dynamic Programming for Feedback Control. Wiley, 2011.
[22] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space
theory, approximation algorithms, and an application to pricing high-dimensional financial
derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999.
[23] D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and
efficient temporal-difference learning. Discrete Event Dynamic Systems: Theory and
Applications, 16(2):207–239, 2006.
[24] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference
learning. Mach. Learn., 22(1-3):33–57, 1996.
[25] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn.,
49(2-3):233–246, 2002.
[26] A. Nedic and D. Bertsekas. Least squares policy evaluation algorithms with linear function
approximation. Discrete Event Dyn. Systems: Theory and Appl., 13(1-2):79–110, 2003.
[27] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In IEEE
Conference on Decision and Control, pages 3598–3605, Dec. 2009.
31 / 31

More Related Content

PDF
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기
PDF
NGSを用いたジェノタイピングを様々な解析に用いるには?
PDF
PR 127: FaceNet
PDF
Why Batch Normalization Works so Well
PDF
Faster R-CNN: Towards real-time object detection with region proposal network...
PDF
Introduction to A3C model
PDF
Distributed deep learning
PPT
AlphaGo Zero 解説
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기
NGSを用いたジェノタイピングを様々な解析に用いるには?
PR 127: FaceNet
Why Batch Normalization Works so Well
Faster R-CNN: Towards real-time object detection with region proposal network...
Introduction to A3C model
Distributed deep learning
AlphaGo Zero 解説

What's hot (20)

PPTX
3D Gaussian Splatting
PDF
画像認識のための深層学習
PDF
Single Shot Multibox Detector
PPTX
[DL輪読会]Approximating CNNs with Bag-of-local-Features models works surprisingl...
PPTX
Faster rcnn
PDF
直交領域探索
PDF
高速フーリエ変換
PDF
Introduction of Faster R-CNN
PDF
Deep Learning for Computer Vision: Object Detection (UPC 2016)
PDF
An introduction to deep reinforcement learning
PDF
Object Detection Using R-CNN Deep Learning Framework
PPTX
Deep Reinforcement Learning
PPTX
Convolutional neural network from VGG to DenseNet
PDF
Kaggleのテクニック
PDF
画像キャプションと動作認識の最前線 〜データセットに注目して〜(第17回ステアラボ人工知能セミナー)
PDF
Actor critic algorithm
PDF
アドテクにおけるBandit Algorithmの活用
PPTX
Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recog...
PPTX
AlexNet, VGG, GoogleNet, Resnet
PDF
Brief introduction on GAN
3D Gaussian Splatting
画像認識のための深層学習
Single Shot Multibox Detector
[DL輪読会]Approximating CNNs with Bag-of-local-Features models works surprisingl...
Faster rcnn
直交領域探索
高速フーリエ変換
Introduction of Faster R-CNN
Deep Learning for Computer Vision: Object Detection (UPC 2016)
An introduction to deep reinforcement learning
Object Detection Using R-CNN Deep Learning Framework
Deep Reinforcement Learning
Convolutional neural network from VGG to DenseNet
Kaggleのテクニック
画像キャプションと動作認識の最前線 〜データセットに注目して〜(第17回ステアラボ人工知能セミナー)
Actor critic algorithm
アドテクにおけるBandit Algorithmの活用
Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recog...
AlexNet, VGG, GoogleNet, Resnet
Brief introduction on GAN
Ad

Similar to Introducing Zap Q-Learning (20)

PDF
Zap Q-Learning - ISMP 2018
PDF
Reinforcement Learning: Hidden Theory and New Super-Fast Algorithms
PDF
DeepLearn2022 2. Variance Matters
PDF
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
PDF
QMC: Operator Splitting Workshop, Stochastic Block-Coordinate Fixed Point Alg...
PDF
A Stochastic Iteration Method for A Class of Monotone Variational Inequalitie...
PDF
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
PDF
MUMS: Bayesian, Fiducial, and Frequentist Conference - Coverage of Credible I...
PDF
QMC: Transition Workshop - Approximating Multivariate Functions When Function...
PDF
Stochastic optimal control &amp; rl
PDF
2019 PMED Spring Course - SMARTs-Part II - Eric Laber, April 10, 2019
PDF
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...
PDF
Minimax statistical learning with Wasserstein distances (NeurIPS2018 Reading ...
PDF
Stochastic Approximation and Simulated Annealing
PDF
Overview of Stochastic Calculus Foundations
PDF
Asymptotics of ABC, lecture, Collège de France
PDF
Statistical Inference Using Stochastic Gradient Descent
PDF
Statistical Inference Using Stochastic Gradient Descent
PDF
KAUST_talk_short.pdf
PDF
Lec_13.pdf
Zap Q-Learning - ISMP 2018
Reinforcement Learning: Hidden Theory and New Super-Fast Algorithms
DeepLearn2022 2. Variance Matters
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
QMC: Operator Splitting Workshop, Stochastic Block-Coordinate Fixed Point Alg...
A Stochastic Iteration Method for A Class of Monotone Variational Inequalitie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Coverage of Credible I...
QMC: Transition Workshop - Approximating Multivariate Functions When Function...
Stochastic optimal control &amp; rl
2019 PMED Spring Course - SMARTs-Part II - Eric Laber, April 10, 2019
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...
Minimax statistical learning with Wasserstein distances (NeurIPS2018 Reading ...
Stochastic Approximation and Simulated Annealing
Overview of Stochastic Calculus Foundations
Asymptotics of ABC, lecture, Collège de France
Statistical Inference Using Stochastic Gradient Descent
Statistical Inference Using Stochastic Gradient Descent
KAUST_talk_short.pdf
Lec_13.pdf
Ad

More from Sean Meyn (20)

PDF
DeepLearn2022 1. Goals & AlgorithmDesign.pdf
PDF
DeepLearn2022 3. TD and Q Learning
PDF
Smart Grid Tutorial - January 2019
PDF
State Space Collapse in Resource Allocation for Demand Dispatch - May 2019
PDF
Irrational Agents and the Power Grid
PDF
State estimation and Mean-Field Control with application to demand dispatch
PDF
Demand-Side Flexibility for Reliable Ancillary Services
PDF
Spectral Decomposition of Demand-Side Flexibility for Reliable Ancillary Serv...
PDF
Demand-Side Flexibility for Reliable Ancillary Services in a Smart Grid: Elim...
PDF
Why Do We Ignore Risk in Power Economics?
PDF
Distributed Randomized Control for Ancillary Service to the Power Grid
PDF
Ancillary service to the grid from deferrable loads: the case for intelligent...
PDF
2012 Tutorial: Markets for Differentiated Electric Power Products
PDF
Control Techniques for Complex Systems
PDF
Tutorial for Energy Systems Week - Cambridge 2010
PDF
Panel Lecture for Energy Systems Week
PDF
The Value of Volatile Resources... Caltech, May 6 2010
PDF
Approximate dynamic programming using fluid and diffusion approximations with...
PDF
Anomaly Detection Using Projective Markov Models
PDF
Markov Tutorial CDC Shanghai 2009
DeepLearn2022 1. Goals & AlgorithmDesign.pdf
DeepLearn2022 3. TD and Q Learning
Smart Grid Tutorial - January 2019
State Space Collapse in Resource Allocation for Demand Dispatch - May 2019
Irrational Agents and the Power Grid
State estimation and Mean-Field Control with application to demand dispatch
Demand-Side Flexibility for Reliable Ancillary Services
Spectral Decomposition of Demand-Side Flexibility for Reliable Ancillary Serv...
Demand-Side Flexibility for Reliable Ancillary Services in a Smart Grid: Elim...
Why Do We Ignore Risk in Power Economics?
Distributed Randomized Control for Ancillary Service to the Power Grid
Ancillary service to the grid from deferrable loads: the case for intelligent...
2012 Tutorial: Markets for Differentiated Electric Power Products
Control Techniques for Complex Systems
Tutorial for Energy Systems Week - Cambridge 2010
Panel Lecture for Energy Systems Week
The Value of Volatile Resources... Caltech, May 6 2010
Approximate dynamic programming using fluid and diffusion approximations with...
Anomaly Detection Using Projective Markov Models
Markov Tutorial CDC Shanghai 2009

Recently uploaded (20)

PPTX
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
PPTX
climate analysis of Dhaka ,Banglades.pptx
PPTX
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
PPT
ISS -ESG Data flows What is ESG and HowHow
PPTX
Qualitative Qantitative and Mixed Methods.pptx
PDF
annual-report-2024-2025 original latest.
PPTX
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
PPTX
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
PPTX
oil_refinery_comprehensive_20250804084928 (1).pptx
PDF
Business Analytics and business intelligence.pdf
PDF
Foundation of Data Science unit number two notes
PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
PDF
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
PPTX
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PDF
.pdf is not working space design for the following data for the following dat...
PDF
Clinical guidelines as a resource for EBP(1).pdf
PDF
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
PDF
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
climate analysis of Dhaka ,Banglades.pptx
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
ISS -ESG Data flows What is ESG and HowHow
Qualitative Qantitative and Mixed Methods.pptx
annual-report-2024-2025 original latest.
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
oil_refinery_comprehensive_20250804084928 (1).pptx
Business Analytics and business intelligence.pdf
Foundation of Data Science unit number two notes
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
.pdf is not working space design for the following data for the following dat...
Clinical guidelines as a resource for EBP(1).pdf
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf

Introducing Zap Q-Learning

  • 1. Zap Q-Learning Reinforcement Learning: Hidden Theory, and New Super-Fast Algorithms Center for Systems and Control (CSC@USC) and Ming Hsieh Institute for Electrical Engineering February 21, 2018 Adithya M. Devraj Sean P. Meyn Department of Electrical and Computer Engineering — University of Florida
  • 2. Zap Q-Learning Outline 1 Stochastic Approximation 2 Fastest Stochastic Approximation 3 Reinforcement Learning 4 Zap Q-Learning 5 Conclusions & Future Work 6 References
  • 4. Stochastic Approximation Basic Algorithm What is Stochastic Approximation? A simple goal: Find the solution θ∗ to ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 1 / 31
  • 5. Stochastic Approximation Basic Algorithm What is Stochastic Approximation? A simple goal: Find the solution θ∗ to ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 What makes this hard? 1 / 31
  • 6. Stochastic Approximation Basic Algorithm What is Stochastic Approximation? A simple goal: Find the solution θ∗ to ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 What makes this hard? 1 The function f and the distribution of the random vector W may not be known – we may only know something about the structure of the problem 1 / 31
  • 7. Stochastic Approximation Basic Algorithm What is Stochastic Approximation? A simple goal: Find the solution θ∗ to ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 What makes this hard? 1 The function f and the distribution of the random vector W may not be known – we may only know something about the structure of the problem 2 Even if everything is known, computation of the expectation may be expensive. For root finding, we may need to compute the expectation for many values of θ 1 / 31
  • 8. Stochastic Approximation Basic Algorithm What is Stochastic Approximation? A simple goal: Find the solution θ∗ to ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 What makes this hard? 1 The function f and the distribution of the random vector W may not be known – we may only know something about the structure of the problem 2 Even if everything is known, computation of the expectation may be expensive. For root finding, we may need to compute the expectation for many values of θ 3 Motivates stochastic approximation: θ(n + 1) = θ(n) + αnf(θ(n), W(n)) The recursive algorithms we come up with are often slow, and their variance may be infinite: typical in Q-learning [Devraj & M 2017] 1 / 31
  • 9. Stochastic Approximation ODE Method Algorithm and Convergence Analysis Algorithm: θ(n + 1) = θ(n) + αnf(θ(n), W(n)) Goal: ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 Interpretation: θ∗ ≡ stationary point of the ODE d dt ϑ(t) = ¯f(ϑ(t)) 2 / 31
  • 10. Stochastic Approximation ODE Method Algorithm and Convergence Analysis Algorithm: θ(n + 1) = θ(n) + αnf(θ(n), W(n)) Goal: ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 Interpretation: θ∗ ≡ stationary point of the ODE d dt ϑ(t) = ¯f(ϑ(t)) Analysis: Stability of the ODE ⊕ (See Borkar’s monograph) =⇒ lim n→∞ θ(n) = θ∗ 2 / 31
  • 11. Stochastic Approximation SA Example Stochastic Approximation Example Example: Monte-Carlo Monte-Carlo Estimation Estimate the mean η = c(X), where X is a random variable: η = c(x) fX(x) dx 3 / 31
  • 12. Stochastic Approximation SA Example Stochastic Approximation Example Example: Monte-Carlo Monte-Carlo Estimation Estimate the mean η = c(X), where X is a random variable SA interpretation: Find θ∗ solving 0 = E[f(θ, X)] = E[c(X) − θ] Algorithm: θ(n) = 1 n n i=1 c(X(i)) 3 / 31
  • 13. Stochastic Approximation SA Example Stochastic Approximation Example Example: Monte-Carlo Monte-Carlo Estimation Estimate the mean η = c(X), where X is a random variable SA interpretation: Find θ∗ solving 0 = E[f(θ, X)] = E[c(X) − θ] Algorithm: θ(n) = 1 n n i=1 c(X(i)) =⇒ (n + 1)θ(n + 1) = n+1 i=1 c(X(i)) = nθ(n) + c(X(n + 1)) =⇒ (n + 1)θ(n + 1) = (n + 1)θ(n) + [c(X(n + 1)) − θ(n)] SA Recursion: θ(n + 1) = θ(n) + αnf(θ(n), X(n + 1)) αn = ∞, α2 n < ∞ 3 / 31
  • 15. Fastest Stochastic Approximation Algorithm Performance Performance Criteria Two standard approaches to evaluate performance, ˜θ(n) := θ(n) − θ∗: 1 Finite-n bound: P{ ˜θ(n) ≥ ε} ≤ exp(−I(ε, n)) , I(ε, n) = O(nε2 ) 2 Asymptotic covariance: Σ = lim n→∞ nE ˜θ(n)˜θ(n)T , √ n˜θ(n) ≈ N(0, Σ) 4 / 31
  • 16. Fastest Stochastic Approximation Algorithm Performance Asymptotic Covariance Σ = lim n→∞ Σn = lim n→∞ nE ˜θ(n)˜θ(n)T , √ n˜θ(n) ≈ N(0, Σ) SA recursion for covariance: Σn+1 ≈ Σn + 1 n (A + 1 2I)Σn + Σn(A + 1 2I)T + Σ∆ A = d dθ ¯f (θ∗) Conclusions 1 If Re λ(A) ≥ −1 2 for some eigenvalue then Σ is (typically) infinite 2 If Re λ(A) < −1 2 for all, then Σ = limn→∞ Σn is the unique solution to the Lyapunov equation: 0 = (A + 1 2I)Σ + Σ(A + 1 2I)T + Σ∆ 5 / 31
  • 17. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance Introduce a d × d matrix gain sequence {Gn}: θ(n + 1) = θ(n) + 1 n + 1 Gnf(θ(n), X(n)) 6 / 31
  • 18. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance Introduce a d × d matrix gain sequence {Gn}: θ(n + 1) = θ(n) + 1 n + 1 Gnf(θ(n), X(n)) Assume it converges, and linearize: ˜θ(n + 1) ≈ ˜θ(n) + 1 n + 1 G A˜θ(n) + ∆(n + 1) , A = d dθ ¯f (θ∗ ) . 6 / 31
  • 19. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance Introduce a d × d matrix gain sequence {Gn}: θ(n + 1) = θ(n) + 1 n + 1 Gnf(θ(n), X(n)) Assume it converges, and linearize: ˜θ(n + 1) ≈ ˜θ(n) + 1 n + 1 G A˜θ(n) + ∆(n + 1) , A = d dθ ¯f (θ∗ ) . If G = G∗ := −A−1 then Resembles Monte-Carlo estimate Resembles Newton-Rapshon It is optimal: Σ∗ = G∗Σ∆G∗T ≤ ΣG any other G 6 / 31
  • 20. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance Introduce a d × d matrix gain sequence {Gn}: θ(n + 1) = θ(n) + 1 n + 1 Gnf(θ(n), X(n)) Assume it converges, and linearize: ˜θ(n + 1) ≈ ˜θ(n) + 1 n + 1 G A˜θ(n) + ∆(n + 1) , A = d dθ ¯f (θ∗ ) . If G = G∗ := −A−1 then Resembles Monte-Carlo estimate Resembles Newton-Rapshon It is optimal: Σ∗ = G∗Σ∆G∗T ≤ ΣG any other G Polyak-Ruppert averaging is also optimal, but first two bullets are missing. 6 / 31
  • 21. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance Example: return to Monte-Carlo θ(n + 1) = θ(n) + g n + 1 −θ(n) + X(n + 1) 7 / 31
  • 22. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance Example: return to Monte-Carlo θ(n + 1) = θ(n) + g n + 1 −θ(n) + X(n + 1) ∆(n) = X(n) − E[X(n)] 7 / 31
  • 23. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance ∆(n) = X(n) − E[X(n)]Normalization for analysis: ˜θ(n + 1) = ˜θ(n) + g n + 1 −˜θ(n) + ∆(n + 1) 7 / 31
  • 24. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance ∆(n) = X(n) − E[X(n)]Normalization for analysis: ˜θ(n + 1) = ˜θ(n) + g n + 1 −˜θ(n) + ∆(n + 1) Example: X(n) = W2(n), W ∼ N(0, 1) 7 / 31
  • 25. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance ∆(n) = X(n) − E[X(n)]Normalization for analysis: ˜θ(n + 1) = ˜θ(n) + g n + 1 −˜θ(n) + ∆(n + 1) Example: X(n) = W2(n), W ∼ N(0, 1) 0 1 2 3 4 5 g σ2 ∆ Σ = σ2 ∆ 2 g2 g − 1/2 Asymptotic variance as a function of g 7 / 31
  • 26. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance ∆(n) = X(n) − E[X(n)]Normalization for analysis: ˜θ(n + 1) = ˜θ(n) + g n + 1 −˜θ(n) + ∆(n + 1) Example: X(n) = W2(n), W ∼ N(0, 1) 0 1 2 3 4 5 t 104 0.4 0.6 0.8 1 1.2 (t) 20 30.8 10 15.8 1 3 0.5 0.1 g SA estimates of E[W2 ], W ∼ N(0, 1) 7 / 31
  • 27. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) Requires An ≈ A(θn) := d dθ ¯f (θn) 8 / 31
  • 28. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) θ(n + 1) = θ(n) + αn(−An)−1 f(θ(n), X(n)) An = An−1 + γn(An − An−1), An = d dθ f(θ(n), X(n)) 8 / 31
  • 29. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) θ(n + 1) = θ(n) + αn(−An)−1 f(θ(n), X(n)) An = An−1 + γn(An − An−1), An = d dθ f(θ(n), X(n)) An ≈ A(θn) requires high-gain, γn αn → ∞, n → ∞ 8 / 31
  • 30. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) θ(n + 1) = θ(n) + αn(−An)−1 f(θ(n), X(n)) An = An−1 + γn(An − An−1), An = d dθ f(θ(n), X(n)) An ≈ A(θn) requires high-gain, γn αn → ∞, n → ∞ Always: αn = 1/n. Numerics that follow: γn = (1/n)ρ, ρ ∈ (0.5, 1) 8 / 31
  • 31. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) θ(n + 1) = θ(n) + αn(−An)−1 f(θ(n), X(n)) An = An−1 + γn(An − An−1), An = d dθ f(θ(n), X(n)) An ≈ A(θn) requires high-gain, γn αn → ∞, n → ∞ Always: αn = 1/n. Numerics that follow: γn = (1/n)ρ, ρ ∈ (0.5, 1) ODE for Zap-SNR d dt xt = (−A(xt))−1 ¯f (xt), A(x) = d dx ¯f (x) 8 / 31
  • 32. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) θ(n + 1) = θ(n) + αn(−An)−1 f(θ(n), X(n)) An = An−1 + γn(An − An−1), An = d dθ f(θ(n), X(n)) An ≈ A(θn) requires high-gain, γn αn → ∞, n → ∞ Always: αn = 1/n. Numerics that follow: γn = (1/n)ρ, ρ ∈ (0.5, 1) ODE for Zap-SNR d dt xt = (−A(xt))−1 ¯f (xt), A(x) = d dx ¯f (x) Not necessarily stable (just like in deterministic Newton-Raphson) General conditions for convergence is open 8 / 31
  • 35. Reinforcement Learning RL & SA SA and RL Design Functional equations in Stochastic Control Always of the form 0 = E[F(h∗ , Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗ = ? 9 / 31
  • 36. Reinforcement Learning RL & SA SA and RL Design Functional equations in Stochastic Control Always of the form 0 = E[F(h∗ , Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗ = ? Φ(n) = (state, action) 9 / 31
  • 37. Reinforcement Learning RL & SA SA and RL Design Functional equations in Stochastic Control Always of the form 0 = E[F(h∗ , Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗ = ? Galerkin relaxation: 0 = E[F(hθ∗ , Φ(n + 1))ζn] , θ∗ = ? 9 / 31
  • 38. Reinforcement Learning RL & SA SA and RL Design Functional equations in Stochastic Control Always of the form 0 = E[F(h∗ , Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗ = ? Galerkin relaxation: 0 = E[F(hθ∗ , Φ(n + 1))ζn] , θ∗ = ? Necessary Ingredients: Parameterized family {hθ : θ ∈ Rd} Adapted, d-dimensional stochastic process {ζn} Examples are TD- and Q-Learning 9 / 31
  • 39. Reinforcement Learning RL & SA SA and RL Design Functional equations in Stochastic Control Always of the form 0 = E[F(h∗ , Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗ = ? Galerkin relaxation: 0 = E[F(hθ∗ , Φ(n + 1))ζn] , θ∗ = ? Necessary Ingredients: Parameterized family {hθ : θ ∈ Rd} Adapted, d-dimensional stochastic process {ζn} Examples are TD- and Q-Learning These algorithms are thus special cases of stochastic approximation (as we all know) 9 / 31
  • 40. Reinforcement Learning MDP Theory Stochastic Optimal Control MDP Model X is a stationary controlled Markov chain, with input U For all states x and sets A, P{X(n + 1) ∈ A | X(n) = x, U(n) = u, and prior history} = Pu(x, A) c: X × U → R is a cost function β < 1 a discount factor 10 / 31
  • 41. Reinforcement Learning MDP Theory Stochastic Optimal Control MDP Model X is a stationary controlled Markov chain, with input U For all states x and sets A, P{X(n + 1) ∈ A | X(n) = x, U(n) = u, and prior history} = Pu(x, A) c: X × U → R is a cost function β < 1 a discount factor Value function: h∗ (x) = min U ∞ n=0 βn E[c(X(n), U(n)) | X(0) = x] 10 / 31
  • 42. Reinforcement Learning MDP Theory Stochastic Optimal Control MDP Model X is a stationary controlled Markov chain, with input U For all states x and sets A, P{X(n + 1) ∈ A | X(n) = x, U(n) = u, and prior history} = Pu(x, A) c: X × U → R is a cost function β < 1 a discount factor Value function: h∗ (x) = min U ∞ n=0 βn E[c(X(n), U(n)) | X(0) = x] Bellman equation: h∗ (x) = min u {c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u]} 10 / 31
  • 43. Reinforcement Learning Q-Learning Q-function Trick to swap expectation and minimum Bellman equation: h∗ (x) = min u {c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u]} 11 / 31
  • 44. Reinforcement Learning Q-Learning Q-function Trick to swap expectation and minimum Bellman equation: h∗ (x) = min u {c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u]} Q-function: Q∗ (x, u) := c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u] 11 / 31
  • 45. Reinforcement Learning Q-Learning Q-function Trick to swap expectation and minimum Bellman equation: h∗ (x) = min u {c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u]} Q-function: Q∗ (x, u) := c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u] h∗ (x) = min u Q∗ (x, u) 11 / 31
  • 46. Reinforcement Learning Q-Learning Q-function Trick to swap expectation and minimum Bellman equation: h∗ (x) = min u {c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u]} Q-function: Q∗ (x, u) := c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u] h∗ (x) = min u Q∗ (x, u) Another Bellman equation: Q∗ (x, u) = c(x, u) + βE[Q∗ (X(n + 1)) | X(n) = x, U(n) = u] Q∗ (x) = min u Q∗ (x, u) 11 / 31
  • 47. Reinforcement Learning Q-Learning Q-Learning and Galerkin Relaxation Dynamic programming Find function Q∗ that solves E c(X(n), U(n)) + βQ∗ (X(n + 1)) − Q∗ (X(n), U(n)) | Fn = 0 12 / 31
  • 48. Reinforcement Learning Q-Learning Q-Learning and Galerkin Relaxation Dynamic programming Find function Q∗ that solves E c(X(n), U(n)) + βQ∗ (X(n + 1)) − Q∗ (X(n), U(n)) | Fn = 0 That is, 0 = E[F(Q∗ , Φ(n + 1)) | Φ0 . . . Φ(n)] , with Φ(n + 1) = (X(n + 1), X(n), U(n)). 12 / 31
  • 49. Reinforcement Learning Q-Learning Q-Learning and Galerkin Relaxation Dynamic programming Find function Q∗ that solves E c(X(n), U(n)) + βQ∗ (X(n + 1)) − Q∗ (X(n), U(n)) | Fn = 0 Q-Learning Find θ∗ that solves E c(X(n), U(n)) + βQθ∗ ((X(n + 1)) − Qθ∗ ((X(n), U(n)) ζn = 0 The family {Qθ} and eligibility vectors {ζn} are part of algorithm design. 12 / 31
  • 50. Reinforcement Learning Q-Learning Watkins’ Q-learning Find θ∗ that solves E c(X(n), U(n)) + βQθ∗ ((X(n + 1)) − Qθ∗ ((X(n), U(n)) ζn = 0 13 / 31
  • 51. Reinforcement Learning Q-Learning Watkins’ Q-learning Find θ∗ that solves E c(X(n), U(n)) + βQθ∗ ((X(n + 1)) − Qθ∗ ((X(n), U(n)) ζn = 0 Watkin’s algorithm is Stochastic Approximation The family {Qθ} and eligibility vectors {ζn} in this design: Linearly parameterized family of functions: Qθ(x, u) = θT ψ(x, u) ζn ≡ ψ(Xn, Un) and ψn(x, u) = 1{x = xn, u = un} (complete basis) 13 / 31
  • 52. Reinforcement Learning Q-Learning Watkins’ Q-learning Find θ∗ that solves E c(X(n), U(n)) + βQθ∗ ((X(n + 1)) − Qθ∗ ((X(n), U(n)) ζn = 0 Watkin’s algorithm is Stochastic Approximation The family {Qθ} and eligibility vectors {ζn} in this design: Linearly parameterized family of functions: Qθ(x, u) = θT ψ(x, u) ζn ≡ ψ(Xn, Un) and ψn(x, u) = 1{x = xn, u = un} (complete basis) Asymptotic covariance is typically infinite 13 / 31
  • 53. Reinforcement Learning Q-Learning Watkins’ Q-learning Big Question: Can we Zap Q-Learning? Find θ∗ that solves E c(X(n), U(n)) + βQθ∗ ((X(n + 1)) − Qθ∗ ((X(n), U(n)) ζn = 0 Watkin’s algorithm is Stochastic Approximation The family {Qθ} and eligibility vectors {ζn} in this design: Linearly parameterized family of functions: Qθ(x, u) = θT ψ(x, u) ζn ≡ ψ(Xn, Un) and ψn(x, u) = 1{x = xn, u = un} (complete basis) Asymptotic covariance is typically infinite 13 / 31
  • 54. 0 1 2 3 4 5 6 7 8 9 10 105 0 20 40 60 80 100 Watkins, Speedy Q-learning, Polyak-Ruppert Averaging Zap BellmanError n Zap Q-Learning
  • 55. Zap Q-Learning Asymptotic Covariance of Watkins’ Q-Learning Improvements are needed! 1 4 65 3 2 Histogram of parameter estimates after 106 iterations. 1000 200 300 400 486.6 0 10 20 30 40 n = 106 Histogram for θ θ* n(15) (15) Example from Devraj & M 2017 14 / 31
  • 56. Zap Q-Learning Watkin’s algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning 0 = ¯f(θ) = E f(θ, W(n)) := E ζn c(X(n), U(n)) + βQθ (X(n + 1)) − Qθ (X(n), U(n)) 15 / 31
  • 57. Zap Q-Learning Watkin’s algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning 0 = ¯f(θ) = E f(θ, W(n)) := E ζn c(X(n), U(n)) + βQθ (X(n + 1)) − Qθ (X(n), U(n)) A(θ) = d dθ ¯f (θ); 15 / 31
  • 58. Zap Q-Learning Watkin’s algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning 0 = ¯f(θ) = E f(θ, W(n)) := E ζn c(X(n), U(n)) + βQθ (X(n + 1)) − Qθ (X(n), U(n)) A(θ) = d dθ ¯f (θ); At points of differentiability: A(θ) = E ζn βψ(X(n + 1), φθ (X(n + 1))) − ψ(X(n), U(n)) T φθ (X(n + 1)) := arg min u Qθ (X(n + 1), u) 15 / 31
  • 59. Zap Q-Learning Watkin’s algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning 0 = ¯f(θ) = E f(θ, W(n)) := E ζn c(X(n), U(n)) + βQθ (X(n + 1)) − Qθ (X(n), U(n)) A(θ) = d dθ ¯f (θ); At points of differentiability: A(θ) = E ζn βψ(X(n + 1), φθ (X(n + 1))) − ψ(X(n), U(n)) T φθ (X(n + 1)) := arg min u Qθ (X(n + 1), u) Algorithm: θ(n + 1)= θ(n) + αn(−An)−1 (f(θ(n), Φ(n))); An = An−1 + γn(An − An−1); 15 / 31
  • 60. Zap Q-Learning Watkin’s algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning 0 = ¯f(θ) = E f(θ, W(n)) := E ζn c(X(n), U(n)) + βQθ (X(n + 1)) − Qθ (X(n), U(n)) A(θ) = d dθ ¯f (θ); At points of differentiability: A(θ) = E ζn βψ(X(n + 1), φθ (X(n + 1))) − ψ(X(n), U(n)) T φθ (X(n + 1)) := arg min u Qθ (X(n + 1), u) Algorithm: θ(n + 1)= θ(n) + αn(−An)−1 (f(θ(n), Φ(n))); An = An−1 + γn(An − An−1); An+1 := d dθ f (θn, Φ(n)) = ζn βψ(X(n + 1), φθn (X(n + 1))) − ψ(X(n), U(n)) T 15 / 31
  • 61. Zap Q-Learning Watkins' algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning ODE analysis: change of variables q = Q∗(ς) The functional Q∗ maps cost functions to Q-functions: q(x, u) = ς(x, u) + β Σx′ Pu(x, x′) min u′ q(x′, u′) 16 / 31
  • 62. Zap Q-Learning Watkins' algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning ODE analysis: change of variables q = Q∗(ς) The functional Q∗ maps cost functions to Q-functions: q(x, u) = ς(x, u) + β Σx′ Pu(x, x′) min u′ q(x′, u′) ODE for Zap-Q: qt = Q∗(ςt), d/dt ςt = −ςt + c ⇒ convergence, optimal covariance, ... 16 / 31
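  The functional Q∗ used in this ODE analysis is an ordinary fixed-point map, so when the model is available it can be evaluated by value iteration. A minimal sketch, assuming the transition matrices are stored as an array P[u, x, x′]:

```python
import numpy as np

def Q_star(varsigma, P, beta=0.99, tol=1e-10, max_iters=100_000):
    """Map a cost function varsigma(x,u) to its Q-function:
    q(x,u) = varsigma(x,u) + beta * sum_x' P_u(x,x') * min_u' q(x',u')."""
    q = np.zeros_like(varsigma)                           # varsigma: (n_states, n_actions)
    for _ in range(max_iters):
        v = q.min(axis=1)                                 # min over u' of q(x', u')
        q_new = varsigma + beta * np.einsum('uxy,y->xu', P, v)
        if np.max(np.abs(q_new - q)) < tol:               # contraction for beta < 1
            break
        q = q_new
    return q
```

  In the ODE above, ςt solves d/dt ςt = −ςt + c, so ςt → c and hence qt = Q∗(ςt) → Q∗(c), the optimal Q-function.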
  • 63. Zap Q-Learning Watkins' algorithm Zap Q-Learning Example: Stochastic Shortest Path [Figure: six-node graph, nodes 1–6] 17 / 31
  • 64. Zap Q-Learning Watkins' algorithm Zap Q-Learning Example: Stochastic Shortest Path [Figure: six-node graph] Convergence with Zap gain γn = n−0.85 Watkins' algorithm has infinite asymptotic covariance with αn = 1/n [Figure: convergence of Zap-Q learning — Bellman error vs. n for Watkins, Speedy Q-learning, Polyak-Ruppert averaging, and Zap] Discount factor: β = 0.99 17 / 31
  • 65. Zap Q-Learning Watkins' algorithm Zap Q-Learning Example: Stochastic Shortest Path [Figure: six-node graph] Convergence with Zap gain γn = n−0.85 Watkins' algorithm has infinite asymptotic covariance with αn = 1/n [Figure: convergence of Zap-Q learning — Bellman error vs. n for Watkins, Speedy Q-learning, Polyak-Ruppert averaging, Zap, and Zap with γn = αn] Discount factor: β = 0.99 17 / 31
  • 66. Zap Q-Learning Watkins' algorithm Zap Q-Learning Example: Stochastic Shortest Path [Figure: six-node graph] Convergence with Zap gain γn = n−0.85 Watkins' algorithm has infinite asymptotic covariance with αn = 1/n Optimal scalar gain is approximately αn = 1500/n [Figure: convergence of Zap-Q learning — Bellman error vs. n for Watkins, Watkins with g = 1500, Speedy Q-learning, Polyak-Ruppert averaging, Zap, and Zap with γn = αn] Discount factor: β = 0.99 17 / 31
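  The scalar-gain comparison on this slide can be read through standard SA theory: with step size αn = g/n the asymptotic covariance is finite only when every eigenvalue of gA has real part < −1/2, and in that case it solves a Lyapunov equation. A hedged sketch of how a gain such as g ≈ 1500 might be screened, assuming estimates of A = A(θ∗) and of the noise covariance ΣΔ are available (they are not computed here):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def asymptotic_covariance(A, Sigma_Delta, g):
    """Asymptotic covariance of SA with step size g/n, or None if it is infinite."""
    F = g * A + 0.5 * np.eye(A.shape[0])
    if np.max(np.linalg.eigvals(F).real) >= 0:        # equivalent to Re(lambda(gA)) >= -1/2
        return None                                   # covariance is infinite for this gain
    # Solve F Sigma + Sigma F^T = -g^2 Sigma_Delta
    return solve_continuous_lyapunov(F, -g**2 * Sigma_Delta)
```

  Scanning g and keeping the value that minimizes, say, the trace of the returned covariance is one plausible way a value like g = 1500 could be selected.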
  • 67. Zap Q-Learning Watkins' algorithm Zap Q-Learning Optimize Walk to Cafe [Figure: six-node graph] Convergence with Zap gain γn = n−0.85 [Figure: theoretical vs. empirical pdfs (1000 trials) of Wn = √n θ̃n, entries #10 and #18, at n = 10^4 and n = 10^6] CLT gives good prediction of finite-n performance 18 / 31
  • 68. Zap Q-Learning Watkins' algorithm Zap Q-Learning Optimize Walk to Cafe [Figure: six-node graph] Convergence with Zap gain γn = n−0.85 [Figure: theoretical vs. empirical pdfs (1000 trials) of Wn = √n θ̃n, entries #10 and #18, at n = 10^4 and n = 10^6] CLT gives good prediction of finite-n performance Discount factor: β = 0.99 19 / 31
  • 69. Zap Q-Learning Watkins' algorithm Zap Q-Learning Optimize Walk to Cafe Local convergence: θ(0) initialized in a neighborhood of θ∗ [Figure: histograms of the Bellman error at n = 10^6 for Watkins with gains g = 500, 1500, 5000, Speedy Q-learning, Polyak-Ruppert averaging, and Zap-Q with γn ≡ αn and γn ≡ αn^0.85] 20 / 31
  • 70. Zap Q-Learning Watkins' algorithm Zap Q-Learning Optimize Walk to Cafe Local convergence: θ(0) initialized in a neighborhood of θ∗ [Figures: histograms of the Bellman error at n = 10^6, and 2σ confidence intervals vs. n (10^3 to 10^6), for Watkins with gains g = 500, 1500, 5000, Speedy Q-learning, Polyak-Ruppert averaging, and Zap-Q with γn ≡ αn and γn ≡ αn^0.85] 2σ confidence intervals for the Q-learning algorithms 20 / 31
  • 71. Zap Q-Learning Optimal stopping Zap Q-Learning Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance State space: R^100 Parameterized Q-function: Qθ with θ ∈ R^10 [Figure: real eigenvalues λi(A), i = 1, ..., 10, on a log scale] Every eigenvalue λ of A is real and satisfies λ > −1/2, so the asymptotic covariance is infinite 21 / 31
  • 72. Zap Q-Learning Optimal stopping Zap Q-Learning Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance State space: R^100 Parameterized Q-function: Qθ with θ ∈ R^10 [Figure: real eigenvalues λi(A), i = 1, ..., 10, on a log scale] Every eigenvalue λ of A is real and satisfies λ > −1/2, so the asymptotic covariance is infinite The authors observed slow convergence and proposed a matrix gain sequence {Gn} (see refs for details) 21 / 31
  • 73. Zap Q-Learning Optimal stopping Zap Q-Learning Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance State space: R^100 Parameterized Q-function: Qθ with θ ∈ R^10 [Figure: eigenvalues of A and of GA for the finance example] The favorite choice of gain in [23] barely meets the criterion Re(λ(GA)) < −1/2 21 / 31
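  For a fixed matrix gain G, the same finiteness criterion applies to GA. A minimal check of the condition quoted on this slide, assuming G and A (or estimates of them) are at hand:

```python
import numpy as np

def gain_meets_criterion(G, A):
    """True if every eigenvalue of G A has real part < -1/2 (finite asymptotic covariance)."""
    eigs = np.linalg.eigvals(G @ A)
    return bool(np.max(eigs.real) < -0.5), eigs
```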
  • 74. Zap Q-Learning Optimal stopping Zap Q-Learning Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance State space: R^100. Parameterized Q-function: Qθ with θ ∈ R^10 [Figure: theoretical vs. empirical pdfs (1000 trials) of Wn = √n θ̃n, entries #1 and #7, at n = 2 × 10^6, for Zap-Q and G-Q] 22 / 31
  • 75. Zap Q-Learning Optimal stopping Zap Q-Learning Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance State space: R^100. Parameterized Q-function: Qθ with θ ∈ R^10 Histograms of the average reward obtained using the different algorithms: [Figure: histograms at n = 2 × 10^4, 2 × 10^5, and 2 × 10^6 for G-Q(0) with g = 100 and g = 200, and Zap-Q with ρ = 0.8, 0.85, and 1.0] 23 / 31
  • 76. Conclusions & Future Work Conclusions & Future Work Conclusions Reinforcement Learning is not just cursed by dimension, but also by variance We need better design tools to improve performance 24 / 31
  • 77. Conclusions & Future Work Conclusions & Future Work Conclusions Reinforcement Learning is not just cursed by dimension, but also by variance We need better design tools to improve performance The asymptotic covariance is an awesome design tool. It is also predictive of finite-n performance. Example: g∗ = 1500 was chosen based on asymptotic covariance 24 / 31
  • 78. Conclusions & Future Work Conclusions & Future Work Conclusions Reinforcement Learning is not just cursed by dimension, but also by variance We need better design tools to improve performance The asymptotic covariance is an awesome design tool. It is also predictive of finite-n performance. Example: g∗ = 1500 was chosen based on asymptotic covariance Future work: Q-learning with function-approximation Obtain conditions for a stable algorithm in a general setting Optimal stopping time problems Adaptive optimization of algorithm parameters 24 / 31
  • 79. Conclusions & Future Work Conclusions & Future Work Conclusions Reinforcement Learning is not just cursed by dimension, but also by variance We need better design tools to improve performance The asymptotic covariance is an awesome design tool. It is also predictive of finite-n performance. Example: g∗ = 1500 was chosen based on asymptotic covariance Future work: Q-learning with function-approximation Obtain conditions for a stable algorithm in a general setting Optimal stopping time problems Adaptive optimization of algorithm parameters Finite-time analysis 24 / 31
  • 80. Conclusions & Future Work Thank you! 25 / 31
  • 81. References Control Techniques for Complex Networks, Sean Meyn. Pre-publication version for on-line viewing; monograph available for purchase at your favorite retailer. More information available at http://www.cambridge.org/us/catalogue/catalogue.asp?isbn=9780521884419 Markov Chains and Stochastic Stability, S. P. Meyn and R. L. Tweedie. August 2008 pre-publication version for on-line viewing; monograph published February 2009. References 26 / 31
  • 82. References This lecture A. M. Devraj and S. P. Meyn, Zap Q-learning. Advances in Neural Information Processing Systems (NIPS). Dec. 2017. A. M. Devraj and S. P. Meyn, Fastest convergence for Q-learning. Available on ArXiv. Jul. 2017. 27 / 31
  • 83. References Selected References I [1] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv, July 2017. [2] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1990. Translated from the French by Stephen S. Wilson. [3] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency and Cambridge University Press (jointly), Delhi, India and Cambridge, UK, 2008. [4] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000. [5] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge University Press, Cambridge, second edition, 2009. Published in the Cambridge Mathematical Library. [6] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007. See last chapter on simulation and average-cost TD learning 28 / 31
  • 84. References Selected References II [7] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure. The Annals of Statistics, 13(1):236–245, 1985. [8] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro processes. Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research and Industrial Engineering, Ithaca, NY, 1988. [9] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i Telemekhanika (in Russian); translated in Automat. Remote Control, 51 (1991), pages 98–107, 1990. [10] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, 1992. [11] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. Ann. Appl. Probab., 14(2):796–819, 2004. [12] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems 24, pages 451–459. Curran Associates, Inc., 2011. 29 / 31
  • 85. References Selected References III [13] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010. [14] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, Cambridge, UK, 1989. [15] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992. [16] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn., 3(1):9–44, 1988. [17] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997. [18] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th Internat. Conf. on Neural Info. Proc. Systems, pages 1064–1070. MIT Press, 1997. [19] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In Advances in Neural Information Processing Systems, 2011. [20] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25, 2003. 30 / 31
  • 86. References Selected References IV [21] D. Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana. Feature selection for neuro-dynamic programming. In F. Lewis, editor, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. Wiley, 2011. [22] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999. [23] D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning. Discrete Event Dynamic Systems: Theory and Applications, 16(2):207–239, 2006. [24] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Mach. Learn., 22(1-3):33–57, 1996. [25] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn., 49(2-3):233–246, 2002. [26] A. Nedic and D. Bertsekas. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dyn. Systems: Theory and Appl., 13(1-2):79–110, 2003. [27] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In IEEE Conference on Decision and Control, pages 3598–3605, Dec. 2009. 31 / 31