Zap Q-Learning
Reinforcement Learning: Hidden Theory, and New Super-Fast Algorithms
Center for Systems and Control (CSC@USC)
and Ming Hsieh Institute for Electrical Engineering
February 21, 2018
Adithya M. Devraj Sean P. Meyn
Department of Electrical and Computer Engineering — University of Florida
Zap Q-Learning
Outline
1 Stochastic Approximation
2 Fastest Stochastic Approximation
3 Reinforcement Learning
4 Zap Q-Learning
5 Conclusions & Future Work
6 References
Stochastic Approximation
E[f(θ, W)] |θ=θ∗ = 0
Stochastic Approximation Basic Algorithm
What is Stochastic Approximation?
A simple goal: Find the solution θ∗ to
¯f(θ∗) := E[f(θ, W)] |θ=θ∗ = 0
What makes this hard?
1 The function f and the distribution of the random vector W may not
be known
– we may only know something about the structure of the problem
2 Even if everything is known, computation of the expectation may be
expensive. For root finding, we may need to compute the expectation
for many values of θ
3 Motivates stochastic approximation: θ(n + 1) = θ(n) + αnf(θ(n), W(n))
The recursive algorithms we come up with are often slow, and their
variance may be infinite: typical in Q-learning [Devraj & M 2017]
1 / 31
Stochastic Approximation ODE Method
Algorithm and Convergence Analysis
Algorithm:
θ(n + 1) = θ(n) + αnf(θ(n), W(n))
Goal: ¯f(θ∗) := E[f(θ, W)] |θ=θ∗ = 0
Interpretation: θ∗ ≡ stationary point of the ODE  d/dt ϑ(t) = ¯f(ϑ(t))
Analysis: Stability of the ODE ⊕ (see Borkar’s monograph) =⇒ lim_{n→∞} θ(n) = θ∗
2 / 31
Stochastic Approximation SA Example
Stochastic Approximation Example
Example: Monte-Carlo
Monte-Carlo Estimation
Estimate the mean η = E[c(X)], where X is a random variable: η = ∫ c(x) fX(x) dx
SA interpretation: Find θ∗ solving 0 = E[f(θ, X)] = E[c(X) − θ]
Algorithm: θ(n) = (1/n) Σ_{i=1}^{n} c(X(i))
=⇒ (n + 1)θ(n + 1) = Σ_{i=1}^{n+1} c(X(i)) = nθ(n) + c(X(n + 1))
=⇒ (n + 1)θ(n + 1) = (n + 1)θ(n) + [c(X(n + 1)) − θ(n)]
SA Recursion: θ(n + 1) = θ(n) + αnf(θ(n), X(n + 1)),  with Σ αn = ∞, Σ α²n < ∞
3 / 31
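A minimal numerical sketch of this recursion (an editorial illustration, not from the slides; plain Python with NumPy, and c(x) = x with X uniform on [0, 2] as arbitrary choices):

import numpy as np

# SA form of the Monte-Carlo average: theta(n+1) = theta(n) + alpha_n * (c(X(n+1)) - theta(n))
rng = np.random.default_rng(0)
theta = 0.0
for n in range(10_000):
    X = rng.uniform(0.0, 2.0)      # illustrative choice of the random variable X
    alpha = 1.0 / (n + 1)          # satisfies  sum alpha_n = infinity,  sum alpha_n^2 < infinity
    theta += alpha * (X - theta)   # f(theta, X) = c(X) - theta  with  c(x) = x
print(theta)                       # close to E[c(X)] = 1.0

With αn = 1/(n + 1) the recursion reproduces the running sample average exactly.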
(Figure: SA estimates θ(k) versus iteration k)
Fastest Stochastic Approximation
Fastest Stochastic Approximation Algorithm Performance
Performance Criteria
Two standard approaches to evaluate performance, ˜θ(n) := θ(n) − θ∗:
1 Finite-n bound: P{ ‖˜θ(n)‖ ≥ ε } ≤ exp(−I(ε, n)),  I(ε, n) = O(nε²)
2 Asymptotic covariance: Σ = lim_{n→∞} nE[˜θ(n)˜θ(n)ᵀ],  √n ˜θ(n) ≈ N(0, Σ)
4 / 31
Fastest Stochastic Approximation Algorithm Performance
Asymptotic Covariance
Σ = lim_{n→∞} Σn = lim_{n→∞} nE[˜θ(n)˜θ(n)ᵀ],  √n ˜θ(n) ≈ N(0, Σ)
SA recursion for covariance:
Σn+1 ≈ Σn + (1/n)[(A + ½I)Σn + Σn(A + ½I)ᵀ + Σ∆],  A = d/dθ ¯f(θ∗)
Conclusions
1 If Re λ(A) ≥ −½ for some eigenvalue, then Σ is (typically) infinite
2 If Re λ(A) < −½ for all eigenvalues, then Σ = lim_{n→∞} Σn is the unique solution to the Lyapunov equation:
0 = (A + ½I)Σ + Σ(A + ½I)ᵀ + Σ∆
5 / 31
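As a concrete check of the Lyapunov equation above, here is a small sketch (editorial, not from the slides; Python/NumPy) that solves 0 = (A + ½I)Σ + Σ(A + ½I)ᵀ + Σ∆ by vectorization, assuming every eigenvalue of A has real part < −1/2:

import numpy as np

def asymptotic_covariance(A, Sigma_Delta):
    # Solve (A + I/2) Sigma + Sigma (A + I/2)^T + Sigma_Delta = 0 for Sigma.
    d = A.shape[0]
    M = A + 0.5 * np.eye(d)
    # vec(M X + X M^T) = (kron(I, M) + kron(M, I)) vec(X)
    L = np.kron(np.eye(d), M) + np.kron(M, np.eye(d))
    return np.linalg.solve(L, -Sigma_Delta.reshape(-1)).reshape(d, d)

# Scalar sanity check: A = -g gives Sigma = Sigma_Delta / (2g - 1); here g = 2.
print(asymptotic_covariance(np.array([[-2.0]]), np.array([[1.0]])))   # [[1/3]]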
Fastest Stochastic Approximation Stochastic Newton Raphson
Optimal Asymptotic Covariance
Introduce a d × d matrix gain sequence {Gn}:
θ(n + 1) = θ(n) + (1/(n + 1)) Gn f(θ(n), X(n))
Assume it converges, and linearize:
˜θ(n + 1) ≈ ˜θ(n) + (1/(n + 1)) G [A˜θ(n) + ∆(n + 1)],  A = d/dθ ¯f(θ∗).
If G = G∗ := −A−1 then
Resembles Monte-Carlo estimate
Resembles Newton-Raphson
It is optimal: Σ∗ = G∗Σ∆G∗ᵀ ≤ ΣG for any other G
Polyak-Ruppert averaging is also optimal, but the first two bullets are missing.
6 / 31
Fastest Stochastic Approximation Stochastic Newton Raphson
Optimal Variance
Example: return to Monte-Carlo
θ(n + 1) = θ(n) + (g/(n + 1)) [−θ(n) + X(n + 1)]
∆(n) = X(n) − E[X(n)]
Normalization for analysis:
˜θ(n + 1) = ˜θ(n) + (g/(n + 1)) [−˜θ(n) + ∆(n + 1)]
Example: X(n) = W²(n), W ∼ N(0, 1)
Asymptotic variance as a function of g:  Σ = (σ²∆ / 2) · g² / (g − 1/2)
(Figure: SA estimates of E[W²], W ∼ N(0, 1), for gains g = 0.1, 0.5, 1, 10, 20)
7 / 31
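A small simulation sketch of this scalar example (editorial, not from the slides; Python/NumPy). For g > 1/2 the theory above gives Σ = σ²∆ g²/(2g − 1) with σ²∆ = Var(W²) = 2, minimized at the Newton-Raphson gain g = 1:

import numpy as np

# SA estimates of E[W^2], W ~ N(0,1):  theta(n+1) = theta(n) + (g/(n+1)) * (W(n+1)^2 - theta(n))
rng = np.random.default_rng(1)
N, trials = 20_000, 500
for g in [0.1, 0.5, 1.0, 10.0]:
    theta = np.zeros(trials)                 # one SA run per trial, vectorized across trials
    for n in range(N):
        X = rng.standard_normal(trials) ** 2
        theta += (g / (n + 1)) * (X - theta)
    # For g > 1/2, n * Var(theta(n)) should be near 2 g^2 / (2g - 1);
    # for g <= 1/2 the asymptotic variance is infinite and convergence is slow.
    print(g, N * theta.var())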
Fastest Stochastic Approximation Stochastic Newton Raphson
Optimal Asymptotic Covariance and Zap-SNR
Zap-SNR (designed to emulate deterministic Newton-Raphson)
θ(n + 1) = θ(n) + αn(−Ân)−1 f(θ(n), X(n))
Ân = Ân−1 + γn(An − Ân−1),  An = d/dθ f(θ(n), X(n))
Ân ≈ A(θn) := d/dθ ¯f(θn) requires high gain:  γn/αn → ∞ as n → ∞
Always: αn = 1/n. Numerics that follow: γn = (1/n)ρ, ρ ∈ (0.5, 1)
ODE for Zap-SNR
d/dt xt = (−A(xt))−1 ¯f(xt),  A(x) = d/dx ¯f(x)
Not necessarily stable (just like deterministic Newton-Raphson)
General conditions for convergence are open
8 / 31
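A minimal sketch of the two-time-scale recursion (editorial, not from the slides; Python/NumPy) on an arbitrary scalar problem f(θ, W) = W − θ³ with W ∼ N(2, 1), so ¯f(θ) = 2 − θ³ and θ∗ = 2^(1/3):

import numpy as np

rng = np.random.default_rng(2)
theta, A_hat = 1.0, -1.0                  # A_hat is the (here scalar) estimate of A(theta)
N, rho = 100_000, 0.85
for n in range(1, N + 1):
    W = 2.0 + rng.standard_normal()
    f_sample = W - theta ** 3             # noisy observation of fbar(theta)
    A_sample = -3.0 * theta ** 2          # sample derivative d/dtheta f(theta, W)
    gamma, alpha = n ** (-rho), 1.0 / n   # gamma_n / alpha_n -> infinity: high-gain estimate
    A_hat += gamma * (A_sample - A_hat)   # fast time scale: A_hat tracks A(theta_n)
    theta += alpha * f_sample / (-A_hat)  # slow time scale: Newton-Raphson-like step
print(theta, 2 ** (1 / 3))                # theta is close to theta* = 1.2599...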
Reinforcement Learning
and Stochastic Approximation
Reinforcement Learning RL & SA
SA and RL Design
Functional equations in Stochastic Control
Always of the form
0 = E[F(h∗, Φ(n + 1)) | Φ0 . . . Φ(n)],  h∗ = ?
Φ(n) = (state, action)
Galerkin relaxation:
0 = E[F(hθ∗, Φ(n + 1)) ζn],  θ∗ = ?
Necessary Ingredients:
Parameterized family {hθ : θ ∈ Rd}
Adapted, d-dimensional stochastic process {ζn}
Examples are TD- and Q-Learning
These algorithms are thus special cases of stochastic approximation (as we all know)
9 / 31
Reinforcement Learning MDP Theory
Stochastic Optimal Control
MDP Model
X is a stationary controlled Markov chain, with input U
For all states x and sets A,
P{X(n + 1) ∈ A | X(n) = x, U(n) = u, and prior history} = Pu(x, A)
c: X × U → R is a cost function
β < 1 a discount factor
Value function:
h∗(x) = min_U Σ_{n=0}^{∞} βn E[c(X(n), U(n)) | X(0) = x]
Bellman equation:
h∗(x) = min_u {c(x, u) + βE[h∗(X(n + 1)) | X(n) = x, U(n) = u]}
10 / 31
Reinforcement Learning Q-Learning
Q-function
Trick to swap expectation and minimum
Bellman equation:
h∗(x) = min_u {c(x, u) + βE[h∗(X(n + 1)) | X(n) = x, U(n) = u]}
Q-function:
Q∗(x, u) := c(x, u) + βE[h∗(X(n + 1)) | X(n) = x, U(n) = u]
h∗(x) = min_u Q∗(x, u)
Another Bellman equation:
Q∗(x, u) = c(x, u) + βE[Q∗(X(n + 1)) | X(n) = x, U(n) = u],  where Q∗(x) := min_u Q∗(x, u)
11 / 31
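When the model is known, this last fixed-point equation can be solved by successive approximation; a small sketch (editorial, not from the slides; Python/NumPy, with a randomly generated MDP standing in for a real model):

import numpy as np

# Q-value iteration:  (T q)(x,u) = c(x,u) + beta * sum_x' P_u(x,x') * min_u' q(x',u')
# T is a beta-contraction, so the iterates converge to Q* from any initialization.
rng = np.random.default_rng(3)
n_x, n_u, beta = 6, 2, 0.99
P = rng.random((n_u, n_x, n_x)); P /= P.sum(axis=2, keepdims=True)  # P[u, x, x']
c = rng.random((n_x, n_u))                                          # cost c(x, u)
Q = np.zeros((n_x, n_u))
for _ in range(2000):
    Q = c + beta * np.einsum('uxy,y->xu', P, Q.min(axis=1))
print(Q.min(axis=1))   # value function h*(x) = min_u Q*(x, u)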
Reinforcement Learning Q-Learning
Q-Learning and Galerkin Relaxation
Dynamic programming
Find function Q∗ that solves
E[c(X(n), U(n)) + βQ∗(X(n + 1)) − Q∗(X(n), U(n)) | Fn] = 0
That is,
0 = E[F(Q∗, Φ(n + 1)) | Φ0 . . . Φ(n)],  with Φ(n + 1) = (X(n + 1), X(n), U(n)).
Q-Learning
Find θ∗ that solves
E[(c(X(n), U(n)) + βQθ∗(X(n + 1)) − Qθ∗(X(n), U(n))) ζn] = 0
The family {Qθ} and eligibility vectors {ζn} are part of algorithm design.
12 / 31
Reinforcement Learning Q-Learning
Watkins’ Q-learning
Big Question: Can we Zap Q-Learning?
Find θ∗ that solves
E[(c(X(n), U(n)) + βQθ∗(X(n + 1)) − Qθ∗(X(n), U(n))) ζn] = 0
Watkins’ algorithm is Stochastic Approximation
The family {Qθ} and eligibility vectors {ζn} in this design:
Linearly parameterized family of functions: Qθ(x, u) = θT ψ(x, u)
ζn ≡ ψ(Xn, Un) and ψn(x, u) = 1{x = xn, u = un} (complete basis)
Asymptotic covariance is typically infinite
13 / 31
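A minimal sketch of the tabular algorithm (editorial, not from the slides; Python/NumPy, with a randomly generated MDP and uniformly randomized actions; the scalar gain g in αn = g/n is the design choice whose covariance consequences the slides discuss — the stochastic-shortest-path example later uses g = 1500):

import numpy as np

rng = np.random.default_rng(4)
n_x, n_u, beta, g = 6, 2, 0.99, 1.0
P = rng.random((n_u, n_x, n_x)); P /= P.sum(axis=2, keepdims=True)  # illustrative MDP
c = rng.random((n_x, n_u))
Q = np.zeros((n_x, n_u))                       # complete basis: one parameter per (x, u)
x = 0
for n in range(1, 200_000):
    u = rng.integers(n_u)                      # randomized exploration policy
    x_next = rng.choice(n_x, p=P[u, x])
    td = c[x, u] + beta * Q[x_next].min() - Q[x, u]   # temporal-difference term
    Q[x, u] += (g / n) * td                    # Watkins' update = SA step with zeta_n = psi(x, u)
    x = x_next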
Zap Q-Learning
(Figure: Bellman error vs. iteration n — Watkins, Speedy Q-learning, and Polyak-Ruppert averaging compared with Zap Q-learning)
Zap Q-Learning
Asymptotic Covariance of Watkins’ Q-Learning
Improvements are needed!
(Figure: six-state network example)
(Figure: histogram of the parameter estimate θn(15) after n = 10^6 iterations; the true value θ∗(15) ≈ 486.6)
Example from Devraj & M 2017
14 / 31
Zap Q-Learning Watkins’ algorithm
Zap Q-learning
Zap Q-Learning ≡ Zap-SNR for Q-Learning
0 = ¯f(θ) = E[f(θ, W(n))] := E[ζn (c(X(n), U(n)) + βQθ(X(n + 1)) − Qθ(X(n), U(n)))]
A(θ) = d/dθ ¯f(θ);  At points of differentiability:
A(θ) = E[ζn (βψ(X(n + 1), φθ(X(n + 1))) − ψ(X(n), U(n)))T]
φθ(X(n + 1)) := arg min_u Qθ(X(n + 1), u)
Algorithm:
θ(n + 1) = θ(n) + αn(−Ân)−1 f(θ(n), Φ(n));  Ân = Ân−1 + γn(An − Ân−1);
An+1 := d/dθ f(θn, Φ(n)) = ζn (βψ(X(n + 1), φθn(X(n + 1))) − ψ(X(n), U(n)))T
15 / 31
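A minimal sketch of the matrix-gain recursion in the complete-basis (tabular) case (editorial, not from the slides; Python/NumPy with a randomly generated MDP; the pseudo-inverse in place of (−Ân)−1 is an extra safeguard for the early, low-rank iterates):

import numpy as np

rng = np.random.default_rng(5)
n_x, n_u, beta, rho = 6, 2, 0.99, 0.85
P = rng.random((n_u, n_x, n_x)); P /= P.sum(axis=2, keepdims=True)  # illustrative MDP
cost = rng.random((n_x, n_u))
d = n_x * n_u
theta = np.zeros(d)                            # Q_theta(x,u) = theta[x*n_u + u] (complete basis)
A_hat = -np.eye(d)                             # matrix estimate of A(theta)
x = 0
for n in range(1, 100_000):
    u = rng.integers(n_u)
    x_next = rng.choice(n_x, p=P[u, x])
    Q = theta.reshape(n_x, n_u)
    u_next = Q[x_next].argmin()                # phi_theta(x') = argmin_u Q_theta(x', u)
    zeta = np.zeros(d); zeta[x * n_u + u] = 1.0                 # psi(X(n), U(n))
    psi_next = np.zeros(d); psi_next[x_next * n_u + u_next] = 1.0
    f = zeta * (cost[x, u] + beta * Q[x_next, u_next] - Q[x, u])
    A_sample = np.outer(zeta, beta * psi_next - zeta)           # sample of A(theta), per the slide
    gamma, alpha = n ** (-rho), 1.0 / n
    A_hat += gamma * (A_sample - A_hat)        # fast time scale: matrix gain estimate
    theta += alpha * (np.linalg.pinv(-A_hat) @ f)               # slow time scale: Zap step
    x = x_next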
Zap Q-Learning Watkins’ algorithm
Zap Q-learning
Zap Q-Learning ≡ Zap-SNR for Q-Learning
ODE Analysis: change of variables q = Q∗(ς)
Functional Q∗ maps cost functions to Q-functions:
q(x, u) = ς(x, u) + β Σ_{x'} Pu(x, x') min_{u'} q(x', u')
ODE for Zap-Q
qt = Q∗(ςt),  d/dt ςt = −ςt + c  ⇒ convergence, optimal covariance, ...
16 / 31
Zap Q-Learning Watkins’ algorithm
Zap Q-Learning
Example: Stochastic Shortest Path
(Figure: six-state network example)
Convergence with Zap gain γn = n−0.85
Watkins’ algorithm has infinite asymptotic covariance with αn = 1/n
Optimal scalar gain is approximately αn = 1500/n
(Figure: Convergence of Zap-Q learning — Bellman error vs. n for Watkins with g = 1500, Speedy Q-learning, Polyak-Ruppert averaging, and Zap with γn = αn)
Discount factor: β = 0.99
17 / 31
Zap Q-Learning Watkins’ algorithm
Zap Q-Learning
Optimize Walk to Cafe
(Figure: six-state network example)
Convergence with Zap gain γn = n−0.85
(Figure: theoretical vs. experimental pdfs of Wn = √n ˜θn for entries #10 and #18, at n = 10^4 and n = 10^6; empirical over 1000 trials)
CLT gives good prediction of finite-n performance
Discount factor: β = 0.99
18 / 31
Zap Q-Learning Watkins’ algorithm
Zap Q-Learning
Optimize Walk to Cafe
Local Convergence: θ(0) initialized in a neighborhood of θ∗
(Figure: Bellman error vs. n, with histograms at n = 10^6 — Watkins with scalar gains g = 500, 1500, 5000, Speedy Q-learning, Polyak-Ruppert averaging, and Zap-Q with γn ≡ αn and γn ≡ αn^0.85)
2σ confidence intervals for the Q-learning algorithms
20 / 31
Zap Q-Learning Optimal stopping
Zap Q-Learning
Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance
State space: R^100. Parameterized Q-function: Qθ with θ ∈ R^10
(Figure: eigenvalues λi(A) — every eigenvalue is real and satisfies λ > −1/2, so the asymptotic covariance is infinite)
Authors observed slow convergence; proposed a matrix gain sequence {Gn} (see refs for details)
(Figure: eigenvalues of A and GA for the finance example)
Favorite choice of gain in [23] barely meets the criterion Re(λ(GA)) < −1/2
21 / 31
Zap Q-Learning Optimal stopping
Zap Q-Learning
Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance
State space: R^100. Parameterized Q-function: Qθ with θ ∈ R^10
(Figure: theoretical vs. experimental pdfs of Wn = √n ˜θn for Zap-Q and G-Q, entries #1 and #7, at n = 2 × 10^6; empirical over 1000 trials)
22 / 31
Zap Q-Learning Optimal stopping
Zap Q-Learning
Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance
State space: R^100. Parameterized Q-function: Qθ with θ ∈ R^10
Histograms of the average reward obtained using the different algorithms:
(Figure: histograms of average reward at n = 2 × 10^4, 2 × 10^5, and 2 × 10^6, for G-Q(0) with g = 100 and g = 200, and Zap-Q with ρ = 0.8, 0.85, and 1.0)
23 / 31
Conclusions & Future Work
Conclusions
Reinforcement Learning is not just cursed by dimension, but also by variance
We need better design tools to improve performance
The asymptotic covariance is an awesome design tool. It is also predictive of finite-n performance.
Example: g∗ = 1500 was chosen based on asymptotic covariance
Future work:
Q-learning with function-approximation
Obtain conditions for a stable algorithm in a general setting
Optimal stopping time problems
Adaptive optimization of algorithm parameters
Finite-time analysis
24 / 31
Conclusions & Future Work
Thank you!
25 / 31
References
(Cover images: Control Techniques for Complex Networks, Sean Meyn; Markov Chains and Stochastic Stability, S. P. Meyn and R. L. Tweedie)
References
26 / 31
References
This lecture
A. M. Devraj and S. P. Meyn, Zap Q-learning. Advances in Neural
Information Processing Systems (NIPS). Dec. 2017.
A. M. Devraj and S. P. Meyn, Fastest convergence for Q-learning. Available
on ArXiv. Jul. 2017.
27 / 31
References
Selected References I
[1] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv , July 2017.
[2] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic
approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag,
Berlin, 1990. Translated from the French by Stephen S. Wilson.
[3] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan
Book Agency and Cambridge University Press (jointly), Delhi, India and Cambridge, UK,
2008.
[4] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic
approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000.
[5] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge
University Press, Cambridge, second edition, 2009. Published in the Cambridge
Mathematical Library.
[6] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007.
See last chapter on simulation and average-cost TD learning
28 / 31
References
Selected References II
[7] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure.
The Annals of Statistics, 13(1):236–245, 1985.
[8] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro processes.
Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research
and Industrial Engineering, Ithaca, NY, 1988.
[9] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i
telemekhanika (in Russian). translated in Automat. Remote Control, 51 (1991), pages
98–107, 1990.
[10] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging.
SIAM J. Control Optim., 30(4):838–855, 1992.
[11] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic
approximation. Ann. Appl. Probab., 14(2):796–819, 2004.
[12] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation
algorithms for machine learning. In Advances in Neural Information Processing Systems
24, pages 451–459. Curran Associates, Inc., 2011.
29 / 31
References
Selected References III
[13] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial
Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
[14] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College,
Cambridge, Cambridge, UK, 1989.
[15] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
[16] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn.,
3(1):9–44, 1988.
[17] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function
approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.
[18] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th
Internat. Conf. on Neural Info. Proc. Systems, pages 1064–1070. MIT Press, 1997.
[19] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In
Advances in Neural Information Processing Systems, 2011.
[20] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning
Research, 5(Dec):1–25, 2003.
30 / 31
References
Selected References IV
[21] D. Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana. Feature selection for
neuro-dynamic programming. In F. Lewis, editor, Reinforcement Learning and
Approximate Dynamic Programming for Feedback Control. Wiley, 2011.
[22] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space
theory, approximation algorithms, and an application to pricing high-dimensional financial
derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999.
[23] D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and
efficient temporal-difference learning. Discrete Event Dynamic Systems: Theory and
Applications, 16(2):207–239, 2006.
[24] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference
learning. Mach. Learn., 22(1-3):33–57, 1996.
[25] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn.,
49(2-3):233–246, 2002.
[26] A. Nedic and D. Bertsekas. Least squares policy evaluation algorithms with linear function
approximation. Discrete Event Dyn. Systems: Theory and Appl., 13(1-2):79–110, 2003.
[27] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In IEEE
Conference on Decision and Control, pages 3598–3605, Dec. 2009.
31 / 31

More Related Content

PDF
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기
PDF
NGSを用いたジェノタイピングを様々な解析に用いるには?
PDF
PR 127: FaceNet
PDF
Why Batch Normalization Works so Well
PDF
Faster R-CNN: Towards real-time object detection with region proposal network...
PDF
Introduction to A3C model
PDF
Distributed deep learning
PPT
AlphaGo Zero 解説
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기
NGSを用いたジェノタイピングを様々な解析に用いるには?
PR 127: FaceNet
Why Batch Normalization Works so Well
Faster R-CNN: Towards real-time object detection with region proposal network...
Introduction to A3C model
Distributed deep learning
AlphaGo Zero 解説

What's hot (20)

PPTX
3D Gaussian Splatting
PDF
画像認識のための深層学習
PDF
Single Shot Multibox Detector
PPTX
[DL輪読会]Approximating CNNs with Bag-of-local-Features models works surprisingl...
PPTX
Faster rcnn
PDF
直交領域探索
PDF
高速フーリエ変換
PDF
Introduction of Faster R-CNN
PDF
Deep Learning for Computer Vision: Object Detection (UPC 2016)
PDF
An introduction to deep reinforcement learning
PDF
Object Detection Using R-CNN Deep Learning Framework
PPTX
Deep Reinforcement Learning
PPTX
Convolutional neural network from VGG to DenseNet
PDF
Kaggleのテクニック
PDF
画像キャプションと動作認識の最前線 〜データセットに注目して〜(第17回ステアラボ人工知能セミナー)
PDF
Actor critic algorithm
PDF
アドテクにおけるBandit Algorithmの活用
PPTX
Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recog...
PPTX
AlexNet, VGG, GoogleNet, Resnet
PDF
Brief introduction on GAN
3D Gaussian Splatting
画像認識のための深層学習
Single Shot Multibox Detector
[DL輪読会]Approximating CNNs with Bag-of-local-Features models works surprisingl...
Faster rcnn
直交領域探索
高速フーリエ変換
Introduction of Faster R-CNN
Deep Learning for Computer Vision: Object Detection (UPC 2016)
An introduction to deep reinforcement learning
Object Detection Using R-CNN Deep Learning Framework
Deep Reinforcement Learning
Convolutional neural network from VGG to DenseNet
Kaggleのテクニック
画像キャプションと動作認識の最前線 〜データセットに注目して〜(第17回ステアラボ人工知能セミナー)
Actor critic algorithm
アドテクにおけるBandit Algorithmの活用
Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recog...
AlexNet, VGG, GoogleNet, Resnet
Brief introduction on GAN
Ad

Similar to Introducing Zap Q-Learning (20)

PDF
Zap Q-Learning - ISMP 2018
PDF
Reinforcement Learning: Hidden Theory and New Super-Fast Algorithms
PDF
DeepLearn2022 2. Variance Matters
PDF
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
PDF
QMC: Operator Splitting Workshop, Stochastic Block-Coordinate Fixed Point Alg...
PDF
A Stochastic Iteration Method for A Class of Monotone Variational Inequalitie...
PDF
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
PDF
MUMS: Bayesian, Fiducial, and Frequentist Conference - Coverage of Credible I...
PDF
QMC: Transition Workshop - Approximating Multivariate Functions When Function...
PDF
Stochastic optimal control &amp; rl
PDF
2019 PMED Spring Course - SMARTs-Part II - Eric Laber, April 10, 2019
PDF
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...
PDF
Minimax statistical learning with Wasserstein distances (NeurIPS2018 Reading ...
PDF
Stochastic Approximation and Simulated Annealing
PDF
Overview of Stochastic Calculus Foundations
PDF
Asymptotics of ABC, lecture, Collège de France
PDF
Statistical Inference Using Stochastic Gradient Descent
PDF
Statistical Inference Using Stochastic Gradient Descent
PDF
KAUST_talk_short.pdf
PDF
Lec_13.pdf
Zap Q-Learning - ISMP 2018
Reinforcement Learning: Hidden Theory and New Super-Fast Algorithms
DeepLearn2022 2. Variance Matters
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
QMC: Operator Splitting Workshop, Stochastic Block-Coordinate Fixed Point Alg...
A Stochastic Iteration Method for A Class of Monotone Variational Inequalitie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Coverage of Credible I...
QMC: Transition Workshop - Approximating Multivariate Functions When Function...
Stochastic optimal control &amp; rl
2019 PMED Spring Course - SMARTs-Part II - Eric Laber, April 10, 2019
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...
Minimax statistical learning with Wasserstein distances (NeurIPS2018 Reading ...
Stochastic Approximation and Simulated Annealing
Overview of Stochastic Calculus Foundations
Asymptotics of ABC, lecture, Collège de France
Statistical Inference Using Stochastic Gradient Descent
Statistical Inference Using Stochastic Gradient Descent
KAUST_talk_short.pdf
Lec_13.pdf
Ad

More from Sean Meyn (20)

PDF
DeepLearn2022 1. Goals & AlgorithmDesign.pdf
PDF
DeepLearn2022 3. TD and Q Learning
PDF
Smart Grid Tutorial - January 2019
PDF
State Space Collapse in Resource Allocation for Demand Dispatch - May 2019
PDF
Irrational Agents and the Power Grid
PDF
State estimation and Mean-Field Control with application to demand dispatch
PDF
Demand-Side Flexibility for Reliable Ancillary Services
PDF
Spectral Decomposition of Demand-Side Flexibility for Reliable Ancillary Serv...
PDF
Demand-Side Flexibility for Reliable Ancillary Services in a Smart Grid: Elim...
PDF
Why Do We Ignore Risk in Power Economics?
PDF
Distributed Randomized Control for Ancillary Service to the Power Grid
PDF
Ancillary service to the grid from deferrable loads: the case for intelligent...
PDF
2012 Tutorial: Markets for Differentiated Electric Power Products
PDF
Control Techniques for Complex Systems
PDF
Tutorial for Energy Systems Week - Cambridge 2010
PDF
Panel Lecture for Energy Systems Week
PDF
The Value of Volatile Resources... Caltech, May 6 2010
PDF
Approximate dynamic programming using fluid and diffusion approximations with...
PDF
Anomaly Detection Using Projective Markov Models
PDF
Markov Tutorial CDC Shanghai 2009
DeepLearn2022 1. Goals & AlgorithmDesign.pdf
DeepLearn2022 3. TD and Q Learning
Smart Grid Tutorial - January 2019
State Space Collapse in Resource Allocation for Demand Dispatch - May 2019
Irrational Agents and the Power Grid
State estimation and Mean-Field Control with application to demand dispatch
Demand-Side Flexibility for Reliable Ancillary Services
Spectral Decomposition of Demand-Side Flexibility for Reliable Ancillary Serv...
Demand-Side Flexibility for Reliable Ancillary Services in a Smart Grid: Elim...
Why Do We Ignore Risk in Power Economics?
Distributed Randomized Control for Ancillary Service to the Power Grid
Ancillary service to the grid from deferrable loads: the case for intelligent...
2012 Tutorial: Markets for Differentiated Electric Power Products
Control Techniques for Complex Systems
Tutorial for Energy Systems Week - Cambridge 2010
Panel Lecture for Energy Systems Week
The Value of Volatile Resources... Caltech, May 6 2010
Approximate dynamic programming using fluid and diffusion approximations with...
Anomaly Detection Using Projective Markov Models
Markov Tutorial CDC Shanghai 2009

Recently uploaded (20)

PPTX
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
PPTX
climate analysis of Dhaka ,Banglades.pptx
PPTX
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
PPT
ISS -ESG Data flows What is ESG and HowHow
PPTX
Qualitative Qantitative and Mixed Methods.pptx
PDF
annual-report-2024-2025 original latest.
PPTX
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
PPTX
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
PPTX
oil_refinery_comprehensive_20250804084928 (1).pptx
PDF
Business Analytics and business intelligence.pdf
PDF
Foundation of Data Science unit number two notes
PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
PDF
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
PPTX
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PDF
.pdf is not working space design for the following data for the following dat...
PDF
Clinical guidelines as a resource for EBP(1).pdf
PDF
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
PDF
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
climate analysis of Dhaka ,Banglades.pptx
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
ISS -ESG Data flows What is ESG and HowHow
Qualitative Qantitative and Mixed Methods.pptx
annual-report-2024-2025 original latest.
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
oil_refinery_comprehensive_20250804084928 (1).pptx
Business Analytics and business intelligence.pdf
Foundation of Data Science unit number two notes
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
.pdf is not working space design for the following data for the following dat...
Clinical guidelines as a resource for EBP(1).pdf
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf

Introducing Zap Q-Learning

  • 1. Zap Q-Learning Reinforcement Learning: Hidden Theory, and New Super-Fast Algorithms Center for Systems and Control (CSC@USC) and Ming Hsieh Institute for Electrical Engineering February 21, 2018 Adithya M. Devraj Sean P. Meyn Department of Electrical and Computer Engineering — University of Florida
  • 2. Zap Q-Learning Outline 1 Stochastic Approximation 2 Fastest Stochastic Approximation 3 Reinforcement Learning 4 Zap Q-Learning 5 Conclusions & Future Work 6 References
  • 4. Stochastic Approximation Basic Algorithm What is Stochastic Approximation? A simple goal: Find the solution θ∗ to ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 1 / 31
  • 5. Stochastic Approximation Basic Algorithm What is Stochastic Approximation? A simple goal: Find the solution θ∗ to ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 What makes this hard? 1 / 31
  • 6. Stochastic Approximation Basic Algorithm What is Stochastic Approximation? A simple goal: Find the solution θ∗ to ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 What makes this hard? 1 The function f and the distribution of the random vector W may not be known – we may only know something about the structure of the problem 1 / 31
  • 7. Stochastic Approximation Basic Algorithm What is Stochastic Approximation? A simple goal: Find the solution θ∗ to ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 What makes this hard? 1 The function f and the distribution of the random vector W may not be known – we may only know something about the structure of the problem 2 Even if everything is known, computation of the expectation may be expensive. For root finding, we may need to compute the expectation for many values of θ 1 / 31
  • 8. Stochastic Approximation Basic Algorithm What is Stochastic Approximation? A simple goal: Find the solution θ∗ to ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 What makes this hard? 1 The function f and the distribution of the random vector W may not be known – we may only know something about the structure of the problem 2 Even if everything is known, computation of the expectation may be expensive. For root finding, we may need to compute the expectation for many values of θ 3 Motivates stochastic approximation: θ(n + 1) = θ(n) + αnf(θ(n), W(n)) The recursive algorithms we come up with are often slow, and their variance may be infinite: typical in Q-learning [Devraj & M 2017] 1 / 31
  • 9. Stochastic Approximation ODE Method Algorithm and Convergence Analysis Algorithm: θ(n + 1) = θ(n) + αnf(θ(n), W(n)) Goal: ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 Interpretation: θ∗ ≡ stationary point of the ODE d dt ϑ(t) = ¯f(ϑ(t)) 2 / 31
  • 10. Stochastic Approximation ODE Method Algorithm and Convergence Analysis Algorithm: θ(n + 1) = θ(n) + αnf(θ(n), W(n)) Goal: ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 Interpretation: θ∗ ≡ stationary point of the ODE d dt ϑ(t) = ¯f(ϑ(t)) Analysis: Stability of the ODE ⊕ (See Borkar’s monograph) =⇒ lim n→∞ θ(n) = θ∗ 2 / 31
  • 11. Stochastic Approximation SA Example Stochastic Approximation Example Example: Monte-Carlo Monte-Carlo Estimation Estimate the mean η = c(X), where X is a random variable: η = c(x) fX(x) dx 3 / 31
  • 12. Stochastic Approximation SA Example Stochastic Approximation Example Example: Monte-Carlo Monte-Carlo Estimation Estimate the mean η = c(X), where X is a random variable SA interpretation: Find θ∗ solving 0 = E[f(θ, X)] = E[c(X) − θ] Algorithm: θ(n) = 1 n n i=1 c(X(i)) 3 / 31
  • 13. Stochastic Approximation SA Example Stochastic Approximation Example Example: Monte-Carlo Monte-Carlo Estimation Estimate the mean η = c(X), where X is a random variable SA interpretation: Find θ∗ solving 0 = E[f(θ, X)] = E[c(X) − θ] Algorithm: θ(n) = 1 n n i=1 c(X(i)) =⇒ (n + 1)θ(n + 1) = n+1 i=1 c(X(i)) = nθ(n) + c(X(n + 1)) =⇒ (n + 1)θ(n + 1) = (n + 1)θ(n) + [c(X(n + 1)) − θ(n)] SA Recursion: θ(n + 1) = θ(n) + αnf(θ(n), X(n + 1)) αn = ∞, α2 n < ∞ 3 / 31
  • 15. Fastest Stochastic Approximation Algorithm Performance Performance Criteria Two standard approaches to evaluate performance, ˜θ(n) := θ(n) − θ∗: 1 Finite-n bound: P{ ˜θ(n) ≥ ε} ≤ exp(−I(ε, n)) , I(ε, n) = O(nε2 ) 2 Asymptotic covariance: Σ = lim n→∞ nE ˜θ(n)˜θ(n)T , √ n˜θ(n) ≈ N(0, Σ) 4 / 31
  • 16. Fastest Stochastic Approximation Algorithm Performance Asymptotic Covariance Σ = lim n→∞ Σn = lim n→∞ nE ˜θ(n)˜θ(n)T , √ n˜θ(n) ≈ N(0, Σ) SA recursion for covariance: Σn+1 ≈ Σn + 1 n (A + 1 2I)Σn + Σn(A + 1 2I)T + Σ∆ A = d dθ ¯f (θ∗) Conclusions 1 If Re λ(A) ≥ −1 2 for some eigenvalue then Σ is (typically) infinite 2 If Re λ(A) < −1 2 for all, then Σ = limn→∞ Σn is the unique solution to the Lyapunov equation: 0 = (A + 1 2I)Σ + Σ(A + 1 2I)T + Σ∆ 5 / 31
  • 17. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance Introduce a d × d matrix gain sequence {Gn}: θ(n + 1) = θ(n) + 1 n + 1 Gnf(θ(n), X(n)) 6 / 31
  • 18. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance Introduce a d × d matrix gain sequence {Gn}: θ(n + 1) = θ(n) + 1 n + 1 Gnf(θ(n), X(n)) Assume it converges, and linearize: ˜θ(n + 1) ≈ ˜θ(n) + 1 n + 1 G A˜θ(n) + ∆(n + 1) , A = d dθ ¯f (θ∗ ) . 6 / 31
  • 19. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance Introduce a d × d matrix gain sequence {Gn}: θ(n + 1) = θ(n) + 1 n + 1 Gnf(θ(n), X(n)) Assume it converges, and linearize: ˜θ(n + 1) ≈ ˜θ(n) + 1 n + 1 G A˜θ(n) + ∆(n + 1) , A = d dθ ¯f (θ∗ ) . If G = G∗ := −A−1 then Resembles Monte-Carlo estimate Resembles Newton-Rapshon It is optimal: Σ∗ = G∗Σ∆G∗T ≤ ΣG any other G 6 / 31
  • 20. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance Introduce a d × d matrix gain sequence {Gn}: θ(n + 1) = θ(n) + 1 n + 1 Gnf(θ(n), X(n)) Assume it converges, and linearize: ˜θ(n + 1) ≈ ˜θ(n) + 1 n + 1 G A˜θ(n) + ∆(n + 1) , A = d dθ ¯f (θ∗ ) . If G = G∗ := −A−1 then Resembles Monte-Carlo estimate Resembles Newton-Rapshon It is optimal: Σ∗ = G∗Σ∆G∗T ≤ ΣG any other G Polyak-Ruppert averaging is also optimal, but first two bullets are missing. 6 / 31
  • 21. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance Example: return to Monte-Carlo θ(n + 1) = θ(n) + g n + 1 −θ(n) + X(n + 1) 7 / 31
  • 22. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance Example: return to Monte-Carlo θ(n + 1) = θ(n) + g n + 1 −θ(n) + X(n + 1) ∆(n) = X(n) − E[X(n)] 7 / 31
  • 23. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance ∆(n) = X(n) − E[X(n)]Normalization for analysis: ˜θ(n + 1) = ˜θ(n) + g n + 1 −˜θ(n) + ∆(n + 1) 7 / 31
  • 24. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance ∆(n) = X(n) − E[X(n)]Normalization for analysis: ˜θ(n + 1) = ˜θ(n) + g n + 1 −˜θ(n) + ∆(n + 1) Example: X(n) = W2(n), W ∼ N(0, 1) 7 / 31
  • 25. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance ∆(n) = X(n) − E[X(n)]Normalization for analysis: ˜θ(n + 1) = ˜θ(n) + g n + 1 −˜θ(n) + ∆(n + 1) Example: X(n) = W2(n), W ∼ N(0, 1) 0 1 2 3 4 5 g σ2 ∆ Σ = σ2 ∆ 2 g2 g − 1/2 Asymptotic variance as a function of g 7 / 31
  • 26. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance ∆(n) = X(n) − E[X(n)]Normalization for analysis: ˜θ(n + 1) = ˜θ(n) + g n + 1 −˜θ(n) + ∆(n + 1) Example: X(n) = W2(n), W ∼ N(0, 1) 0 1 2 3 4 5 t 104 0.4 0.6 0.8 1 1.2 (t) 20 30.8 10 15.8 1 3 0.5 0.1 g SA estimates of E[W2 ], W ∼ N(0, 1) 7 / 31
  • 27. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) Requires An ≈ A(θn) := d dθ ¯f (θn) 8 / 31
  • 28. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) θ(n + 1) = θ(n) + αn(−An)−1 f(θ(n), X(n)) An = An−1 + γn(An − An−1), An = d dθ f(θ(n), X(n)) 8 / 31
  • 29. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) θ(n + 1) = θ(n) + αn(−An)−1 f(θ(n), X(n)) An = An−1 + γn(An − An−1), An = d dθ f(θ(n), X(n)) An ≈ A(θn) requires high-gain, γn αn → ∞, n → ∞ 8 / 31
  • 30. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) θ(n + 1) = θ(n) + αn(−An)−1 f(θ(n), X(n)) An = An−1 + γn(An − An−1), An = d dθ f(θ(n), X(n)) An ≈ A(θn) requires high-gain, γn αn → ∞, n → ∞ Always: αn = 1/n. Numerics that follow: γn = (1/n)ρ, ρ ∈ (0.5, 1) 8 / 31
  • 31. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) θ(n + 1) = θ(n) + αn(−An)−1 f(θ(n), X(n)) An = An−1 + γn(An − An−1), An = d dθ f(θ(n), X(n)) An ≈ A(θn) requires high-gain, γn αn → ∞, n → ∞ Always: αn = 1/n. Numerics that follow: γn = (1/n)ρ, ρ ∈ (0.5, 1) ODE for Zap-SNR d dt xt = (−A(xt))−1 ¯f (xt), A(x) = d dx ¯f (x) 8 / 31
  • 32. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) θ(n + 1) = θ(n) + αn(−An)−1 f(θ(n), X(n)) An = An−1 + γn(An − An−1), An = d dθ f(θ(n), X(n)) An ≈ A(θn) requires high-gain, γn αn → ∞, n → ∞ Always: αn = 1/n. Numerics that follow: γn = (1/n)ρ, ρ ∈ (0.5, 1) ODE for Zap-SNR d dt xt = (−A(xt))−1 ¯f (xt), A(x) = d dx ¯f (x) Not necessarily stable (just like in deterministic Newton-Raphson) General conditions for convergence is open 8 / 31
  • 35. Reinforcement Learning RL & SA SA and RL Design Functional equations in Stochastic Control Always of the form 0 = E[F(h∗ , Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗ = ? 9 / 31
  • 36. Reinforcement Learning RL & SA SA and RL Design Functional equations in Stochastic Control Always of the form 0 = E[F(h∗ , Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗ = ? Φ(n) = (state, action) 9 / 31
  • 37. Reinforcement Learning RL & SA SA and RL Design Functional equations in Stochastic Control Always of the form 0 = E[F(h∗ , Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗ = ? Galerkin relaxation: 0 = E[F(hθ∗ , Φ(n + 1))ζn] , θ∗ = ? 9 / 31
  • 38. Reinforcement Learning RL & SA SA and RL Design Functional equations in Stochastic Control Always of the form 0 = E[F(h∗ , Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗ = ? Galerkin relaxation: 0 = E[F(hθ∗ , Φ(n + 1))ζn] , θ∗ = ? Necessary Ingredients: Parameterized family {hθ : θ ∈ Rd} Adapted, d-dimensional stochastic process {ζn} Examples are TD- and Q-Learning 9 / 31
  • 39. Reinforcement Learning RL & SA SA and RL Design Functional equations in Stochastic Control Always of the form 0 = E[F(h∗ , Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗ = ? Galerkin relaxation: 0 = E[F(hθ∗ , Φ(n + 1))ζn] , θ∗ = ? Necessary Ingredients: Parameterized family {hθ : θ ∈ Rd} Adapted, d-dimensional stochastic process {ζn} Examples are TD- and Q-Learning These algorithms are thus special cases of stochastic approximation (as we all know) 9 / 31
  • 40. Reinforcement Learning MDP Theory Stochastic Optimal Control MDP Model X is a stationary controlled Markov chain, with input U For all states x and sets A, P{X(n + 1) ∈ A | X(n) = x, U(n) = u, and prior history} = Pu(x, A) c: X × U → R is a cost function β < 1 a discount factor 10 / 31
  • 41. Reinforcement Learning MDP Theory Stochastic Optimal Control MDP Model X is a stationary controlled Markov chain, with input U For all states x and sets A, P{X(n + 1) ∈ A | X(n) = x, U(n) = u, and prior history} = Pu(x, A) c: X × U → R is a cost function β < 1 a discount factor Value function: h∗ (x) = min U ∞ n=0 βn E[c(X(n), U(n)) | X(0) = x] 10 / 31
  • 42. Reinforcement Learning MDP Theory Stochastic Optimal Control MDP Model X is a stationary controlled Markov chain, with input U For all states x and sets A, P{X(n + 1) ∈ A | X(n) = x, U(n) = u, and prior history} = Pu(x, A) c: X × U → R is a cost function β < 1 a discount factor Value function: h∗ (x) = min U ∞ n=0 βn E[c(X(n), U(n)) | X(0) = x] Bellman equation: h∗ (x) = min u {c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u]} 10 / 31
  • 43. Reinforcement Learning Q-Learning Q-function Trick to swap expectation and minimum Bellman equation: h∗ (x) = min u {c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u]} 11 / 31
  • 44. Reinforcement Learning Q-Learning Q-function Trick to swap expectation and minimum Bellman equation: h∗ (x) = min u {c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u]} Q-function: Q∗ (x, u) := c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u] 11 / 31
  • 45. Reinforcement Learning Q-Learning Q-function Trick to swap expectation and minimum Bellman equation: h∗ (x) = min u {c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u]} Q-function: Q∗ (x, u) := c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u] h∗ (x) = min u Q∗ (x, u) 11 / 31
  • 46. Reinforcement Learning Q-Learning Q-function Trick to swap expectation and minimum Bellman equation: h∗ (x) = min u {c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u]} Q-function: Q∗ (x, u) := c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u] h∗ (x) = min u Q∗ (x, u) Another Bellman equation: Q∗ (x, u) = c(x, u) + βE[Q∗ (X(n + 1)) | X(n) = x, U(n) = u] Q∗ (x) = min u Q∗ (x, u) 11 / 31
  • 47. Reinforcement Learning Q-Learning Q-Learning and Galerkin Relaxation Dynamic programming Find function Q∗ that solves E c(X(n), U(n)) + βQ∗ (X(n + 1)) − Q∗ (X(n), U(n)) | Fn = 0 12 / 31
  • 48. Reinforcement Learning Q-Learning Q-Learning and Galerkin Relaxation Dynamic programming Find function Q∗ that solves E c(X(n), U(n)) + βQ∗ (X(n + 1)) − Q∗ (X(n), U(n)) | Fn = 0 That is, 0 = E[F(Q∗ , Φ(n + 1)) | Φ0 . . . Φ(n)] , with Φ(n + 1) = (X(n + 1), X(n), U(n)). 12 / 31
  • 49. Reinforcement Learning Q-Learning Q-Learning and Galerkin Relaxation Dynamic programming Find function Q∗ that solves E c(X(n), U(n)) + βQ∗ (X(n + 1)) − Q∗ (X(n), U(n)) | Fn = 0 Q-Learning Find θ∗ that solves E c(X(n), U(n)) + βQθ∗ ((X(n + 1)) − Qθ∗ ((X(n), U(n)) ζn = 0 The family {Qθ} and eligibility vectors {ζn} are part of algorithm design. 12 / 31
  • 50. Reinforcement Learning Q-Learning Watkins’ Q-learning Find θ∗ that solves E c(X(n), U(n)) + βQθ∗ ((X(n + 1)) − Qθ∗ ((X(n), U(n)) ζn = 0 13 / 31
  • 51. Reinforcement Learning Q-Learning Watkins’ Q-learning Find θ∗ that solves E c(X(n), U(n)) + βQθ∗ ((X(n + 1)) − Qθ∗ ((X(n), U(n)) ζn = 0 Watkin’s algorithm is Stochastic Approximation The family {Qθ} and eligibility vectors {ζn} in this design: Linearly parameterized family of functions: Qθ(x, u) = θT ψ(x, u) ζn ≡ ψ(Xn, Un) and ψn(x, u) = 1{x = xn, u = un} (complete basis) 13 / 31
  • 52. Reinforcement Learning Q-Learning Watkins’ Q-learning Find θ∗ that solves E c(X(n), U(n)) + βQθ∗ ((X(n + 1)) − Qθ∗ ((X(n), U(n)) ζn = 0 Watkin’s algorithm is Stochastic Approximation The family {Qθ} and eligibility vectors {ζn} in this design: Linearly parameterized family of functions: Qθ(x, u) = θT ψ(x, u) ζn ≡ ψ(Xn, Un) and ψn(x, u) = 1{x = xn, u = un} (complete basis) Asymptotic covariance is typically infinite 13 / 31
  • 53. Reinforcement Learning Q-Learning Watkins’ Q-learning Big Question: Can we Zap Q-Learning? Find θ∗ that solves E c(X(n), U(n)) + βQθ∗ ((X(n + 1)) − Qθ∗ ((X(n), U(n)) ζn = 0 Watkin’s algorithm is Stochastic Approximation The family {Qθ} and eligibility vectors {ζn} in this design: Linearly parameterized family of functions: Qθ(x, u) = θT ψ(x, u) ζn ≡ ψ(Xn, Un) and ψn(x, u) = 1{x = xn, u = un} (complete basis) Asymptotic covariance is typically infinite 13 / 31
  • 54. 0 1 2 3 4 5 6 7 8 9 10 105 0 20 40 60 80 100 Watkins, Speedy Q-learning, Polyak-Ruppert Averaging Zap BellmanError n Zap Q-Learning
  • 55. Zap Q-Learning Asymptotic Covariance of Watkins’ Q-Learning Improvements are needed! 1 4 65 3 2 Histogram of parameter estimates after 106 iterations. 1000 200 300 400 486.6 0 10 20 30 40 n = 106 Histogram for θ θ* n(15) (15) Example from Devraj & M 2017 14 / 31
  • 56. Zap Q-Learning Watkin’s algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning 0 = ¯f(θ) = E f(θ, W(n)) := E ζn c(X(n), U(n)) + βQθ (X(n + 1)) − Qθ (X(n), U(n)) 15 / 31
  • 57. Zap Q-Learning Watkin’s algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning 0 = ¯f(θ) = E f(θ, W(n)) := E ζn c(X(n), U(n)) + βQθ (X(n + 1)) − Qθ (X(n), U(n)) A(θ) = d dθ ¯f (θ); 15 / 31
  • 58. Zap Q-Learning Watkin’s algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning 0 = ¯f(θ) = E f(θ, W(n)) := E ζn c(X(n), U(n)) + βQθ (X(n + 1)) − Qθ (X(n), U(n)) A(θ) = d dθ ¯f (θ); At points of differentiability: A(θ) = E ζn βψ(X(n + 1), φθ (X(n + 1))) − ψ(X(n), U(n)) T φθ (X(n + 1)) := arg min u Qθ (X(n + 1), u) 15 / 31
  • 59. Zap Q-Learning Watkin’s algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning 0 = ¯f(θ) = E f(θ, W(n)) := E ζn c(X(n), U(n)) + βQθ (X(n + 1)) − Qθ (X(n), U(n)) A(θ) = d dθ ¯f (θ); At points of differentiability: A(θ) = E ζn βψ(X(n + 1), φθ (X(n + 1))) − ψ(X(n), U(n)) T φθ (X(n + 1)) := arg min u Qθ (X(n + 1), u) Algorithm: θ(n + 1)= θ(n) + αn(−An)−1 (f(θ(n), Φ(n))); An = An−1 + γn(An − An−1); 15 / 31
  • 60. Zap Q-Learning Watkin’s algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning 0 = ¯f(θ) = E f(θ, W(n)) := E ζn c(X(n), U(n)) + βQθ (X(n + 1)) − Qθ (X(n), U(n)) A(θ) = d dθ ¯f (θ); At points of differentiability: A(θ) = E ζn βψ(X(n + 1), φθ (X(n + 1))) − ψ(X(n), U(n)) T φθ (X(n + 1)) := arg min u Qθ (X(n + 1), u) Algorithm: θ(n + 1)= θ(n) + αn(−An)−1 (f(θ(n), Φ(n))); An = An−1 + γn(An − An−1); An+1 := d dθ f (θn, Φ(n)) = ζn βψ(X(n + 1), φθn (X(n + 1))) − ψ(X(n), U(n)) T 15 / 31
  • 61. Zap Q-Learning Watkins' algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning ODE analysis: change of variables q = Q∗(ς) The functional Q∗ maps cost functions to Q-functions: q(x, u) = ς(x, u) + β Σx′ Pu(x, x′) min u′ q(x′, u′) 16 / 31
  • 62. Zap Q-Learning Watkins' algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning ODE analysis: change of variables q = Q∗(ς) The functional Q∗ maps cost functions to Q-functions: q(x, u) = ς(x, u) + β Σx′ Pu(x, x′) min u′ q(x′, u′) ODE for Zap-Q: qt = Q∗(ςt), d/dt ςt = −ςt + c ⇒ convergence, optimal covariance, ... 16 / 31
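  The functional Q∗ used in this ODE analysis is an ordinary fixed-point map, so when the model is available it can be evaluated by value iteration. A minimal sketch, assuming the transition matrices are stored as an array P[u, x, x′]:

```python
import numpy as np

def Q_star(varsigma, P, beta=0.99, tol=1e-10, max_iters=100_000):
    """Map a cost function varsigma(x,u) to its Q-function:
    q(x,u) = varsigma(x,u) + beta * sum_x' P_u(x,x') * min_u' q(x',u')."""
    q = np.zeros_like(varsigma)                           # varsigma: (n_states, n_actions)
    for _ in range(max_iters):
        v = q.min(axis=1)                                 # min over u' of q(x', u')
        q_new = varsigma + beta * np.einsum('uxy,y->xu', P, v)
        if np.max(np.abs(q_new - q)) < tol:               # contraction for beta < 1
            break
        q = q_new
    return q
```

  In the ODE above, ςt solves d/dt ςt = −ςt + c, so ςt → c and hence qt = Q∗(ςt) → Q∗(c), the optimal Q-function.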
  • 63. Zap Q-Learning Watkins' algorithm Zap Q-Learning Example: Stochastic Shortest Path [Figure: six-node graph, nodes 1–6] 17 / 31
  • 64. Zap Q-Learning Watkins' algorithm Zap Q-Learning Example: Stochastic Shortest Path [Figure: six-node graph] Convergence with Zap gain γn = n−0.85 Watkins' algorithm has infinite asymptotic covariance with αn = 1/n [Figure: convergence of Zap-Q learning — Bellman error vs. n for Watkins, Speedy Q-learning, Polyak-Ruppert averaging, and Zap] Discount factor: β = 0.99 17 / 31
  • 65. Zap Q-Learning Watkins' algorithm Zap Q-Learning Example: Stochastic Shortest Path [Figure: six-node graph] Convergence with Zap gain γn = n−0.85 Watkins' algorithm has infinite asymptotic covariance with αn = 1/n [Figure: convergence of Zap-Q learning — Bellman error vs. n for Watkins, Speedy Q-learning, Polyak-Ruppert averaging, Zap, and Zap with γn = αn] Discount factor: β = 0.99 17 / 31
  • 66. Zap Q-Learning Watkins' algorithm Zap Q-Learning Example: Stochastic Shortest Path [Figure: six-node graph] Convergence with Zap gain γn = n−0.85 Watkins' algorithm has infinite asymptotic covariance with αn = 1/n Optimal scalar gain is approximately αn = 1500/n [Figure: convergence of Zap-Q learning — Bellman error vs. n for Watkins, Watkins with g = 1500, Speedy Q-learning, Polyak-Ruppert averaging, Zap, and Zap with γn = αn] Discount factor: β = 0.99 17 / 31
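  The scalar-gain comparison on this slide can be read through standard SA theory: with step size αn = g/n the asymptotic covariance is finite only when every eigenvalue of gA has real part < −1/2, and in that case it solves a Lyapunov equation. A hedged sketch of how a gain such as g ≈ 1500 might be screened, assuming estimates of A = A(θ∗) and of the noise covariance ΣΔ are available (they are not computed here):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def asymptotic_covariance(A, Sigma_Delta, g):
    """Asymptotic covariance of SA with step size g/n, or None if it is infinite."""
    F = g * A + 0.5 * np.eye(A.shape[0])
    if np.max(np.linalg.eigvals(F).real) >= 0:        # equivalent to Re(lambda(gA)) >= -1/2
        return None                                   # covariance is infinite for this gain
    # Solve F Sigma + Sigma F^T = -g^2 Sigma_Delta
    return solve_continuous_lyapunov(F, -g**2 * Sigma_Delta)
```

  Scanning g and keeping the value that minimizes, say, the trace of the returned covariance is one plausible way a value like g = 1500 could be selected.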
  • 67. Zap Q-Learning Watkins' algorithm Zap Q-Learning Optimize Walk to Cafe [Figure: six-node graph] Convergence with Zap gain γn = n−0.85 [Figure: theoretical vs. empirical pdfs (1000 trials) of Wn = √n θ̃n, entries #10 and #18, at n = 10^4 and n = 10^6] CLT gives good prediction of finite-n performance 18 / 31
  • 68. Zap Q-Learning Watkins' algorithm Zap Q-Learning Optimize Walk to Cafe [Figure: six-node graph] Convergence with Zap gain γn = n−0.85 [Figure: theoretical vs. empirical pdfs (1000 trials) of Wn = √n θ̃n, entries #10 and #18, at n = 10^4 and n = 10^6] CLT gives good prediction of finite-n performance Discount factor: β = 0.99 19 / 31
  • 69. Zap Q-Learning Watkins' algorithm Zap Q-Learning Optimize Walk to Cafe Local convergence: θ(0) initialized in a neighborhood of θ∗ [Figure: histograms of the Bellman error at n = 10^6 for Watkins with gains g = 500, 1500, 5000, Speedy Q-learning, Polyak-Ruppert averaging, and Zap-Q with γn ≡ αn and γn ≡ αn^0.85] 20 / 31
  • 70. Zap Q-Learning Watkins' algorithm Zap Q-Learning Optimize Walk to Cafe Local convergence: θ(0) initialized in a neighborhood of θ∗ [Figures: histograms of the Bellman error at n = 10^6, and 2σ confidence intervals vs. n (10^3 to 10^6), for Watkins with gains g = 500, 1500, 5000, Speedy Q-learning, Polyak-Ruppert averaging, and Zap-Q with γn ≡ αn and γn ≡ αn^0.85] 2σ confidence intervals for the Q-learning algorithms 20 / 31
  • 71. Zap Q-Learning Optimal stopping Zap Q-Learning Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance State space: R^100 Parameterized Q-function: Qθ with θ ∈ R^10 [Figure: real eigenvalues λi(A), i = 1, ..., 10, on a log scale] Every eigenvalue λ of A is real and satisfies λ > −1/2, so the asymptotic covariance is infinite 21 / 31
  • 72. Zap Q-Learning Optimal stopping Zap Q-Learning Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance State space: R^100 Parameterized Q-function: Qθ with θ ∈ R^10 [Figure: real eigenvalues λi(A), i = 1, ..., 10, on a log scale] Every eigenvalue λ of A is real and satisfies λ > −1/2, so the asymptotic covariance is infinite The authors observed slow convergence and proposed a matrix gain sequence {Gn} (see refs for details) 21 / 31
  • 73. Zap Q-Learning Optimal stopping Zap Q-Learning Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance State space: R^100 Parameterized Q-function: Qθ with θ ∈ R^10 [Figure: eigenvalues of A and of GA for the finance example] The favorite choice of gain in [23] barely meets the criterion Re(λ(GA)) < −1/2 21 / 31
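  For a fixed matrix gain G, the same finiteness criterion applies to GA. A minimal check of the condition quoted on this slide, assuming G and A (or estimates of them) are at hand:

```python
import numpy as np

def gain_meets_criterion(G, A):
    """True if every eigenvalue of G A has real part < -1/2 (finite asymptotic covariance)."""
    eigs = np.linalg.eigvals(G @ A)
    return bool(np.max(eigs.real) < -0.5), eigs
```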
  • 74. Zap Q-Learning Optimal stopping Zap Q-Learning Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance State space: R^100. Parameterized Q-function: Qθ with θ ∈ R^10 [Figure: theoretical vs. empirical pdfs (1000 trials) of Wn = √n θ̃n, entries #1 and #7, at n = 2 × 10^6, for Zap-Q and G-Q] 22 / 31
  • 75. Zap Q-Learning Optimal stopping Zap Q-Learning Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance State space: R^100. Parameterized Q-function: Qθ with θ ∈ R^10 Histograms of the average reward obtained using the different algorithms: [Figure: histograms at n = 2 × 10^4, 2 × 10^5, and 2 × 10^6 for G-Q(0) with g = 100 and g = 200, and Zap-Q with ρ = 0.8, 0.85, and 1.0] 23 / 31
  • 76. Conclusions & Future Work Conclusions & Future Work Conclusions Reinforcement Learning is not just cursed by dimension, but also by variance We need better design tools to improve performance 24 / 31
  • 77. Conclusions & Future Work Conclusions & Future Work Conclusions Reinforcement Learning is not just cursed by dimension, but also by variance We need better design tools to improve performance The asymptotic covariance is an awesome design tool. It is also predictive of finite-n performance. Example: g∗ = 1500 was chosen based on asymptotic covariance 24 / 31
  • 78. Conclusions & Future Work Conclusions & Future Work Conclusions Reinforcement Learning is not just cursed by dimension, but also by variance We need better design tools to improve performance The asymptotic covariance is an awesome design tool. It is also predictive of finite-n performance. Example: g∗ = 1500 was chosen based on asymptotic covariance Future work: Q-learning with function-approximation Obtain conditions for a stable algorithm in a general setting Optimal stopping time problems Adaptive optimization of algorithm parameters 24 / 31
  • 79. Conclusions & Future Work Conclusions & Future Work Conclusions Reinforcement Learning is not just cursed by dimension, but also by variance We need better design tools to improve performance The asymptotic covariance is an awesome design tool. It is also predictive of finite-n performance. Example: g∗ = 1500 was chosen based on asymptotic covariance Future work: Q-learning with function-approximation Obtain conditions for a stable algorithm in a general setting Optimal stopping time problems Adaptive optimization of algorithm parameters Finite-time analysis 24 / 31
  • 80. Conclusions & Future Work Thank you! 25 / 31
  • 81. References Control Techniques for Complex Networks, Sean Meyn. Pre-publication version for on-line viewing; monograph available for purchase at your favorite retailer. More information available at http://www.cambridge.org/us/catalogue/catalogue.asp?isbn=9780521884419 Markov Chains and Stochastic Stability, S. P. Meyn and R. L. Tweedie. August 2008 pre-publication version for on-line viewing; monograph published February 2009. References 26 / 31
  • 82. References This lecture A. M. Devraj and S. P. Meyn, Zap Q-learning. Advances in Neural Information Processing Systems (NIPS). Dec. 2017. A. M. Devraj and S. P. Meyn, Fastest convergence for Q-learning. Available on ArXiv. Jul. 2017. 27 / 31
  • 83. References Selected References I [1] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv, July 2017. [2] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1990. Translated from the French by Stephen S. Wilson. [3] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency and Cambridge University Press (jointly), Delhi, India and Cambridge, UK, 2008. [4] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000. [5] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge University Press, Cambridge, second edition, 2009. Published in the Cambridge Mathematical Library. [6] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007. See last chapter on simulation and average-cost TD learning 28 / 31
  • 84. References Selected References II [7] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure. The Annals of Statistics, 13(1):236–245, 1985. [8] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro processes. Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research and Industrial Engineering, Ithaca, NY, 1988. [9] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i Telemekhanika (in Russian); translated in Automat. Remote Control, 51 (1991), pages 98–107, 1990. [10] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, 1992. [11] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. Ann. Appl. Probab., 14(2):796–819, 2004. [12] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems 24, pages 451–459. Curran Associates, Inc., 2011. 29 / 31
  • 85. References Selected References III [13] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010. [14] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, Cambridge, UK, 1989. [15] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992. [16] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn., 3(1):9–44, 1988. [17] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997. [18] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th Internat. Conf. on Neural Info. Proc. Systems, pages 1064–1070. MIT Press, 1997. [19] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In Advances in Neural Information Processing Systems, 2011. [20] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25, 2003. 30 / 31
  • 86. References Selected References IV [21] D. Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana. Feature selection for neuro-dynamic programming. In F. Lewis, editor, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. Wiley, 2011. [22] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999. [23] D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning. Discrete Event Dynamic Systems: Theory and Applications, 16(2):207–239, 2006. [24] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Mach. Learn., 22(1-3):33–57, 1996. [25] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn., 49(2-3):233–246, 2002. [26] A. Nedic and D. Bertsekas. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dyn. Systems: Theory and Appl., 13(1-2):79–110, 2003. [27] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In IEEE Conference on Decision and Control, pages 3598–3605, Dec. 2009. 31 / 31