Zap Q-Learning
Fastest Convergent Q-Learning
Advances in Reinforcement Learning Algorithms
Sean Meyn
Department of Electrical and Computer Engineering — University of Florida
Based on joint research with Adithya M. Devraj and Ana Bušić
Thanks to the National Science Foundation
Zap Q-Learning
Essential References
Simons tutorial, March 2018
Part I (Basics, with focus on variance of algorithms)
https://www.youtube.com/watch?v=dhEF5pfYmvc
Part II (Zap Q-learning)
https://www.youtube.com/watch?v=Y3w8f1xIb6s
A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. arXiv, July 2017.
(full tutorial on stochastic approximation)
A. M. Devraj and S. Meyn. Zap Q-Learning. In Advances in Neural Information Processing
Systems, pages 2235–2244, 2017.
A. M. Devraj, A. Bušić, and S. Meyn. Zap Q Learning — a user's guide. In Proceedings of
the Fifth Indian Control Conference, 9–11 January 2019.
Zap Q-Learning
Outline
1 Stochastic Approximation
2 Remedies for S L O W Convergence
3 Reinforcement Learning with Momentum
4 Examples
5 Conclusions & Future Work
6 References
Stochastic Approximation
Problem, Algorithm, and Issues

A simple goal: find the solution θ∗ ∈ R^d to

    f̄(θ∗) := E[f(θ, W)]|θ=θ∗ = 0

A linear example illustrates the challenges:

    E[f(θ, W)]|θ=θ∗ = Aθ∗ − b := E[A(W)]θ∗ − E[b(W)] = 0

Algorithm: observe An+1 := A(Wn+1), bn+1 := b(Wn+1), and update

    θn+1 = θn + αn+1(An+1θn − bn+1)

CLT variance:

    √n (θn − θ∗) → N(0, Σθ)

Typically, Σθ = ∞   [Devraj & Meyn, 2017]

[Figure: histogram of θn(15) over independent runs at time horizon n = one million; the estimates spread over roughly −5000 to 10000, while θ∗(15) ≈ 500]
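To make the slow-convergence point concrete, here is a minimal Python/NumPy sketch of the linear recursion above with αn = 1/n. The 2-dimensional matrix A, the noise model, and the run length are illustrative assumptions, not taken from the talk; A is Hurwitz, but one eigenvalue equals −0.2 > −1/2, which is exactly the regime in which the CLT covariance Σθ is infinite, so individual runs remain far from θ∗ even after many iterations.

    # Sketch: linear stochastic approximation with step size 1/n (illustrative toy model).
    import numpy as np

    rng = np.random.default_rng(0)
    A = np.array([[-0.2, 0.0],
                  [ 0.3, -1.0]])      # Hurwitz, but one eigenvalue is -0.2 > -1/2
    b = np.array([1.0, 2.0])
    theta_star = np.linalg.solve(A, b)

    def observe():
        # Noisy observations A_{n+1} = A(W_{n+1}), b_{n+1} = b(W_{n+1})
        W = rng.standard_normal()
        return A + 0.5 * W * np.eye(2), b + 0.5 * rng.standard_normal(2)

    theta = np.zeros(2)
    for n in range(1, 10**5 + 1):
        An, bn = observe()
        theta = theta + (1.0 / n) * (An @ theta - bn)   # theta_{n+1} = theta_n + alpha_{n+1} f_{n+1}(theta_n)

    print("theta_n:", theta, "  theta_star:", theta_star)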
Remedies for Slow Convergence
Remedies for S L O W Convergence: Stochastic Newton-Raphson
Fixing (Optimizing) the Variance

Goal: find θ∗ such that Aθ∗ − b = 0, with fn+1(θn) = An+1θn − bn+1

Stochastic Approximation (TD-learning!), taking αn = 1/n:

    θn+1 = θn + αn+1 fn+1(θn)

Optimal asymptotic variance:

    θn+1 = θn − αn+1 A⁻¹ fn+1(θn)

Don't know A; estimate it (Monte Carlo): An+1 ≈ A

Stochastic Newton-Raphson (Least-Squares TD-learning!):

    θn+1 = θn − αn+1 (An+1)⁻¹ fn+1(θn)
Remedies for S L O W Convergence: Stochastic Newton-Raphson
Fixing (Optimizing) the Variance

Goal: find θ∗ such that f̄(θ∗) = 0; fn+1 is no longer linear

Stochastic Approximation (Q-learning!):

    θn+1 = θn + αn+1 fn+1(θn)

Optimal asymptotic variance:

    θn+1 = θn − αn+1 A(θ∗)⁻¹ fn+1(θn)

Don't know A(θ∗); estimate it on a second time scale: An+1 ≈ A(θn) = ∂f̄(θn)

Zap Stochastic Newton-Raphson (Zap Q-learning!):

    θn+1 = θn − αn+1 (An+1)⁻¹ fn+1(θn)
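A hedged sketch of the Zap update above, applied to the same toy linear example used in the earlier sketch: a matrix estimate An+1 is maintained on a faster time scale (step size n^−0.85 here) and its inverse premultiplies the update. The step-size exponent, the initial matrix estimate, and the pseudo-inverse safeguard are illustrative choices, not prescriptions from the talk.

    # Sketch: Zap stochastic Newton-Raphson on the toy linear model (illustrative choices).
    import numpy as np

    rng = np.random.default_rng(1)
    A = np.array([[-0.2, 0.0],
                  [ 0.3, -1.0]])
    b = np.array([1.0, 2.0])
    theta_star = np.linalg.solve(A, b)

    theta = np.zeros(2)
    Ahat = -np.eye(2)                       # crude initial estimate of A = d f_bar / d theta
    for n in range(1, 10**5 + 1):
        An = A + 0.5 * rng.standard_normal() * np.eye(2)
        bn = b + 0.5 * rng.standard_normal(2)
        alpha, gamma = 1.0 / n, n ** -0.85  # slow step for theta, faster step for the matrix
        Ahat = Ahat + gamma * (An - Ahat)   # two-time-scale estimate of A
        f = An @ theta - bn                 # f_{n+1}(theta_n)
        theta = theta - alpha * np.linalg.pinv(Ahat) @ f

    print("Zap estimate:", theta, "  theta_star:", theta_star)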
Remedies for S L O W Convergence: Convergence for Zapped Watkins
Zap Q-learning

Discounted-cost optimal control problem: Q∗ = Q∗(c)

Watkins' algorithm: variance is infinite for β > 1/2  [Devraj and Meyn, 2017]
Zap Q-learning has optimal variance.

ODE analysis: change of variables q = Q∗(ς).
The functional Q∗ maps cost functions to Q-functions:

    q(x, u) = ς(x, u) + β Σ_{x'} Pu(x, x') min_{u'} q(x', u')

ODE for Zap-Q (in Watkins' setting):

    qt = Q∗(ςt),   (d/dt) ςt = −ςt + c

⇒ convergence, optimal covariance, ...
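As an illustration of how the Zap update is applied in the Watkins (tabular) setting, here is a hedged sketch on a small randomly generated MDP. The MDP, the discount factor, the step-size exponents, and the pseudo-inverse safeguard are all illustrative assumptions; the update follows the Zap SNR form θn+1 = θn − αn+1 (An+1)⁻¹ fn+1(θn), with An+1 a faster-time-scale estimate of A(θn).

    # Sketch: tabular Zap Q-learning on a small random MDP (illustrative assumptions).
    import numpy as np

    rng = np.random.default_rng(2)
    nX, nU, beta = 3, 2, 0.8
    P = rng.dirichlet(np.ones(nX), size=(nU, nX))   # P[u, x, :] = next-state distribution
    c = rng.uniform(0.0, 1.0, size=(nX, nU))        # one-step cost

    d = nX * nU
    def psi(x, u):
        e = np.zeros(d)
        e[x * nU + u] = 1.0                         # tabular basis: indicator of (x, u)
        return e

    theta = np.zeros(d)                             # flattened Q-function estimate
    Ahat = -np.eye(d)                               # initial estimate of A(theta)
    x = 0
    for n in range(1, 200_000 + 1):
        u = int(rng.integers(nU))                   # randomized exploration (off-policy)
        x_next = int(rng.choice(nX, p=P[u, x]))
        Q = theta.reshape(nX, nU)
        u_greedy = int(np.argmin(Q[x_next]))        # greedy action for the cost criterion
        td = c[x, u] + beta * Q[x_next, u_greedy] - Q[x, u]
        f = psi(x, u) * td                          # f_{n+1}(theta_n)
        A_n = np.outer(psi(x, u), beta * psi(x_next, u_greedy) - psi(x, u))
        alpha, gamma = 1.0 / n, n ** -0.85          # two time scales
        Ahat = Ahat + gamma * (A_n - Ahat)
        theta = theta - alpha * np.linalg.pinv(Ahat) @ f
        x = x_next

    # Compare with Q* from value iteration on the known model
    Qstar = np.zeros((nX, nU))
    for _ in range(2000):
        Qstar = c + beta * np.einsum('uxy,y->xu', P, Qstar.min(axis=1))
    print("max |Q_theta - Q*| =", np.abs(theta.reshape(nX, nU) - Qstar).max())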
[Figure: Bellman error vs. iterations (up to 10^6) for Zap Q-Learning]

Reinforcement Learning with Momentum
Reinforcement Learning with Momentum: Matrix Momentum
Momentum-based Stochastic Approximation

∆θ(n) := θ(n) − θ(n − 1)

Matrix-gain Stochastic Approximation:

    ∆θ(n + 1) = αn Gn+1 fn+1(θ(n))

Heavy-ball Stochastic Approximation:

    ∆θ(n + 1) = m ∆θ(n) + αn fn+1(θ(n))

Matrix heavy-ball Stochastic Approximation:

    ∆θ(n + 1) = Mn+1 ∆θ(n) + αn Gn+1 fn+1(θ(n))
Reinforcement Learning with Momentum: Matrix Momentum
Matrix Heavy-Ball Stochastic Approximation: optimizing {Mn+1} and {Gn+1}

Matrix heavy-ball Stochastic Approximation:

    ∆θ(n + 1) = Mn+1 ∆θ(n) + αn Gn+1 fn+1(θ(n))

Heuristic: assume ∆θ(n) → 0 much faster than θn → θ∗. Then

    ∆θ(n + 1) ≈ Mn+1 ∆θ(n + 1) + αn Gn+1 fn+1(θ(n))
              ≈ [I − Mn+1]⁻¹ αn Gn+1 fn+1(θ(n))
              = −αn (An+1)⁻¹ fn+1(θ(n))     Zap!

with the choices Mn+1 = I + ζ An+1 and Gn+1 = ζ I, ζ > 0
(since [I − Mn+1]⁻¹ Gn+1 = [−ζ An+1]⁻¹ ζ I = −(An+1)⁻¹).
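A short sketch of the matrix heavy-ball recursion with exactly the choices above, Mn+1 = I + ζ An+1 and Gn+1 = ζ I, again on the toy linear model used in the earlier sketches. The value ζ = 0.5 is an illustrative assumption for which eig(I + ζA) lies inside the unit disk.

    # Sketch: matrix heavy-ball SA with M = I + zeta*A_{n+1}, G = zeta*I (illustrative).
    import numpy as np

    rng = np.random.default_rng(3)
    A = np.array([[-0.2, 0.0],
                  [ 0.3, -1.0]])
    b = np.array([1.0, 2.0])
    theta_star = np.linalg.solve(A, b)
    zeta = 0.5                                   # eig(I + zeta*A) = {0.9, 0.5}: inside the unit disk

    theta, theta_prev = np.zeros(2), np.zeros(2)
    for n in range(1, 10**5 + 1):
        An = A + 0.5 * rng.standard_normal() * np.eye(2)
        bn = b + 0.5 * rng.standard_normal(2)
        f = An @ theta - bn                      # f_{n+1}(theta(n))
        M = np.eye(2) + zeta * An                # M_{n+1}
        delta = M @ (theta - theta_prev) + (1.0 / n) * zeta * f
        theta_prev, theta = theta, theta + delta # theta(n+1) = theta(n) + delta_theta(n+1)

    print("heavy-ball estimate:", theta, "  theta_star:", theta_star)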
Reinforcement Learning with Momentum: Coupling
Momentum-based Stochastic Approximation

An+1 ≈ (d/dθ) f̄(θn)

SNR:               ∆θ(n + 1) = −αn (An+1)⁻¹ fn+1(θ(n))
PolSA:             ∆θ(n + 1) = [I + ζ An+1] ∆θ(n) + αn ζ fn+1(θ(n))
(linearized) NeSA: ∆θ(n + 1) = [I + ζ An+1] ∆θ(n) + αn ζ fn+1(θ(n))

Coupling of SNR and PolSA: θn^SNR − θn^PolSA = O(1/n)

Linear model: fn+1(θ) = A(θ − θ∗) + ∆n+1, where
    {∆n+1} is a square-integrable martingale difference sequence,
    A is Hurwitz, and eig(I + ζA) lies in the open unit disk.

⇒ PolSA has the optimal asymptotic variance.
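To illustrate the coupling claim, the sketch below runs the SNR and PolSA recursions side by side on the toy linear model, driven by the same noise; here SNR uses a running Monte Carlo average of the observed matrices An. The printed distances are only an informal illustration under these ad hoc choices, not a verification of the O(1/n) rate.

    # Sketch: SNR and PolSA driven by common noise on the toy linear model (illustration only).
    import numpy as np

    rng = np.random.default_rng(4)
    A = np.array([[-0.2, 0.0],
                  [ 0.3, -1.0]])
    b = np.array([1.0, 2.0])
    zeta = 0.5

    th_snr = np.zeros(2)
    th_pol, th_pol_prev = np.zeros(2), np.zeros(2)
    Abar = -np.eye(2)                                  # running estimate of A for SNR
    for n in range(1, 10**5 + 1):
        An = A + 0.5 * rng.standard_normal() * np.eye(2)
        bn = b + 0.5 * rng.standard_normal(2)
        alpha = 1.0 / n
        Abar = Abar + (An - Abar) / n                  # Monte Carlo average of the A_n
        # SNR update
        th_snr = th_snr - alpha * np.linalg.pinv(Abar) @ (An @ th_snr - bn)
        # PolSA update
        delta = (np.eye(2) + zeta * An) @ (th_pol - th_pol_prev) \
                + alpha * zeta * (An @ th_pol - bn)
        th_pol_prev, th_pol = th_pol, th_pol + delta
        if n in (10**3, 10**4, 10**5):
            print(f"n = {n}:  ||theta_SNR - theta_PolSA|| = {np.linalg.norm(th_snr - th_pol):.2e}")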
Examples

[Figure: Bellman error vs. n (iterations up to 10^6) for Watkins, Speedy Q-learning, Polyak-Ruppert averaging, and Zap]
Examples: Much like Newton-Raphson
Zap Q-Learning: Fastest Convergent Q-Learning

[Figure: Bellman error vs. n (number of iterations, up to 10^6) for Classical Q-Learning, Speedy Q-Learning, Polyak-Ruppert Averaging, Polynomial Learning Rate, Q-Learning with Optimal Scalar Gain, Zap O(d), Zap Q-Learning, PolSA, and NeSA]
Examples: Much like Newton-Raphson
Zap Q-Learning: Fastest Convergent Q-Learning

Convergence for larger models:

[Figure: max Bellman error vs. n (up to 10^7) for Watkins Q-learning (online clock sampling), Zap, PolSA, and NeSA, with reference level 1/(1 − β); two models, d = 19 and d = 117]

Coupling is amazing.
Avoid randomized policies if possible.
Examples: Optimal stopping
Zap Q-Learning

Model of Tsitsiklis and Van Roy: optimal stopping time in finance
State space: R^100.  Parameterized Q-function: Qθ with θ ∈ R^10.

[Figure: the eigenvalues λi(A) are real and all satisfy λ > −1/2, so the asymptotic covariance is infinite]

The authors observed slow convergence and proposed a matrix gain sequence {Gn} (see refs for details).

[Figure: eigenvalues of A and of GA for the finance example]
The favorite choice of gain in [22] barely meets the criterion Re(λ(GA)) < −1/2.

Histograms of the average reward obtained using the different algorithms:

[Figure: histograms of the average reward for G-Q(0) and Zap-Q (step-size exponents ρ = 0.8, 0.85, 1.0; one G-Q(0) run with the gain doubled) at time horizons n = 2 × 10^4, 2 × 10^5, and 2 × 10^6]
Conclusions & Future Work

Conclusions

Reinforcement Learning is not just cursed by dimension, but also by variance.
We need better design tools to improve performance.

The asymptotic covariance is an awesome design tool, and it is also predictive of finite-n performance.
    Example: the optimal scalar gain g∗ was chosen based on the asymptotic covariance.

PolSA- and NeSA-based RL provide efficient alternatives to the expensive stochastic Newton-Raphson:
    PolSA has the same asymptotic variance as Zap, without matrix inversion.

Future work:
    Q-learning with function approximation
    Obtain conditions for a stable algorithm in a general setting
    Adaptive optimization of algorithm parameters: g∗, ζ∗, G∗, M∗
Thank you!
References

[Figure: book covers of Control Techniques for Complex Networks (S. Meyn, Cambridge University Press) and Markov Chains and Stochastic Stability (S. P. Meyn and R. L. Tweedie)]
References
This lecture

A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. arXiv, July 2017.
A. M. Devraj and S. Meyn. Zap Q-Learning. In Advances in Neural Information Processing
Systems, pages 2235–2244, 2017.
A. M. Devraj, A. Bušić, and S. Meyn. Zap Q Learning — a user's guide. In Proceedings of
the Fifth Indian Control Conference, 9–11 January 2019.
A. M. Devraj, A. Bušić, and S. Meyn. (stay tuned)
References
Selected References I
[1] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic
approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag,
Berlin, 1990. Translated from the French by Stephen S. Wilson.
[2] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan
Book Agency and Cambridge University Press (jointly), 2008.
[3] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic
approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000.
[4] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge
Mathematical Library, second edition, 2009.
[5] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007.
See last chapter on simulation and average-cost TD learning
[6] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure.
The Annals of Statistics, 13(1):236–245, 1985.
[7] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro processes.
Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research
and Industrial Engineering, Ithaca, NY, 1988.
References
Selected References II
[8] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i
telemekhanika. Translated in Automat. Remote Control, 51 (1991), pages 98–107, 1990.
[9] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging.
SIAM J. Control Optim., 30(4):838–855, 1992.
[10] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic
approximation. Ann. Appl. Probab., 14(2):796–819, 2004.
[11] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation
algorithms for machine learning. In Advances in Neural Information Processing Systems
24, pages 451–459. Curran Associates, Inc., 2011.
[12] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial
Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
[13] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College,
Cambridge, Cambridge, UK, 1989.
[14] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
References
Selected References III
[15] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn.,
3(1):9–44, 1988.
[16] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function
approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.
[17] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th
Internat. Conf. on Neural Info. Proc. Systems, pages 1064–1070. MIT Press, 1997.
[18] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In
Advances in Neural Information Processing Systems, 2011.
[19] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning
Research, 5(Dec):1–25, 2003.
[20] D. Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana. Feature selection for
neuro-dynamic programming. In F. Lewis, editor, Reinforcement Learning and
Approximate Dynamic Programming for Feedback Control. Wiley, 2011.
[21] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space
theory, approximation algorithms, and an application to pricing high-dimensional financial
derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999.
References
Selected References IV
[22] D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and
efficient temporal-difference learning. Discrete Event Dynamic Systems: Theory and
Applications, 16(2):207–239, 2006.
[23] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference
learning. Mach. Learn., 22(1-3):33–57, 1996.
[24] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn.,
49(2-3):233–246, 2002.
[25] A. Nedic and D. Bertsekas. Least squares policy evaluation algorithms with linear function
approximation. Discrete Event Dyn. Systems: Theory and Appl., 13(1-2):79–110, 2003.
[26] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In IEEE
Conference on Decision and Control, pages 3598–3605, Dec. 2009.