Zap Q-Learning
Fastest Convergent Q-Learning
Advances in Reinforcement Learning Algorithms
Sean Meyn
Department of Electrical and Computer Engineering — University of Florida
Based on joint research with Adithya M. Devraj and Ana Bušić
Thanks to the National Science Foundation
Zap Q-Learning
Essential References
Simons tutorial, March 2018
Part I (Basics, with focus on variance of algorithms)
https://www.youtube.com/watch?v=dhEF5pfYmvc
Part II (Zap Q-learning)
https://www.youtube.com/watch?v=Y3w8f1xIb6s
A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. arXiv, July 2017.
(full tutorial on stochastic approximation)
A. M. Devraj and S. Meyn. Zap Q-Learning. In Advances in Neural Information Processing
Systems, pages 2235–2244, 2017.
A. M. Devraj, A. Bušić, and S. Meyn. Zap Q Learning — a user's guide. In Proceedings of
the Fifth Indian Control Conference, 9–11 January 2019.
Zap Q-Learning
Outline
1 Stochastic Approximation
2 Remedies for S L O W Convergence
3 Reinforcement Learning with Momentum
4 Examples
5 Conclusions & Future Work
6 References
Stochastic Approximation
Problem, Algorithm, and Issues

A simple goal: find the solution θ∗ ∈ R^d to

    f̄(θ∗) := E[f(θ, W)]|θ=θ∗ = 0

A linear example illustrates the challenges:

    E[f(θ, W)]|θ=θ∗ = Aθ∗ − b := E[A(W)]θ∗ − E[b(W)] = 0

Algorithm: observe An+1 := A(Wn+1), bn+1 := b(Wn+1), and update

    θn+1 = θn + αn+1(An+1θn − bn+1)

CLT variance:

    √n (θn − θ∗) → N(0, Σθ)

Typically, Σθ = ∞   [Devraj & Meyn, 2017]

[Figure: histogram of θn(15) over independent runs at time horizon n = one million; the estimates spread over roughly −5000 to 10000, while θ∗(15) ≈ 500]
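To make the slow-convergence point concrete, here is a minimal Python/NumPy sketch of the linear recursion above with αn = 1/n. The 2-dimensional matrix A, the noise model, and the run length are illustrative assumptions, not taken from the talk; A is Hurwitz, but one eigenvalue equals −0.2 > −1/2, which is exactly the regime in which the CLT covariance Σθ is infinite, so individual runs remain far from θ∗ even after many iterations.

    # Sketch: linear stochastic approximation with step size 1/n (illustrative toy model).
    import numpy as np

    rng = np.random.default_rng(0)
    A = np.array([[-0.2, 0.0],
                  [ 0.3, -1.0]])      # Hurwitz, but one eigenvalue is -0.2 > -1/2
    b = np.array([1.0, 2.0])
    theta_star = np.linalg.solve(A, b)

    def observe():
        # Noisy observations A_{n+1} = A(W_{n+1}), b_{n+1} = b(W_{n+1})
        W = rng.standard_normal()
        return A + 0.5 * W * np.eye(2), b + 0.5 * rng.standard_normal(2)

    theta = np.zeros(2)
    for n in range(1, 10**5 + 1):
        An, bn = observe()
        theta = theta + (1.0 / n) * (An @ theta - bn)   # theta_{n+1} = theta_n + alpha_{n+1} f_{n+1}(theta_n)

    print("theta_n:", theta, "  theta_star:", theta_star)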
Remedies for Slow Convergence
Remedies for S L O W Convergence: Stochastic Newton-Raphson
Fixing (Optimizing) the Variance

Goal: find θ∗ such that Aθ∗ − b = 0, with fn+1(θn) = An+1θn − bn+1

Stochastic Approximation (TD-learning!), taking αn = 1/n:

    θn+1 = θn + αn+1 fn+1(θn)

Optimal asymptotic variance:

    θn+1 = θn − αn+1 A⁻¹ fn+1(θn)

Don't know A; estimate it (Monte Carlo): An+1 ≈ A

Stochastic Newton-Raphson (Least-Squares TD-learning!):

    θn+1 = θn − αn+1 (An+1)⁻¹ fn+1(θn)
Remedies for S L O W Convergence: Stochastic Newton-Raphson
Fixing (Optimizing) the Variance

Goal: find θ∗ such that f̄(θ∗) = 0; fn+1 is no longer linear

Stochastic Approximation (Q-learning!):

    θn+1 = θn + αn+1 fn+1(θn)

Optimal asymptotic variance:

    θn+1 = θn − αn+1 A(θ∗)⁻¹ fn+1(θn)

Don't know A(θ∗); estimate it on a second time scale: An+1 ≈ A(θn) = ∂f̄(θn)

Zap Stochastic Newton-Raphson (Zap Q-learning!):

    θn+1 = θn − αn+1 (An+1)⁻¹ fn+1(θn)
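A hedged sketch of the Zap update above, applied to the same toy linear example used in the earlier sketch: a matrix estimate An+1 is maintained on a faster time scale (step size n^−0.85 here) and its inverse premultiplies the update. The step-size exponent, the initial matrix estimate, and the pseudo-inverse safeguard are illustrative choices, not prescriptions from the talk.

    # Sketch: Zap stochastic Newton-Raphson on the toy linear model (illustrative choices).
    import numpy as np

    rng = np.random.default_rng(1)
    A = np.array([[-0.2, 0.0],
                  [ 0.3, -1.0]])
    b = np.array([1.0, 2.0])
    theta_star = np.linalg.solve(A, b)

    theta = np.zeros(2)
    Ahat = -np.eye(2)                       # crude initial estimate of A = d f_bar / d theta
    for n in range(1, 10**5 + 1):
        An = A + 0.5 * rng.standard_normal() * np.eye(2)
        bn = b + 0.5 * rng.standard_normal(2)
        alpha, gamma = 1.0 / n, n ** -0.85  # slow step for theta, faster step for the matrix
        Ahat = Ahat + gamma * (An - Ahat)   # two-time-scale estimate of A
        f = An @ theta - bn                 # f_{n+1}(theta_n)
        theta = theta - alpha * np.linalg.pinv(Ahat) @ f

    print("Zap estimate:", theta, "  theta_star:", theta_star)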
Remedies for S L O W Convergence: Convergence for Zapped Watkins
Zap Q-learning

Discounted-cost optimal control problem: Q∗ = Q∗(c)

Watkins' algorithm: variance is infinite for β > 1/2  [Devraj and Meyn, 2017]
Zap Q-learning has optimal variance.

ODE analysis: change of variables q = Q∗(ς).
The functional Q∗ maps cost functions to Q-functions:

    q(x, u) = ς(x, u) + β Σ_{x'} Pu(x, x') min_{u'} q(x', u')

ODE for Zap-Q (in Watkins' setting):

    qt = Q∗(ςt),   (d/dt) ςt = −ςt + c

⇒ convergence, optimal covariance, ...
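As an illustration of how the Zap update is applied in the Watkins (tabular) setting, here is a hedged sketch on a small randomly generated MDP. The MDP, the discount factor, the step-size exponents, and the pseudo-inverse safeguard are all illustrative assumptions; the update follows the Zap SNR form θn+1 = θn − αn+1 (An+1)⁻¹ fn+1(θn), with An+1 a faster-time-scale estimate of A(θn).

    # Sketch: tabular Zap Q-learning on a small random MDP (illustrative assumptions).
    import numpy as np

    rng = np.random.default_rng(2)
    nX, nU, beta = 3, 2, 0.8
    P = rng.dirichlet(np.ones(nX), size=(nU, nX))   # P[u, x, :] = next-state distribution
    c = rng.uniform(0.0, 1.0, size=(nX, nU))        # one-step cost

    d = nX * nU
    def psi(x, u):
        e = np.zeros(d)
        e[x * nU + u] = 1.0                         # tabular basis: indicator of (x, u)
        return e

    theta = np.zeros(d)                             # flattened Q-function estimate
    Ahat = -np.eye(d)                               # initial estimate of A(theta)
    x = 0
    for n in range(1, 200_000 + 1):
        u = int(rng.integers(nU))                   # randomized exploration (off-policy)
        x_next = int(rng.choice(nX, p=P[u, x]))
        Q = theta.reshape(nX, nU)
        u_greedy = int(np.argmin(Q[x_next]))        # greedy action for the cost criterion
        td = c[x, u] + beta * Q[x_next, u_greedy] - Q[x, u]
        f = psi(x, u) * td                          # f_{n+1}(theta_n)
        A_n = np.outer(psi(x, u), beta * psi(x_next, u_greedy) - psi(x, u))
        alpha, gamma = 1.0 / n, n ** -0.85          # two time scales
        Ahat = Ahat + gamma * (A_n - Ahat)
        theta = theta - alpha * np.linalg.pinv(Ahat) @ f
        x = x_next

    # Compare with Q* from value iteration on the known model
    Qstar = np.zeros((nX, nU))
    for _ in range(2000):
        Qstar = c + beta * np.einsum('uxy,y->xu', P, Qstar.min(axis=1))
    print("max |Q_theta - Q*| =", np.abs(theta.reshape(nX, nU) - Qstar).max())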
[Figure: Bellman error vs. iterations (up to 10^6) for Zap Q-Learning]

Reinforcement Learning with Momentum
Reinforcement Learning with Momentum: Matrix Momentum
Momentum-based Stochastic Approximation

∆θ(n) := θ(n) − θ(n − 1)

Matrix-gain Stochastic Approximation:

    ∆θ(n + 1) = αn Gn+1 fn+1(θ(n))

Heavy-ball Stochastic Approximation:

    ∆θ(n + 1) = m ∆θ(n) + αn fn+1(θ(n))

Matrix heavy-ball Stochastic Approximation:

    ∆θ(n + 1) = Mn+1 ∆θ(n) + αn Gn+1 fn+1(θ(n))
Reinforcement Learning with Momentum: Matrix Momentum
Matrix Heavy-Ball Stochastic Approximation: optimizing {Mn+1} and {Gn+1}

Matrix heavy-ball Stochastic Approximation:

    ∆θ(n + 1) = Mn+1 ∆θ(n) + αn Gn+1 fn+1(θ(n))

Heuristic: assume ∆θ(n) → 0 much faster than θn → θ∗. Then

    ∆θ(n + 1) ≈ Mn+1 ∆θ(n + 1) + αn Gn+1 fn+1(θ(n))
              ≈ [I − Mn+1]⁻¹ αn Gn+1 fn+1(θ(n))
              = −αn (An+1)⁻¹ fn+1(θ(n))     Zap!

with the choices Mn+1 = I + ζ An+1 and Gn+1 = ζ I, ζ > 0
(since [I − Mn+1]⁻¹ Gn+1 = [−ζ An+1]⁻¹ ζ I = −(An+1)⁻¹).
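A short sketch of the matrix heavy-ball recursion with exactly the choices above, Mn+1 = I + ζ An+1 and Gn+1 = ζ I, again on the toy linear model used in the earlier sketches. The value ζ = 0.5 is an illustrative assumption for which eig(I + ζA) lies inside the unit disk.

    # Sketch: matrix heavy-ball SA with M = I + zeta*A_{n+1}, G = zeta*I (illustrative).
    import numpy as np

    rng = np.random.default_rng(3)
    A = np.array([[-0.2, 0.0],
                  [ 0.3, -1.0]])
    b = np.array([1.0, 2.0])
    theta_star = np.linalg.solve(A, b)
    zeta = 0.5                                   # eig(I + zeta*A) = {0.9, 0.5}: inside the unit disk

    theta, theta_prev = np.zeros(2), np.zeros(2)
    for n in range(1, 10**5 + 1):
        An = A + 0.5 * rng.standard_normal() * np.eye(2)
        bn = b + 0.5 * rng.standard_normal(2)
        f = An @ theta - bn                      # f_{n+1}(theta(n))
        M = np.eye(2) + zeta * An                # M_{n+1}
        delta = M @ (theta - theta_prev) + (1.0 / n) * zeta * f
        theta_prev, theta = theta, theta + delta # theta(n+1) = theta(n) + delta_theta(n+1)

    print("heavy-ball estimate:", theta, "  theta_star:", theta_star)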
Reinforcement Learning with Momentum: Coupling
Momentum-based Stochastic Approximation

An+1 ≈ (d/dθ) f̄(θn)

SNR:               ∆θ(n + 1) = −αn (An+1)⁻¹ fn+1(θ(n))
PolSA:             ∆θ(n + 1) = [I + ζ An+1] ∆θ(n) + αn ζ fn+1(θ(n))
(linearized) NeSA: ∆θ(n + 1) = [I + ζ An+1] ∆θ(n) + αn ζ fn+1(θ(n))

Coupling of SNR and PolSA: θn^SNR − θn^PolSA = O(1/n)

Linear model: fn+1(θ) = A(θ − θ∗) + ∆n+1, where
    {∆n+1} is a square-integrable martingale difference sequence,
    A is Hurwitz, and eig(I + ζA) lies in the open unit disk.

⇒ PolSA has the optimal asymptotic variance.
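To illustrate the coupling claim, the sketch below runs the SNR and PolSA recursions side by side on the toy linear model, driven by the same noise; here SNR uses a running Monte Carlo average of the observed matrices An. The printed distances are only an informal illustration under these ad hoc choices, not a verification of the O(1/n) rate.

    # Sketch: SNR and PolSA driven by common noise on the toy linear model (illustration only).
    import numpy as np

    rng = np.random.default_rng(4)
    A = np.array([[-0.2, 0.0],
                  [ 0.3, -1.0]])
    b = np.array([1.0, 2.0])
    zeta = 0.5

    th_snr = np.zeros(2)
    th_pol, th_pol_prev = np.zeros(2), np.zeros(2)
    Abar = -np.eye(2)                                  # running estimate of A for SNR
    for n in range(1, 10**5 + 1):
        An = A + 0.5 * rng.standard_normal() * np.eye(2)
        bn = b + 0.5 * rng.standard_normal(2)
        alpha = 1.0 / n
        Abar = Abar + (An - Abar) / n                  # Monte Carlo average of the A_n
        # SNR update
        th_snr = th_snr - alpha * np.linalg.pinv(Abar) @ (An @ th_snr - bn)
        # PolSA update
        delta = (np.eye(2) + zeta * An) @ (th_pol - th_pol_prev) \
                + alpha * zeta * (An @ th_pol - bn)
        th_pol_prev, th_pol = th_pol, th_pol + delta
        if n in (10**3, 10**4, 10**5):
            print(f"n = {n}:  ||theta_SNR - theta_PolSA|| = {np.linalg.norm(th_snr - th_pol):.2e}")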
Examples

[Figure: Bellman error vs. n (iterations up to 10^6) for Watkins, Speedy Q-learning, Polyak-Ruppert averaging, and Zap]
Examples: Much like Newton-Raphson
Zap Q-Learning: Fastest Convergent Q-Learning

[Figure: Bellman error vs. n (number of iterations, up to 10^6) for Classical Q-Learning, Speedy Q-Learning, Polyak-Ruppert Averaging, Polynomial Learning Rate, Q-Learning with Optimal Scalar Gain, Zap O(d), Zap Q-Learning, PolSA, and NeSA]
Examples: Much like Newton-Raphson
Zap Q-Learning: Fastest Convergent Q-Learning

Convergence for larger models:

[Figure: max Bellman error vs. n (up to 10^7) for Watkins Q-learning (online clock sampling), Zap, PolSA, and NeSA, with reference level 1/(1 − β); two models, d = 19 and d = 117]

Coupling is amazing.
Avoid randomized policies if possible.
Examples: Optimal stopping
Zap Q-Learning

Model of Tsitsiklis and Van Roy: optimal stopping time in finance
State space: R^100.  Parameterized Q-function: Qθ with θ ∈ R^10.

[Figure: the eigenvalues λi(A) are real and all satisfy λ > −1/2, so the asymptotic covariance is infinite]

The authors observed slow convergence and proposed a matrix gain sequence {Gn} (see refs for details).

[Figure: eigenvalues of A and of GA for the finance example]
The favorite choice of gain in [22] barely meets the criterion Re(λ(GA)) < −1/2.

Histograms of the average reward obtained using the different algorithms:

[Figure: histograms of the average reward for G-Q(0) and Zap-Q (step-size exponents ρ = 0.8, 0.85, 1.0; one G-Q(0) run with the gain doubled) at time horizons n = 2 × 10^4, 2 × 10^5, and 2 × 10^6]
Conclusions & Future Work

Conclusions

Reinforcement Learning is not just cursed by dimension, but also by variance.
We need better design tools to improve performance.

The asymptotic covariance is an awesome design tool, and it is also predictive of finite-n performance.
    Example: the optimal scalar gain g∗ was chosen based on the asymptotic covariance.

PolSA- and NeSA-based RL provide efficient alternatives to the expensive stochastic Newton-Raphson:
    PolSA has the same asymptotic variance as Zap, without matrix inversion.

Future work:
    Q-learning with function approximation
    Obtain conditions for a stable algorithm in a general setting
    Adaptive optimization of algorithm parameters: g∗, ζ∗, G∗, M∗
Thank you!
References

[Figure: book covers of Control Techniques for Complex Networks (S. Meyn, Cambridge University Press) and Markov Chains and Stochastic Stability (S. P. Meyn and R. L. Tweedie)]
References
This lecture

A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. arXiv, July 2017.
A. M. Devraj and S. Meyn. Zap Q-Learning. In Advances in Neural Information Processing
Systems, pages 2235–2244, 2017.
A. M. Devraj, A. Bušić, and S. Meyn. Zap Q Learning — a user's guide. In Proceedings of
the Fifth Indian Control Conference, 9–11 January 2019.
A. M. Devraj, A. Bušić, and S. Meyn. (stay tuned)
References
Selected References I
[1] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic
approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag,
Berlin, 1990. Translated from the French by Stephen S. Wilson.
[2] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan
Book Agency and Cambridge University Press (jointly), 2008.
[3] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic
approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000.
[4] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge
Mathematical Library, second edition, 2009.
[5] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007.
See last chapter on simulation and average-cost TD learning
[6] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure.
The Annals of Statistics, 13(1):236–245, 1985.
[7] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro processes.
Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research
and Industrial Engineering, Ithaca, NY, 1988.
References
Selected References II
[8] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i
telemekhanika. Translated in Automat. Remote Control, 51 (1991), pages 98–107, 1990.
[9] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging.
SIAM J. Control Optim., 30(4):838–855, 1992.
[10] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic
approximation. Ann. Appl. Probab., 14(2):796–819, 2004.
[11] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation
algorithms for machine learning. In Advances in Neural Information Processing Systems
24, pages 451–459. Curran Associates, Inc., 2011.
[12] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial
Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
[13] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College,
Cambridge, Cambridge, UK, 1989.
[14] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
References
Selected References III
[15] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn.,
3(1):9–44, 1988.
[16] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function
approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.
[17] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th
Internat. Conf. on Neural Info. Proc. Systems, pages 1064–1070. MIT Press, 1997.
[18] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In
Advances in Neural Information Processing Systems, 2011.
[19] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning
Research, 5(Dec):1–25, 2003.
[20] D. Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana. Feature selection for
neuro-dynamic programming. In F. Lewis, editor, Reinforcement Learning and
Approximate Dynamic Programming for Feedback Control. Wiley, 2011.
[21] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space
theory, approximation algorithms, and an application to pricing high-dimensional financial
derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999.
References
Selected References IV
[22] D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and
efficient temporal-difference learning. Discrete Event Dynamic Systems: Theory and
Applications, 16(2):207–239, 2006.
[23] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference
learning. Mach. Learn., 22(1-3):33–57, 1996.
[24] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn.,
49(2-3):233–246, 2002.
[25] A. Nedic and D. Bertsekas. Least squares policy evaluation algorithms with linear function
approximation. Discrete Event Dyn. Systems: Theory and Appl., 13(1-2):79–110, 2003.
[26] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In IEEE
Conference on Decision and Control, pages 3598–3605, Dec. 2009.