Part 2: Variance Matters
Sean Meyn
Department of Electrical and Computer Engineering University of Florida
Inria International Chair Inria, Paris
Thanks to our sponsors: NSF and ARO
Part 2: Variance Matters
Outline
1 Stochastic Approximation
2 Understanding Variance
3 Unveiling Dynamics
4 Introduction to Zap
5 Conclusions
6 References
Part 2: Variance Matters Resources
ODE Method (used here with a different meaning than in the 1970s)
CS&RL, Chapter 8
The ODE Method for Asymptotic Statistics in Stochastic
Approximation and Reinforcement Learning [73]
Control Techniques for Complex Networks, Sean Meyn. Pre-publication version for on-line viewing; monograph available for purchase at your favorite retailer. More information available at http://www.cambridge.org/us/catalogue/catalogue.asp?isbn=9780521884419
Markov Chains and Stochastic Stability, S. P. Meyn and R. L. Tweedie. August 2008 pre-publication version for on-line viewing; monograph to appear February 2009.
[Cover formulas: π(f) < ∞,   ∆V(x) ≤ −f(x) + b I_C(x),   ‖P^n(x, ·) − π‖_f → 0,   sup_C E_x[S_{τ_C}(f)] < ∞]
1 / 37
Special Thanks
My interest in RL and SA began during my first sabbatical, at IISc with Vivek Borkar
Many other heroes along the way since
Metivier
Van Roy Tsitsiklis
Bertsekas
Konda
Priouret
Kushner
Surana
Huang
Ruppert Polyak
Szepesvari
Benveniste
Yin
Nedic Yu
Colombino
Dall’Anese
Bernstein
Chen & Chen
Barooah
Raman
Max
P.E. Caines
Ioannis Kontoyiannis Ana Busic Eric Moulines Adithya Devraj Many Others!
[Histograms of outcomes over (−100, 100) for Zap with gains 2, 5, 10]
Block diagram: θn+1 = θn + an+1 f(θn, ξn+1), with excitation ξ driving the estimates θn
Stochastic Approximation
Stochastic Approximation   ODE Method
What is Stochastic Approximation?   f̄(θ) = E[f(θ, ξ)]
No different than the last lecture: find a solution to f̄(θ∗) = 0
Example: f̄(θ) = −∇Γ(θ) for optimization. In RL the function f̄ is typically not of this form.
ODE algorithm:   d/dt ϑt = f̄(ϑt)
If stable: ϑt → θ∗ and f̄(ϑt) → f̄(θ∗) = 0.
Euler approximation:   θn+1 = θn + αn+1 f̄(θn)     Terminology: αn is usually called the step-size
Stochastic Approximation
θn+1 = θn + αn+1 f(θn, ξn+1), with ξn → ξ in dist.
     = θn + αn+1 { f̄(θn) + “NOISE” }
Under very general conditions: the ODE, the Euler approximation, and SA are all convergent to θ∗
[Robbins and Monro, 1951]; see Borkar’s monograph [70]
3 / 37
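Not on the slides: a minimal numerical sketch comparing the deterministic Euler recursion for the mean flow with the SA recursion, for a toy scalar problem with f̄(θ) = −(θ − 1). The noise model, step-size, and all constants are illustrative assumptions, not the lecture's example.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(theta, xi):
    # Noisy observation with mean field fbar(theta) = -(theta - 1), root theta* = 1
    return -(theta - 1.0) + xi

N = 5000
theta_sa = 10.0      # stochastic-approximation iterate
theta_euler = 10.0   # Euler approximation of the mean flow d/dt = fbar
for n in range(1, N + 1):
    a = 1.0 / n                          # step-size alpha_n = 1/n (illustrative)
    xi = rng.normal(scale=2.0)           # i.i.d. stand-in for the "NOISE" xi_{n+1}
    theta_sa += a * f(theta_sa, xi)      # theta_{n+1} = theta_n + alpha_{n+1} f(theta_n, xi_{n+1})
    theta_euler += a * (-(theta_euler - 1.0))

print(theta_sa, theta_euler)             # both approach theta* = 1
```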
Stochastic Approximation   ODE Method
Questions for Algorithm Design   f̄(θ) = E[f(θ, ξ)]
Stochastic Approximation
θn+1 = θn + αn+1 f(θn, ξn+1) = θn + αn+1 { f̄(θn) + Ξ̃n }   (NOISE)
Questions
Can we duplicate our success with QSA?
Is there a Perturbative Mean Flow?
Implications for transient and asymptotic behavior?
What are the implications for design?
Go beyond the last lecture: what if the mean flow d/dt ϑ = f̄(ϑ) is not stable, or “ill conditioned”?
4 / 37
Control Techniques for Complex Networks, page 528
Example 11.5.1. LSTD for the M/M/1 queue (minimum-variance algorithm)
Nonsense after One Million Samples
[Plots of the estimates θ1 and θ2 over 5 × 10^6 samples, for loads ρ = 0.8 and ρ = 0.9; h^θ(x) = θ1 x + θ2 x², with θ∗_1 = θ∗_2]
Understanding Variance
Understanding Variance   Disturbance decompositions
Perturbative Mean Flow   θn+1 = θn + αn+1 f(θn, Wn+1)
Prerequisites as before
Mean flow d/dt ϑ = f̄(ϑ) globally asymptotically stable, so ϑt → θ∗ from each initial condition
A tiny bit more is needed to ensure {θn} is bounded – see CSRL for details!
A∗ = ∂θ f̄(θ∗) Hurwitz: eigenvalues in the left half-plane of C
And a replacement for our almost periodic exploration:
ξn+1 = G0(Φn+1), with Φ a “nice Markov chain”
Assume here: finite state space and irreducible
Please don’t rule out periodicity! Remember our luck with QSA!
5 / 37
Understanding Variance   Disturbance decompositions
P-Mean Flow   θn+1 = θn + αn+1 f(θn, Wn+1) = θn + αn+1 { f̄(θn) + Ξ̃n }
Perturbative Mean Flow (details for fixed step-size)
Two distinctions: 1. We remain in discrete time, and 2. We obtain an expression for the sum of the NOISE:
∑_{n=0}^{N−1} Ξ̃n = ∑_{n=2}^{N+1} Wn + ∆f̂_N − α ∑_{n=1}^{N} Υn
{Wn} is white—beautiful theory waiting for us
∆f̂_N is uniformly bounded—adds O(1/N) in convergence rate
Bad news for bias: Υn = −(1/α) [ f̂_{n+1}(θn) − f̂_{n+1}(θ_{n−1}) ]
Foundation of all is Poisson’s equation: Ξ̃_{n−1} = { f̂_n(θ) − E[ f̂_{n+1}(θ) | F_n ] } evaluated at θ = θ_{n−1}
6 / 37
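Since the whole decomposition rests on Poisson's equation, here is a small sketch of how it can be solved numerically for a finite, irreducible Markov chain, as the prerequisites assume. The 3-state transition matrix and the function f below are purely illustrative, and the fundamental-matrix formula is one standard way to obtain a solution.

```python
import numpy as np

# Illustrative 3-state irreducible transition matrix and function (assumptions)
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])
f = np.array([1.0, 4.0, 9.0])

# Invariant distribution pi: left eigenvector of P for eigenvalue 1
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()

fbar = pi @ f                         # steady-state mean pi(f)
# Poisson's equation: fhat - P fhat = f - pi(f); solve via the fundamental matrix
Z = np.linalg.inv(np.eye(3) - P + np.outer(np.ones(3), pi))
fhat = Z @ (f - fbar)

# Check: fhat(x) - E[fhat(X_{n+1}) | X_n = x] = f(x) - pi(f) for every state x
print(np.allclose(fhat - P @ fhat, f - fbar))
```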
Understanding Variance   P-Mean Flow ⇒ Polyak-Ruppert Averaging
Design Implications   θn+1 − θn = α { f̄(θn) + Ξ̃n }   Constant Step-size
∑_{n=0}^{N−1} Ξ̃n = ∑_{n=2}^{N+1} Wn + ∆f̂_N − α ∑_{n=1}^{N} Υn
E[‖θn‖²] uniformly bounded for 0 < α ≤ α0, for some α0 > 0
Justifies θn = Xn + O(α²) (approximation in mean-square), where
Xn+1 − Xn = α [ A∗(Xn − θ∗) + Ξ̃n ]
Summing both sides: ∑_{n=0}^{N−1} (Xn+1 − Xn) = ∑_{n=0}^{N−1} α [ A∗(Xn − θ∗) + Ξ̃n ]
See a Polyak-Ruppert filter pop out?
7 / 37
Understanding Variance   P-Mean Flow ⇒ Polyak-Ruppert Averaging
Design Implications   θn+1 − θn = α { f̄(θn) + Ξ̃n }   Constant Step-size
∑_{n=0}^{N−1} Ξ̃n = ∑_{n=2}^{N+1} Wn + ∆f̂_N − α ∑_{n=1}^{N} Υn
(1/N) ∑_{n=0}^{N−1} (θn+1 − θn) = α { (1/N) ∑_{n=0}^{N−1} [ A∗(θn − θ∗) + Ξ̃n ] + O(α²) }
Polyak-Ruppert Representation   θ^PR_N = (1/N) ∑_{n=0}^{N−1} θn
(1/α)(1/N) ∆θN = A∗ [θ^PR_N − θ∗] + (1/N) ∑_{n=0}^{N−1} Ξ̃n + O(α²)
P-R representation = the statistically optimal + the statistically annoying:
θ^PR_N = θ∗ + [A∗]^{−1} (1/N) ∑_{n=1}^{N} Wn − [A∗]^{−1} (α/N) ∑_{n=1}^{N} Υn + O((αN)^{−1} + α²)
Annoyance Υn: introduces a bias of order O(α) (and variance not understood)
Υn = −(1/α) [ f̂_{n+1}(θn) − f̂_{n+1}(θ_{n−1}) ] ≈ −∂θ f̂_{n+1}(θn) · f(θ_{n−1}, Wn)
8 / 37
Understanding Variance   P-Mean Flow ⇒ Polyak-Ruppert Averaging
Design Implications   θn+1 − θn = αn+1 { f̄(θn) + Ξ̃n }   Vanishing Step-size
Slightly different starting point:
∑_{n=0}^{N−1} (1/αn+1)[θn+1 − θn] = ∑_{n=0}^{N−1} { f̄(θn) + Ξ̃n } = ∑_{n=0}^{N−1} { A∗(θn − θ∗) + Ξ̃n } + O( ∑_{n=0}^{N−1} α²_{n+1} )   [the final sum is assumed bounded]
Statistical memory ⇒ try αn = g/(1 + n/ne)^ρ, with 1/2 < ρ < 1
Ignoring transients (and ignoring the summation-by-parts calculation),
θ^PR_N = (1/(N − N0)) ∑_{n=N0}^{N−1} θn = θ∗ + (1/(N − N0)) [A∗]^{−1} ∑_{n=N0+1}^{N} Wn + O(1/N)
Optimality of PR-Averaging
Cov(θ^PR_N − θ∗) = (1/(N − N0)) Σ^PR_N,   Σ^PR_N → Σ^PR = [A∗]^{−1} ΣW [A∗ᵀ]^{−1}   minimal in the strongest sense
9 / 37
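A minimal sketch of this recipe, step-size αn = g/(1 + n/ne)^ρ plus Polyak-Ruppert averaging, on a linear scalar example with A∗ = −1. The gains, burn-in N0, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

theta_star, A_star = 1.0, -1.0
g, ne, rho = 1.0, 100, 0.8            # illustrative step-size parameters
N, N0 = 200_000, 20_000               # run length and burn-in for the average

theta = 10.0
history = np.empty(N)
for n in range(N):
    alpha = g / (1.0 + n / ne) ** rho            # alpha_n = g/(1 + n/ne)^rho
    W = rng.normal()                             # white noise term W_n
    theta += alpha * (A_star * (theta - theta_star) + W)
    history[n] = theta

theta_pr = history[N0:].mean()                   # Polyak-Ruppert average
print(theta_pr)   # close to theta* = 1, with covariance ~ Sigma_PR / (N - N0)
```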
Understanding Variance   P-Mean Flow ⇒ Polyak-Ruppert Averaging
Asymptotic Statistics
Optimality of PR-Averaging
Cov(θ^PR_N − θ∗) = (1/(N − N0)) Σ^PR_N,   Σ^PR_N → Σ^PR = [A∗]^{−1} ΣW [A∗ᵀ]^{−1}   minimal in the strongest sense
The Central Limit Theorem holds: a tremendous tool for validation, ...
[Histogram of √N θ̃N from 10³ independent runs]
10 / 37
αn = g/(1 + n/ne)^ρ
[Plots of θk and ϑτk against τk for ρ = 1.0, 0.9, 0.8, with τN = 2.8, 5.3, 11]
Unveiling Dynamics
Unveiling Dynamics   Two Sources of Error
SA Error   θn+1 = θn + αn+1 { f̄(θn) + “NOISE” }     d/dt ϑt = f̄(ϑt)
1  Asymptotic Covariance: with αn = g/(1 + n/ne)^ρ,
   (1/αn)[θn − ϑτn] ≈ N(0, Σ) without averaging, where τn = ∑_{k=1}^{n} αk
2  ϑt → θ∗ exponentially fast, but τn is increasing slowly, and nonlinear dynamics can complicate gain selection
What can happen in applications, using αn+1 = g/(1 + n/ne)^ρ:
θn far from θ∗: the dynamics are slow, need large g
θn ≈ θ∗: the best gain is far smaller
11 / 37
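To see how slowly the "ODE time" τn accumulates, the sketch below sums τN = ∑_{k≤N} αk for one million samples and a few values of ρ. The values g = 1/4 and ne = 1 are illustrative; the exact totals depend on these choices.

```python
import numpy as np

g, ne = 0.25, 1                     # illustrative gain and offset
N = 1_000_000
n = np.arange(1, N + 1)

for rho in (1.0, 0.9, 0.8):
    alpha = g / (1.0 + n / ne) ** rho
    tau_N = alpha.sum()             # tau_N: total ODE time covered by N SA steps
    print(rho, round(tau_N, 1))     # for rho = 1 this grows only like g*log(N)
```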
Unveiling Dynamics   Two Sources of Error
Two Sources of Error. Example: SGD   αn = g/(1 + n/ne)^ρ
Stochastic Gradient Descent:   L̄(θ) = E[L(θ, Φn)],   f̄(θ) = −∇L̄(θ),   θn+1 = θn − (g/n^ρ) ∇L(θn, Φn+1)
[Plot of L̄(θ) and f̄(θ) for −20 ≤ θ ≤ 20, with slopes −4, −1, and −0.34 marked]
ODE bound using ρ = 1 and setting ne = 1: since d/dt ϑt = f̄(ϑt) gives |ϑt − θ∗| ≤ |ϑ0 − θ∗| e^{−0.34 t},
|ϑτn − θ∗| ≤ |ϑ0 − θ∗| e^{0.34 g} n^{−0.34 g}
g ≥ 3 to kill the deterministic behavior, but g∗ = −[A∗]^{−1} = 1/4 gives optimal variance
Dynamics for g∗ = 1/4:
[Plots of θk and ϑτk against τk for ρ = 1.0, 0.9, 0.8, with τN = 2.8, 5.3, 11];   τN ≈ 3 for N = one million
CLT approximation: rapid for θ0 = 0, slow for θ0 = 100
Polyak-Ruppert to the rescue:
[Histograms of √N θ̃N from Polyak-Ruppert averaging, for small and big gains g = 1/4 and g = 1/0.34, with ρ = 1.0, 0.9, 0.8 and σθ = 5/2]
12 / 37
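A stripped-down SGD sketch of the gain dilemma. The slide's loss has two slopes (−4 near θ∗ and −0.34 far away); here, as a simplifying assumption, the gradient has slope 0.34 everywhere, which is enough to show how a small gain leaves a transient that decays only like n^{−0.34g}, while a large gain removes it.

```python
import numpy as np

rng = np.random.default_rng(2)

def grad_L(theta, noise):
    # Illustrative noisy gradient with mean 0.34 * theta (so theta* = 0)
    return 0.34 * theta + noise

def sgd(theta0, g, rho=1.0, N=100_000):
    theta = theta0
    for n in range(1, N + 1):
        alpha = g / n ** rho
        theta -= alpha * grad_L(theta, rng.normal())
    return theta

# Small g: transient from theta0 = 100 decays only like n^{-0.34 g};
# g around 1/0.34 or larger kills the transient within the run.
for g in (1 / 4, 1 / 0.34, 3.0):
    print(g, sgd(100.0, g))
```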
Unveiling Dynamics   Two Sources of Error
Two Sources of Error. Example: Tabular Q-Learning
g ≥ 1/(1 − γ) required, if using ρ = 1*   Stay tuned!
[Plot of the maximal Bellman error MaxBEQ_n over one million samples (n × 10^5), for Q-learning with g = gAD and with g = 1/(1 − γ)]   αn = g/n(x, u)
Awesome performance! (?)
Generic tabular Q-learning example. Discount factor γ.
*See Devraj & M 2017 (extension in Wainwright 2019); see also Szepesvári 1997
13 / 37
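A minimal sketch of tabular Q-learning with the per-pair step-size αn = g/n(x, u) and gain g = 1/(1 − γ). The random MDP, reward, exploration policy, and run length are all placeholder assumptions; this is not the lecture's benchmark.

```python
import numpy as np

rng = np.random.default_rng(3)

nX, nU, gamma = 6, 2, 0.99
P = rng.dirichlet(np.ones(nX), size=(nX, nU))   # illustrative random transition kernel
r = rng.uniform(size=(nX, nU))                  # illustrative reward r(x, u)

g = 1.0 / (1.0 - gamma)                         # gain g >= 1/(1 - gamma) for rho = 1
Q = np.zeros((nX, nU))
counts = np.zeros((nX, nU))                     # n(x, u): visits to each state-action pair

x = 0
for n in range(500_000):
    u = rng.integers(nU)                        # random exploration policy
    x_next = rng.choice(nX, p=P[x, u])
    counts[x, u] += 1
    alpha = g / counts[x, u]                    # alpha_n = g / n(x, u)
    td = r[x, u] + gamma * Q[x_next].max() - Q[x, u]
    Q[x, u] += alpha * td
    x = x_next

print(Q.max(axis=1))                            # approximate optimal value function
```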
Zap
Introduction to Zap   Newton-Raphson flow
Taming Nonlinear Dynamics
What if stability of d/dt ϑ = f̄(ϑ) is unknown? [typically the case in RL]
Newton-Raphson Flow [Smale 1976]
Idea: Interpret f̄ as the “parameter”:   d/dt f̄t = V(f̄t), and design V
Choosing V(f̄) = −f̄:   d/dt f̄(ϑt) = −f̄(ϑt), giving f̄(ϑt) = f̄(ϑ0) e^{−t}   Linear dynamics
Equivalently,   d/dt ϑt = −[A(ϑt)]^{−1} f̄(ϑt)
SA translation: Zap Stochastic Approximation
14 / 37
Introduction to Zap   Newton-Raphson flow
Zap Algorithm   Designed to emulate the Newton-Raphson flow
d/dt ϑt = −[A(ϑt)]^{−1} f̄(ϑt),   A(θ) = ∂/∂θ f̄(θ)
Zap-SA (designed to emulate deterministic Newton-Raphson)
θn+1 = θn + αn+1 [−Ân+1]^{−1} f(θn, ξn+1)
Ân+1 = Ân + βn+1 (An+1 − Ân),   An+1 = ∂θ f(θn, ξn+1)
Requires Ân+1 ≈ A(θn) := ∂θ f̄(θn), which requires high gain:   βn/αn → ∞ as n → ∞
Can use αn = 1/n (without averaging).
Numerics to come: use this choice, and βn = (1/n)^ρ, ρ ∈ (0.5, 1)
Stability? Virtually universal. Optimal variance, too!
Based on ancient theory from Ruppert & Polyak [80, 81, 79]
15 / 37
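A minimal sketch of the two-time-scale Zap-SA recursion on a linear root-finding problem whose mean flow is deliberately unstable (A is not Hurwitz). The matrix A, noise level, and gains are illustrative assumptions; in this linear case the observed Jacobian ∂θ f(θ, ξ) is simply A.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative problem: fbar(theta) = A (theta - theta_star), with A NOT Hurwitz,
# so the plain mean flow d/dt = fbar would be unstable.
A = np.array([[1.0, 2.0],
              [0.0, -3.0]])
theta_star = np.array([1.0, -1.0])

def f(theta, xi):
    return A @ (theta - theta_star) + xi

N = 200_000
theta = np.array([5.0, 5.0])
A_hat = -np.eye(2)                      # initial matrix-gain estimate
for n in range(1, N + 1):
    alpha = 1.0 / n                     # alpha_n = 1/n
    beta = (1.0 / n) ** 0.85            # beta_n = n^{-rho}, rho in (0.5, 1): high gain
    xi = rng.normal(size=2, scale=0.5)
    A_n = A                             # Jacobian of f(., xi); here it is exactly A
    A_hat += beta * (A_n - A_hat)       # fast estimate of A(theta_n)
    theta += alpha * np.linalg.solve(-A_hat, f(theta, xi))   # Zap-SA update

print(theta)   # converges to theta_star despite the unstable mean flow
```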
Conclusions
Thank you Lyapunov and Polyak!
[Figure: θk and ϑτk against τk for ρ = 0.8 (“Crank up gain: tame transients”), and a histogram of √(N − N0) θ̃^PR_N outcomes from M repeated runs (“Average: tame variance”)]
Steps to a Successful Design
1  Design f̄ so that the mean flow d/dt ϑ = f̄(ϑ) is GAS, with A∗ Hurwitz
2  Design the step-size: αn = g/(1 + n/ne)^ρ, with 1/2 < ρ < 1
3  Perform PR averaging:   θ^PR_N = (1/(N − N0)) ∑_{n=N0+1}^{N} θn
4  Repeat! Obtain the histogram { √(N − N0) θ̃^{PR (m)}_N : 1 ≤ m ≤ M } with θ0^{(m)} widely dispersed and N relatively small
What you will learn:
How big N needs to be for a meaningful estimate
Approximate confidence bounds
Heuristic: θ^PR_N ≈ θ∗ + (1/√(N − N0)) Z,   Z ∼ N(0, Σ^PR)
Next Steps:
A flash-crash control course!
Applications to RL
16 / 37
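Step 4 of the recipe, in miniature: a sketch that repeats the PR-averaged experiment M times from widely dispersed initial conditions and inspects √(N − N0)·(θ^PR_N − θ∗), to be compared with the CLT heuristic above. The linear model, gains, and run sizes are illustrative assumptions chosen to keep the sketch fast.

```python
import numpy as np

rng = np.random.default_rng(5)

theta_star, A_star = 1.0, -1.0
g, ne, rho = 1.0, 100, 0.8
N, N0, M = 20_000, 2_000, 100

def one_run(theta0):
    theta, hist = theta0, np.empty(N)
    for n in range(N):
        alpha = g / (1.0 + n / ne) ** rho
        theta += alpha * (A_star * (theta - theta_star) + rng.normal())
        hist[n] = theta
    return hist[N0:].mean()                     # PR average for one run

errors = np.array([np.sqrt(N - N0) * (one_run(rng.uniform(-50, 50)) - theta_star)
                   for _ in range(M)])
# The empirical spread approximates Sigma_PR; a histogram of `errors`
# reveals whether N is large enough for a meaningful estimate.
print(errors.mean(), errors.std())
```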
References
Control Techniques for Complex Networks, Sean Meyn. Pre-publication version for on-line viewing; monograph available for purchase at your favorite retailer. More information available at http://www.cambridge.org/us/catalogue/catalogue.asp?isbn=9780521884419
Markov Chains and Stochastic Stability, S. P. Meyn and R. L. Tweedie. August 2008 pre-publication version for on-line viewing; monograph to appear February 2009.
[Cover formulas: π(f) < ∞,   ∆V(x) ≤ −f(x) + b I_C(x),   ‖P^n(x, ·) − π‖_f → 0,   sup_C E_x[S_{τ_C}(f)] < ∞]
References
17 / 37
References
Control Background I
[1] K. J. Åström and R. M. Murray. Feedback Systems: An Introduction for Scientists and
Engineers. Princeton University Press, USA, 2008 (recent edition on-line).
[2] K. J. Åström and B. Wittenmark. Adaptive Control. Addison-Wesley Longman Publishing
Co., Inc., Boston, MA, USA, 2nd edition, 1994.
[3] A. Fradkov and B. T. Polyak. Adaptive and robust control in the USSR.
IFAC–PapersOnLine, 53(2):1373–1378, 2020. 21st IFAC World Congress.
[4] M. Krstic, P. V. Kokotovic, and I. Kanellakopoulos. Nonlinear and adaptive control
design. John Wiley & Sons, Inc., 1995.
[5] K. J. Åström. Theory and applications of adaptive control—a survey. Automatica,
19(5):471–486, 1983.
[6] K. J. Åström. Adaptive control around 1960. IEEE Control Systems Magazine,
16(3):44–49, 1996.
[7] L. Ljung. Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic
Control, 22(4):551–575, 1977.
18 / 37
References
Control Background II
[8] N. Matni, A. Proutiere, A. Rantzer, and S. Tu. From self-tuning regulators to
reinforcement learning and back again. In Proc. of the IEEE Conf. on Dec. and Control,
pages 3724–3740, 2019.
19 / 37
References
RL Background I
[9] S. Meyn. Control Systems and Reinforcement Learning. Cambridge University Press,
Cambridge, 2021.
[10] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press. On-line
edition at http://www.cs.ualberta.ca/~sutton/book/the-book.html, Cambridge,
MA, 2nd edition, 2018.
[11] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial
Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
[12] D. P. Bertsekas. Reinforcement learning and optimal control. Athena Scientific, Belmont,
MA, 2019.
[13] T. Lattimore and C. Szepesvari. Bandit Algorithms. Cambridge University Press, 2020.
[14] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn.,
3(1):9–44, 1988.
[15] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
[16] J. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning,
16:185–202, 1994.
20 / 37
References
RL Background II
[17] T. Jaakola, M. Jordan, and S. Singh. On the convergence of stochastic iterative dynamic
programming algorithms. Neural Computation, 6:1185–1201, 1994.
[18] B. Van Roy. Learning and Value Function Approximation in Complex Decision Processes.
PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, 1998.
[19] J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic
programming. Mach. Learn., 22(1-3):59–94, 1996.
[20] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function
approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.
[21] J. N. Tsitsiklis and B. V. Roy. Average cost temporal-difference learning. Automatica,
35(11):1799–1808, 1999.
[22] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space
theory, approximation algorithms, and an application to pricing high-dimensional financial
derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999.
[23] D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and
efficient temporal-difference learning. Discrete Event Dynamic Systems: Theory and
Applications, 16(2):207–239, 2006.
21 / 37
References
RL Background III
[24] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference
learning. Mach. Learn., 22(1-3):33–57, 1996.
[25] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn.,
49(2-3):233–246, 2002.
[26] A. Nedic and D. Bertsekas. Least squares policy evaluation algorithms with linear function
approximation. Discrete Event Dyn. Systems: Theory and Appl., 13(1-2):79–110, 2003.
[27] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th
Internat. Conf. on Neural Info. Proc. Systems, 1064–1070. MIT Press, 1997.
[28] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning
Research, 5(Dec):1–25, 2003.
[29] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In
Advances in Neural Information Processing Systems, 2011.
[30] D. Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana. Feature selection for
neuro-dynamic programming. In F. Lewis, editor, Reinforcement Learning and
Approximate Dynamic Programming for Feedback Control. Wiley, 2011.
22 / 37
References
RL Background IV
[31] A. M. Devraj and S. P. Meyn. Zap Q-learning. In Proc. of the Intl. Conference on Neural
Information Processing Systems, pages 2232–2241, 2017.
[32] S. Chen, A. M. Devraj, F. Lu, A. Busic, and S. Meyn. Zap Q-Learning with nonlinear
function approximation. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and
H. Lin, editors, Advances in Neural Information Processing Systems, and arXiv e-prints
1910.05405, volume 33, pages 16879–16890, 2020.
[33] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007.
See last chapter on simulation and average-cost TD learning
DQN:
[34] M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural
reinforcement learning method. In J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge, and
L. Torgo, editors, Machine Learning: ECML 2005, pages 317–328, Berlin, Heidelberg,
2005. Springer Berlin Heidelberg.
[35] S. Lange, T. Gabel, and M. Riedmiller. Batch reinforcement learning. In Reinforcement
learning, pages 45–73. Springer, 2012.
[36] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A.
Riedmiller. Playing Atari with deep reinforcement learning. ArXiv, abs/1312.5602, 2013.
23 / 37
References
RL Background V
[37] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves,
M. A. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik,
I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level
control through deep reinforcement learning. Nature, 518:529–533, 2015.
[38] O. Anschel, N. Baram, and N. Shimkin. Averaged-DQN: Variance reduction and
stabilization for deep reinforcement learning. In Proc. of ICML, pages 176–185.
JMLR.org, 2017.
Actor Critic / Policy Gradient
[39] P. J. Schweitzer. Perturbation theory and finite Markov chains. J. Appl. Prob., 5:401–403,
1968.
[40] C. D. Meyer, Jr. The role of the group generalized inverse in the theory of finite Markov
chains. SIAM Review, 17(3):443–464, 1975.
[41] P. W. Glynn. Stochastic approximation for Monte Carlo optimization. In Proceedings of
the 18th conference on Winter simulation, pages 356–365, 1986.
[42] R. J. Williams. Simple statistical gradient-following algorithms for connectionist
reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
24 / 37
References
RL Background VI
[43] T. Jaakkola, S. P. Singh, and M. I. Jordan. Reinforcement learning algorithm for partially
observable Markov decision problems. In Advances in neural information processing
systems, pages 345–352, 1995.
[44] X.-R. Cao and H.-F. Chen. Perturbation realization, potentials, and sensitivity analysis of
Markov processes. IEEE Transactions on Automatic Control, 42(10):1382–1393, Oct
1997.
[45] P. Marbach and J. N. Tsitsiklis. Simulation-based optimization of Markov reward
processes. IEEE Trans. Automat. Control, 46(2):191–209, 2001.
[46] V. Konda. Actor-critic algorithms. PhD thesis, Massachusetts Institute of Technology,
2002.
[47] V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In Advances in neural
information processing systems, pages 1008–1014, 2000.
[48] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for
reinforcement learning with function approximation. In Advances in neural information
processing systems, pages 1057–1063, 2000.
[49] P. Marbach and J. N. Tsitsiklis. Simulation-based optimization of Markov reward
processes. IEEE Trans. Automat. Control, 46(2):191–209, 2001.
25 / 37
References
RL Background VII
[50] S. M. Kakade. A natural policy gradient. In Advances in neural information processing
systems, pages 1531–1538, 2002.
[51] H. Mania, A. Guy, and B. Recht. Simple random search provides a competitive approach
to reinforcement learning. In Advances in Neural Information Processing Systems, pages
1800–1809, 2018.
MDPs, LPs and Convex Q:
[52] A. S. Manne. Linear programming and sequential decisions. Management Sci.,
6(3):259–267, 1960.
[53] C. Derman. Finite State Markovian Decision Processes, volume 67 of Mathematics in
Science and Engineering. Academic Press, Inc., 1970.
[54] V. S. Borkar. Convex analytic methods in Markov decision processes. In Handbook of
Markov decision processes, volume 40 of Internat. Ser. Oper. Res. Management Sci.,
pages 347–375. Kluwer Acad. Publ., Boston, MA, 2002.
[55] D. P. de Farias and B. Van Roy. The linear programming approach to approximate
dynamic programming. Operations Res., 51(6):850–865, 2003.
26 / 37
References
RL Background VIII
[56] D. P. de Farias and B. Van Roy. A cost-shaping linear program for average-cost
approximate dynamic programming with performance guarantees. Math. Oper. Res.,
31(3):597–620, 2006.
[57] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In Proc. of
the IEEE Conf. on Dec. and Control, pages 3598–3605, Dec. 2009.
[58] J. Bas Serrano, S. Curi, A. Krause, and G. Neu. Logistic Q-learning. In A. Banerjee and
K. Fukumizu, editors, Proc. of The Intl. Conference on Artificial Intelligence and
Statistics, volume 130, pages 3610–3618, 13–15 Apr 2021.
[59] F. Lu, P. G. Mehta, S. P. Meyn, and G. Neu. Convex Q-learning. In American Control
Conf., pages 4749–4756. IEEE, 2021.
[60] F. Lu, P. G. Mehta, S. P. Meyn, and G. Neu. Convex analytic theory for convex
Q-learning. In Conference on Decision and Control–to appear. IEEE, 2022.
Gator Nation:
[61] A. M. Devraj, A. Bušić, and S. Meyn. Fundamental design principles for reinforcement
learning algorithms. In K. G. Vamvoudakis, Y. Wan, F. L. Lewis, and D. Cansever,
editors, Handbook on Reinforcement Learning and Control, Studies in Systems, Decision
and Control series (SSDC, volume 325). Springer, 2021.
27 / 37
References
RL Background IX
[62] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv, July 2017
(extended version of NIPS 2017).
[63] A. M. Devraj. Reinforcement Learning Design with Optimal Learning Rate. PhD thesis,
University of Florida, 2019.
[64] A. M. Devraj and S. P. Meyn. Q-learning with uniformly bounded variance: Large
discounting is not a barrier to fast learning. IEEE Trans Auto Control (and
arXiv:2002.10301), 2021.
[65] A. M. Devraj, A. Bušić, and S. Meyn. On matrix momentum stochastic approximation
and applications to Q-learning. In Allerton Conference on Communication, Control, and
Computing, pages 749–756, Sep 2019.
28 / 37
References
Stochastic Miscellanea I
[66] S. Asmussen and P. W. Glynn. Stochastic Simulation: Algorithms and Analysis, volume 57
of Stochastic Modelling and Applied Probability. Springer-Verlag, New York, 2007.
[67] P. W. Glynn and S. P. Meyn. A Liapounov bound for solutions of the Poisson equation.
Ann. Probab., 24(2):916–931, 1996.
[68] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge
University Press, Cambridge, second edition, 2009. Published in the Cambridge
Mathematical Library.
[69] R. Douc, E. Moulines, P. Priouret, and P. Soulier. Markov Chains. Springer, 2018.
29 / 37
More Related Content

PDF
Introducing Zap Q-Learning
PDF
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
PDF
Zap Q-Learning - ISMP 2018
PDF
Pinning and facetting in multiphase LBMs
PDF
Convergence of ABC methods
PDF
Crib Sheet AP Calculus AB and BC exams
PPT
07 periodic functions and fourier series
PPTX
Signal Processing Homework Help
Introducing Zap Q-Learning
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
Zap Q-Learning - ISMP 2018
Pinning and facetting in multiphase LBMs
Convergence of ABC methods
Crib Sheet AP Calculus AB and BC exams
07 periodic functions and fourier series
Signal Processing Homework Help

Similar to DeepLearn2022 2. Variance Matters (20)

PDF
2.1 Calculus 2.formulas.pdf.pdf
PDF
THE CHORD GAP DIVERGENCE AND A GENERALIZATION OF THE BHATTACHARYYA DISTANCE
PDF
A sharp nonlinear Hausdorff-Young inequality for small potentials
PDF
IVR - Chapter 1 - Introduction
PDF
DeepLearn2022 1. Goals & AlgorithmDesign.pdf
PDF
Scattering theory analogues of several classical estimates in Fourier analysis
PPTX
Digital signal processing on arm new
PDF
The lattice Boltzmann equation: background, boundary conditions, and Burnett-...
PDF
QMC: Operator Splitting Workshop, Thresholdings, Robustness, and Generalized ...
PDF
02 CS316_algorithms_Asymptotic_Notations(2).pdf
PPTX
stochastic processes assignment help
PDF
ENFPC 2010
PDF
EC8553 Discrete time signal processing
PDF
Trajectory Alignment: Understanding the Edge of Stability Phenomenon via Bifu...
PDF
Introduction to the theory of optimization
PDF
A Fibonacci-like universe expansion on time-scale
PDF
NCE, GANs & VAEs (and maybe BAC)
PDF
CDT 22 slides.pdf
PDF
01Introduction_Lecture8signalmoddiscr.pdf
2.1 Calculus 2.formulas.pdf.pdf
THE CHORD GAP DIVERGENCE AND A GENERALIZATION OF THE BHATTACHARYYA DISTANCE
A sharp nonlinear Hausdorff-Young inequality for small potentials
IVR - Chapter 1 - Introduction
DeepLearn2022 1. Goals & AlgorithmDesign.pdf
Scattering theory analogues of several classical estimates in Fourier analysis
Digital signal processing on arm new
The lattice Boltzmann equation: background, boundary conditions, and Burnett-...
QMC: Operator Splitting Workshop, Thresholdings, Robustness, and Generalized ...
02 CS316_algorithms_Asymptotic_Notations(2).pdf
stochastic processes assignment help
ENFPC 2010
EC8553 Discrete time signal processing
Trajectory Alignment: Understanding the Edge of Stability Phenomenon via Bifu...
Introduction to the theory of optimization
A Fibonacci-like universe expansion on time-scale
NCE, GANs & VAEs (and maybe BAC)
CDT 22 slides.pdf
01Introduction_Lecture8signalmoddiscr.pdf

More from Sean Meyn (20)

PDF
DeepLearn2022 3. TD and Q Learning
PDF
Smart Grid Tutorial - January 2019
PDF
State Space Collapse in Resource Allocation for Demand Dispatch - May 2019
PDF
Irrational Agents and the Power Grid
PDF
Reinforcement Learning: Hidden Theory and New Super-Fast Algorithms
PDF
State estimation and Mean-Field Control with application to demand dispatch
PDF
Demand-Side Flexibility for Reliable Ancillary Services
PDF
Spectral Decomposition of Demand-Side Flexibility for Reliable Ancillary Serv...
PDF
Demand-Side Flexibility for Reliable Ancillary Services in a Smart Grid: Elim...
PDF
Why Do We Ignore Risk in Power Economics?
PDF
Distributed Randomized Control for Ancillary Service to the Power Grid
PDF
Ancillary service to the grid from deferrable loads: the case for intelligent...
PDF
2012 Tutorial: Markets for Differentiated Electric Power Products
PDF
Control Techniques for Complex Systems
PDF
Tutorial for Energy Systems Week - Cambridge 2010
PDF
Panel Lecture for Energy Systems Week
PDF
The Value of Volatile Resources... Caltech, May 6 2010
PDF
Approximate dynamic programming using fluid and diffusion approximations with...
PDF
Anomaly Detection Using Projective Markov Models
PDF
Markov Tutorial CDC Shanghai 2009
DeepLearn2022 3. TD and Q Learning
Smart Grid Tutorial - January 2019
State Space Collapse in Resource Allocation for Demand Dispatch - May 2019
Irrational Agents and the Power Grid
Reinforcement Learning: Hidden Theory and New Super-Fast Algorithms
State estimation and Mean-Field Control with application to demand dispatch
Demand-Side Flexibility for Reliable Ancillary Services
Spectral Decomposition of Demand-Side Flexibility for Reliable Ancillary Serv...
Demand-Side Flexibility for Reliable Ancillary Services in a Smart Grid: Elim...
Why Do We Ignore Risk in Power Economics?
Distributed Randomized Control for Ancillary Service to the Power Grid
Ancillary service to the grid from deferrable loads: the case for intelligent...
2012 Tutorial: Markets for Differentiated Electric Power Products
Control Techniques for Complex Systems
Tutorial for Energy Systems Week - Cambridge 2010
Panel Lecture for Energy Systems Week
The Value of Volatile Resources... Caltech, May 6 2010
Approximate dynamic programming using fluid and diffusion approximations with...
Anomaly Detection Using Projective Markov Models
Markov Tutorial CDC Shanghai 2009

Recently uploaded (20)

PPTX
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PDF
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
PPTX
web development for engineering and engineering
DOCX
573137875-Attendance-Management-System-original
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
PPT
Project quality management in manufacturing
PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PPTX
Welding lecture in detail for understanding
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PDF
Digital Logic Computer Design lecture notes
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PPTX
Internet of Things (IOT) - A guide to understanding
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
web development for engineering and engineering
573137875-Attendance-Management-System-original
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
Project quality management in manufacturing
Automation-in-Manufacturing-Chapter-Introduction.pdf
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
Welding lecture in detail for understanding
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
Digital Logic Computer Design lecture notes
Model Code of Practice - Construction Work - 21102022 .pdf
CYBER-CRIMES AND SECURITY A guide to understanding
UNIT-1 - COAL BASED THERMAL POWER PLANTS
Internet of Things (IOT) - A guide to understanding

DeepLearn2022 2. Variance Matters

  • 1. Part 2: Variance Matters Sean Meyn Department of Electrical and Computer Engineering University of Florida Inria International Chair Inria, Paris Thanks to to our sponsors: NSF and ARO
  • 2. Part 2: Variance Matters Outline 1 Stochastic Approximation 2 Understanding Variance 3 Unveiling Dynamics 4 Introduction to Zap 5 Conclusions 6 References
  • 3. Part 2: Variance Matters Resources ODE Method (using different meaning than in the 1970s) CS&RL, Chapter 8 The ODE Method for Asymptotic Statistics in Stochastic Approximation and Reinforcement Learning [73] Control Techniques FOR Complex Networks Sean Meyn Pre-publication version for on-line viewing. Monograph available for purchase at your favorite retailer More information available at http://www.cambridge.org/us/catalogue/catalogue.asp?isbn=9780521884419 Markov Chains and Stochastic Stability S. P. Meyn and R. L. Tweedie August 2008 Pre-publication version for on-line viewing. Monograph to appear Februrary 2009 π(f ) < ∞ ∆V (x) ≤ −f(x) + bIC(x) Pn (x, · ) − πf → 0 sup C E x [S τ C (f )] ∞ 1 / 37
  • 4. Special Thanks My interests in RLSA began during my first sabbatical—at IISc with Vivek Borkar Many other heroes along the way since Metivier Van Roy Tsitsiklis Bertsekas Konda Priouret Kushner Surana Huang Ruppert Polyak Szepesvari Benveniste Yin Nedic Yu Colombino Dall’Anese Bernstein Chen Chen Barooah Raman Max P.E. Caines Ioannis Kontoyiannis Ana Busic Eric Moulines Adithya Devraj Many Others!
  • 5. -100 0 100 -100 0 100 -100 0 100 Zap 2 Zap 5 Zap 10 θn+1 = θn + an+1f(θn, ξn+1) θn Excitation Estimates Stochastic Approximation
  • 6. Stochastic Approximation ODE Method What is Stochastic Approximation? ¯ f(θ) = E[f(θ, ξ)] No different than last lecture: find solution to ¯ f(θ∗ ) = 0 Example: ¯ f(θ) = −∇Γ (θ) for optimization. In RL the function ¯ f is typically not of this form. 3 / 37
  • 7. Stochastic Approximation ODE Method What is Stochastic Approximation? ¯ f(θ) = E[f(θ, ξ)] No different than last lecture: find solution to ¯ f(θ∗ ) = 0 ODE algorithm: d dt ϑt = ¯ f(ϑt) If stable: ϑt → θ∗ and ¯ f(ϑt) → ¯ f(θ∗ ) = 0. 3 / 37
  • 8. Stochastic Approximation ODE Method What is Stochastic Approximation? ¯ f(θ) = E[f(θ, ξ)] No different than last lecture: find solution to ¯ f(θ∗ ) = 0 ODE algorithm: d dt ϑt = ¯ f(ϑt) If stable: ϑt → θ∗ and ¯ f(ϑt) → ¯ f(θ∗ ) = 0. Euler approximation: θn+1 = θn + αn+1 ¯ f(θn) Terminology: αn usually called the step-size 3 / 37
  • 9. Stochastic Approximation ODE Method What is Stochastic Approximation? ¯ f(θ) = E[f(θ, ξ)] No different than last lecture: find solution to ¯ f(θ∗ ) = 0 ODE algorithm: d dt ϑt = ¯ f(ϑt) If stable: ϑt → θ∗ and ¯ f(ϑt) → ¯ f(θ∗ ) = 0. Euler approximation: θn+1 = θn + αn+1 ¯ f(θn) Stochastic Approximation θn+1 = θn + αn+1f(θn, ξn+1) with ξn → ξ in dist. 3 / 37
  • 10. Stochastic Approximation ODE Method What is Stochastic Approximation? ¯ f(θ) = E[f(θ, ξ)] No different than last lecture: find solution to ¯ f(θ∗ ) = 0 ODE algorithm: d dt ϑt = ¯ f(ϑt) If stable: ϑt → θ∗ and ¯ f(ϑt) → ¯ f(θ∗ ) = 0. Euler approximation: θn+1 = θn + αn+1 ¯ f(θn) Stochastic Approximation θn+1 = θn + αn+1f(θn, ξn+1) = θn + αn+1 ¯ f(θn) + “NOISE” Under very general conditions: the ODE, the Euler approximation, and SA are all convergent to θ∗ 3 / 37
  • 11. Stochastic Approximation ODE Method What is Stochastic Approximation? ¯ f(θ) = E[f(θ, ξ)] No different than last lecture: find solution to ¯ f(θ∗ ) = 0 ODE algorithm: d dt ϑt = ¯ f(ϑt) If stable: ϑt → θ∗ and ¯ f(ϑt) → ¯ f(θ∗ ) = 0. Euler approximation: θn+1 = θn + αn+1 ¯ f(θn) Stochastic Approximation θn+1 = θn + αn+1f(θn, ξn+1) = θn + αn+1 ¯ f(θn) + “NOISE” Under very general conditions: the ODE, the Euler approximation, and SA are all convergent to θ∗ [Robbins and Monro, 1951] see Borkar’s monograph [70] 3 / 37
  • 12. Stochastic Approximation ODE Method Questions for Algorithm Design ¯ f(θ) = E[f(θ, ξ)] Stochastic Approximation θn+1 = θn + αn+1f(θn, ξn+1) = θn + αn+1 ¯ f(θn) + e Ξn (NOISE) Questions Can we duplicate our success with QSA? Is there a Perturbative Mean Flow? Implications to transient and asymptotic behavior? What are the implications to design? 4 / 37
  • 13. Stochastic Approximation ODE Method Questions for Algorithm Design ¯ f(θ) = E[f(θ, ξ)] Stochastic Approximation θn+1 = θn + αn+1f(θn, ξn+1) = θn + αn+1 ¯ f(θn) + e Ξn Questions Can we duplicate our success with QSA? Is there a Perturbative Mean Flow? Implications to transient and asymptotic behavior? What are the implications to design? Go beyond the last lecture: What if the mean flow d dt ϑ = ¯ f(ϑ) is not stable, or “ill conditioned”? 4 / 37
  • 14. Control Techniques for Complex Networks page 528 Example 11.5.1. LSTD for the M/M/1 queue (minimum variance algorithm) Nonsense after One Million Samples 0 10 20 30 0 5 10 15 ρ = 0.8 ρ = 0.9 hθ (x) = θ1 θ1 x + θ2x2 θ2 θ∗ 1 = θ∗ 2 0 1 2 3 4 5 x 10 6 0 1 2 3 4 5 x 10 6 Understanding Variance
  • 15. Understanding Variance Disturbance decompositions Perturbative Mean Flow θn+1 = θn + αn+1f(θn, Wn+1) Prerequisites as before Mean flow d dt ϑ = ¯ f(ϑ) globally asymptotically stable So ϑt → θ∗ from each initial condition A tiny bit more is needed to ensure {θn} is bounded – see CSRL for details! A∗ = ∂θ ¯ f (θ∗) Hurwitz eigenvalues in left half plane of C 5 / 37
  • 16. Understanding Variance Disturbance decompositions Perturbative Mean Flow θn+1 = θn + αn+1f(θn, Wn+1) Prerequisites as before Mean flow d dt ϑ = ¯ f(ϑ) globally asymptotically stable So ϑt → θ∗ from each initial condition A tiny bit more is needed to ensure {θn} is bounded – see CSRL for details! A∗ = ∂θ ¯ f (θ∗) Hurwitz eigenvalues in left half plane of C And a replacement for our almost periodic exploration: ξn+1 = G0(Φn+1), with Φ a “nice Markov chain” Assume here: finite state space and irreducible Please don’t rule out periodicity! Remember our luck with QSA! 5 / 37
  • 17. Understanding Variance Disturbance decompositions P- Mean Flow θn+1 = θn + αn+1f(θn, Wn+1) = θn + αn+1 ¯ f(θn) + e Ξn Perturbative Mean Flow (details for fixed step-size) Two distinctions: 1. We remain in discrete time, and 2. We obtain an expression for the sum of the NOISE: N−1 X n=0 e Ξn = Foundation of all is Poisson’s equation: b fn(θ) − E[ b fn+1(θ) | Fn]
  • 20. Understanding Variance Disturbance decompositions P- Mean Flow θn+1 = θn + αn+1f(θn, Wn+1) = θn + αn+1 ¯ f(θn) + e Ξn Perturbative Mean Flow (details for fixed step-size) Two distinctions: 1. We remain in discrete time, and 2. We obtain an expression for the sum of the NOISE: N−1 X n=0 e Ξn = N+1 X n=2 Wn {Wn} is white—beautiful theory waiting for us Foundation of all is Poisson’s equation: b fn(θ) − E[ b fn+1(θ) | Fn]
  • 23. Understanding Variance Disturbance decompositions P- Mean Flow θn+1 = θn + αn+1f(θn, Wn+1) = θn + αn+1 ¯ f(θn) + e Ξn Perturbative Mean Flow (details for fixed step-size) Two distinctions: 1. We remain in discrete time, and 2. We obtain an expression for the sum of the NOISE: N−1 X n=0 e Ξn = N+1 X n=2 Wn + ∆ b fN {Wn} is white—beautiful theory waiting for us Uniformly bounded—adds O(1/N) in convergence rate. Foundation of all is Poisson’s equation: b fn(θ) − E[ b fn+1(θ) | Fn]
  • 26. Understanding Variance Disturbance decompositions P- Mean Flow θn+1 = θn + αn+1f(θn, Wn+1) = θn + αn+1 ¯ f(θn) + e Ξn Perturbative Mean Flow (details for fixed step-size) Two distinctions: 1. We remain in discrete time, and 2. We obtain an expression for the sum of the NOISE: N−1 X n=0 e Ξn = N+1 X n=2 Wn + ∆ b fN − α N X n=1 Υn {Wn} is white—beautiful theory waiting for us Uniformly bounded—adds O(1/N) in convergence rate. Bad news for bias: Υn = − 1 α b fn+1(θn) − b fn+1(θn−1) Foundation of all is Poisson’s equation: b fn(θ) − E[ b fn+1(θ) | Fn]
  • 29. Understanding Variance P- Mean Flow = ⇒ Polyak-Ruppert Averaging Design Implications θn+1 − θn = α ¯ f(θn) + e Ξn Constant Step-size N−1 X n=0 e Ξn = N+1 X n=2 Wn + ∆ b fN − α N X n=1 Υn E[kθnk2] uniformly bounded for 0 α ≤ α0 some α0 0 7 / 37
  • 30. Understanding Variance P- Mean Flow = ⇒ Polyak-Ruppert Averaging Design Implications θn+1 − θn = α ¯ f(θn) + e Ξn Constant Step-size N−1 X n=0 e Ξn = N+1 X n=2 Wn + ∆ b fN − α N X n=1 Υn E[kθnk2] uniformly bounded for 0 α ≤ α0 some α0 0 Justifies θn = Xn + O(α2) (approximation in mean-square) Xn+1 − Xn = α[A∗ [Xn − θ∗ ] + e Ξn] 7 / 37
  • 31. Understanding Variance P- Mean Flow = ⇒ Polyak-Ruppert Averaging Design Implications θn+1 − θn = α ¯ f(θn) + e Ξn Constant Step-size N−1 X n=0 e Ξn = N+1 X n=2 Wn + ∆ b fN − α N X n=1 Υn E[kθnk2] uniformly bounded for 0 α ≤ α0 some α0 0 Justifies θn = Xn + O(α2) (approximation in mean-square) N−1 X n=0 Xn+1 − Xn = N−1 X n=0 α[A∗ [Xn − θ∗ ] + e Ξn] 7 / 37
  • 32. Understanding Variance P- Mean Flow = ⇒ Polyak-Ruppert Averaging Design Implications θn+1 − θn = α ¯ f(θn) + e Ξn Constant Step-size N−1 X n=0 e Ξn = N+1 X n=2 Wn + ∆ b fN − α N X n=1 Υn E[kθnk2] uniformly bounded for 0 α ≤ α0 some α0 0 Justifies θn = Xn + O(α2) (approximation in mean-square) N−1 X n=0 Xn+1 − Xn = N−1 X n=0 α[A∗ [Xn − θ∗ ] + e Ξn] See a Polyak-Ruppert filter pop out? 7 / 37
  • 33. Understanding Variance P- Mean Flow = ⇒ Polyak-Ruppert Averaging Design Implications θn+1 − θn = α ¯ f(θn) + e Ξn Constant Step-size N−1 X n=0 e Ξn = N+1 X n=2 Wn + ∆ b fN − α N X n=1 Υn 1 N N−1 X n=0 θn+1 − θn = α 1 N N−1 X n=0 A∗ [θn − θ∗ ] + e Ξn + O(α2 ) 8 / 37
  • 34. Understanding Variance P- Mean Flow = ⇒ Polyak-Ruppert Averaging Design Implications θn+1 − θn = α ¯ f(θn) + e Ξn Constant Step-size N−1 X n=0 e Ξn = N+1 X n=2 Wn + ∆ b fN − α N X n=1 Υn 1 N N−1 X n=0 θn+1 − θn = α 1 N N−1 X n=0 A∗ [θn − θ∗ ] + e Ξn + O(α2 ) Polyak-Ruppert Representation θPR N = 1 N N−1 X n=0 θn 1 α 1 N ∆θN = A∗ [θPR N − θ∗ ] + 1 N N−1 X n=0 e Ξn + O(α2 ) 8 / 37
  • 35. Understanding Variance P- Mean Flow = ⇒ Polyak-Ruppert Averaging Design Implications θn+1 − θn = α ¯ f(θn) + e Ξn Constant Step-size Polyak-Ruppert Representation θPR N = 1 N N−1 X n=0 θn 1 α 1 N ∆θN = A∗ [θPR N − θ∗ ] + 1 N N−1 X n=0 e Ξn + O(α2 ) P-R representation = the statistical optimal + statistical annoying: θPR N = θ∗ + − + O((αN)−1 + α2 ) = [A∗ ]−1 1 N N X n=1 Wn = [A∗ ]−1 α N N X n=1 Υn 8 / 37
  • 36. Understanding Variance P- Mean Flow = ⇒ Polyak-Ruppert Averaging Design Implications θn+1 − θn = α ¯ f(θn) + e Ξn Constant Step-size Polyak-Ruppert Representation θPR N = 1 N N−1 X n=0 θn 1 α 1 N ∆θN = A∗ [θPR N − θ∗ ] + 1 N N−1 X n=0 e Ξn + O(α2 ) P-R representation = the statistical optimal + statistical annoying: θPR N = θ∗ + − + O((αN)−1 + α2 ) = [A∗ ]−1 1 N N X n=1 Wn = [A∗ ]−1 α N N X n=1 Υn Annoyance Υn: introduces bias of order O(α) (and variance not understood) Υn = − 1 α b fn+1(θn) − b fn+1(θn−1) ≈ −∂θ b fn+1(θn) · f(θn−1, Wn) 8 / 37
  • 37. Understanding Variance P- Mean Flow = ⇒ Polyak-Ruppert Averaging Design Implications θn+1 − θn = αn+1 ¯ f(θn) + e Ξn Vanishing Step-size Slightly different starting point: N−1 X n=0 1 αn+1 [θn+1 − θn] = N−1 X n=0 ¯ f(θn) + e Ξn = N−1 X n=0 A∗ [θn − θ∗ ] + e Ξn + O N−1 X n=0 α2 n+1 | {z } Assumed bounded 9 / 37
  • 38. Understanding Variance P- Mean Flow = ⇒ Polyak-Ruppert Averaging Design Implications θn+1 − θn = αn+1 ¯ f(θn) + e Ξn Vanishing Step-size Statistical memory =⇒ try αn = g/(1 + n/ne)ρ 1 2 ρ 1 Ignoring transients (and ignoring summation by parts calculation), θPR N = 1 N − N0 N−1 X n=N0 θn = θ∗ + + O(1/N) = 1 N − N0 [A∗ ]−1 N X n=N0+1 Wn Optimality of PR-Averaging Cov(θPR N − θ∗ ) = 1 N − N0 ΣPR N ΣPR N → ΣPR = [A∗]−1ΣW[A∗T ]−1 minimal in strongest sense 9 / 37
  • 39. Understanding Variance P- Mean Flow = ⇒ Polyak-Ruppert Averaging Asymptotic Statistics Optimality of PR-Averaging Cov(θPR N − θ∗ ) = 1 N − N0 ΣPR N ΣPR N → ΣPR = [A∗]−1ΣW[A∗T ]−1 minimal in strongest sense The Central Limit Theorem holds tremendous tool for validation, ... √ Nθ̃N Histogram from 103 independent runs 10 / 37
  • 40. αn = g/(1 + n/ne)ρ 0 5 10 5 10 5 10 0 20 40 60 80 100 0 0 20 40 60 80 100 0 0 20 40 60 80 100 ρ = 1.0 ρ = 0.9 ρ = 0.8 = 2.8 = 5.3 = 11 θk ϑτk τk τk τk τN τN τN Unveiling Dynamics
  • 41–46. Unveiling Dynamics | Two Sources of Error
SA error vs. mean flow: θ_{n+1} = θ_n + α_{n+1}[ f̄(θ_n) + "NOISE" ],   d/dt ϑ_t = f̄(ϑ_t)
1. Asymptotic covariance: with α_n = g/(1 + n/n_e)^ρ,
   (1/√α_n) [θ_n − ϑ_{τ_n}] ≈ N(0, Σ)  without averaging,  where τ_n = ∑_{k=1}^{n} α_k
2. ϑ_t → θ* exponentially fast, but τ_n is increasing slowly, and nonlinear dynamics can complicate gain selection.
What can happen in applications, using α_{n+1} = g/(1 + n/n_e)^ρ:
  θ_n far from θ*: the dynamics are slow, need large g
  θ_n ≈ θ*: best gain is far smaller
11 / 37
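The two error sources can be seen side by side by running SA and the Euler approximation of the mean flow on the same time grid τ_n. The scalar mean field below is an illustrative assumption, not the example from the slides.

```python
# Minimal sketch (illustrative scalar example): compare SA iterates θ_n with the
# mean flow ϑ evaluated at τ_n = Σ_{k≤n} α_k, and look at the gap scaled by √α_n.
import numpy as np

rng = np.random.default_rng(2)
f_bar = lambda theta: -theta                 # assumed mean field, θ* = 0, A* = -1
g, n_e, rho, N = 1.0, 100.0, 0.8, 5_000

theta, vartheta, tau = 10.0, 10.0, 0.0       # same initial condition
gap = np.zeros(N)
for n in range(N):
    alpha = g / (1.0 + n / n_e) ** rho
    theta += alpha * (f_bar(theta) + rng.normal())   # SA with additive noise
    vartheta += alpha * f_bar(vartheta)              # Euler step of dϑ/dt = f̄(ϑ) on the τ-grid
    tau += alpha
    gap[n] = theta - vartheta

alphas = g / (1.0 + np.arange(N // 2, N) / n_e) ** rho
print("final τ_N:", round(tau, 2))
print("std of (θ_n - ϑ_{τ_n})/√α_n over the last half:", np.std(gap[N // 2:] / np.sqrt(alphas)))
```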
  • 47–52. Unveiling Dynamics | Two Sources of Error. Example: SGD, α_n = g/(1 + n/n_e)^ρ
Stochastic gradient descent: L̄(θ) = E[L(θ, Φ_n)],  f̄(θ) = −∇L̄(θ),
  θ_{n+1} = θ_n − (g/n^ρ) ∇L(θ_n, Φ_{n+1}),   mean flow  d/dt ϑ_t = f̄(ϑ_t)
[figure: the scalar objective L̄(θ) and f̄(θ) on −20 ≤ θ ≤ 20, with labeled slopes −1, −4, and −0.34]
ODE bound, using ρ = 1 and setting n_e = 1:  |ϑ_t − θ*| ≤ |ϑ_0 − θ*| e^{−0.34 t}, so
  |ϑ_{τ_n} − θ*| ≤ |ϑ_0 − θ*| e^{0.34 g} n^{−0.34 g}
Need g ≥ 3 to kill the deterministic behavior, but g* = −[A*]^{-1} = 1/4 gives optimal variance.
[figure: dynamics for g* = 1/4, θ_k vs ϑ_{τ_k} for ρ = 1.0, 0.9, 0.8 (τ_N = 2.8, 5.3, 11); τ_N ≈ 3 for N = one million]
CLT approximation: rapid for θ_0 = 0, slow for θ_0 = 100.
12 / 37
  • 53. Unveiling Dynamics | Two Sources of Error. Example: SGD, α_n = g/(1 + n/n_e)^ρ
ODE bound, using ρ = 1 and setting n_e = 1:  |ϑ_{τ_n} − θ*| ≤ |ϑ_0 − θ*| e^{0.34 g} n^{−0.34 g}
Polyak-Ruppert to the rescue:
[figure: histograms of √N θ̃_N from Polyak-Ruppert averaging for ρ = 1.0, 0.9, 0.8, with big gain g = 1/0.34 and small gain g = 1/4 (σ_θ = 5/2)]
12 / 37
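The gain-selection tension can be seen in a toy scalar SGD experiment. The piecewise gradient below is a hypothetical stand-in (slope 4 near θ* = 0 and slope 0.34 far away), not the objective plotted on the slide, and the gains are illustrative. With g = 1/4 the deterministic transient from θ_0 = 100 is still large after 10⁵ samples, so both the last iterate and the PR average are poor; with g = 3 the transient is gone and PR averaging handles the remaining variance.

```python
# Hypothetical scalar SGD sketch in the spirit of slides 47–53 (illustrative
# piecewise gradient, not the loss on the slide).
import numpy as np

rng = np.random.default_rng(3)

def grad_Lbar(theta):
    # ∇L̄(θ): slope 4 for |θ| ≤ 1 (so A* = -4, g* = 1/4), slope 0.34 beyond
    if abs(theta) <= 1.0:
        return 4.0 * theta
    return np.sign(theta) * (4.0 + 0.34 * (abs(theta) - 1.0))

N = 100_000
for g in (3.0, 0.25):
    theta, running_sum = 100.0, 0.0          # poor initial condition θ_0 = 100
    for n in range(1, N + 1):
        alpha = g / n                         # ρ = 1, n_e = 1
        theta -= alpha * (grad_Lbar(theta) + rng.normal())   # noisy gradient
        running_sum += theta
    print(f"g = {g:4.2f}: last iterate = {theta:9.4f},  PR average = {running_sum / N:9.4f}")
```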
  • 54–55. Unveiling Dynamics | Two Sources of Error. Example: Tabular Q-Learning
g ≥ 1/(1 − γ) required, if using ρ = 1*.  Stay tuned!
Step-size: α_n = g / n(x, u)  (per state-action counts)
[figure: maximum Bellman error of Q_n over one million samples, for Q-learning with g = g_AD ("Awesome performance! (?)") and with g = 1/(1 − γ); generic tabular Q-learning example, discount factor γ]
*See Devraj & M 2017 (extension in Wainwright 2019); see also Szepesvári 1997
13 / 37
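For concreteness, here is a minimal asynchronous tabular Q-learning sketch using the per-state-action step-size α_n(x, u) = g/n(x, u) from the slide. The random MDP, the exploratory policy, the gain g = 1/(1 − γ), and the discount factor are all illustrative assumptions, not the example used to generate the plot above.

```python
# Minimal tabular Q-learning sketch on a hypothetical random MDP.
import numpy as np

rng = np.random.default_rng(4)
n_states, n_actions, gamma = 6, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # transition kernel
R = rng.normal(size=(n_states, n_actions))                          # rewards

g = 1.0 / (1.0 - gamma)              # gain suggested on the slide for ρ = 1
Q = np.zeros((n_states, n_actions))
counts = np.zeros((n_states, n_actions))

x = 0
for _ in range(200_000):
    u = rng.integers(n_actions)                      # exploratory (uniform) policy
    x_next = rng.choice(n_states, p=P[x, u])
    counts[x, u] += 1
    alpha = g / counts[x, u]                         # α_n(x, u) = g / n(x, u)
    td = R[x, u] + gamma * Q[x_next].max() - Q[x, u]
    Q[x, u] += alpha * td
    x = x_next

# Bellman error as a sanity check (cf. the max Bellman error curve on the slide)
BE = R + gamma * P @ Q.max(axis=1) - Q
print("max Bellman error:", np.abs(BE).max())
```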
  • 56. Zap
  • 57–59. Introduction to Zap | Newton-Raphson Flow: Taming Nonlinear Dynamics
What if stability of d/dt ϑ = f̄(ϑ) is unknown? [typically the case in RL]
Newton-Raphson flow [Smale 1976]. Idea: interpret f̄ as the "parameter": d/dt f̄_t = V(f̄_t), and design V.
Simplest choice: d/dt f̄_t = −f̄_t, i.e. d/dt f̄(ϑ_t) = −f̄(ϑ_t), giving f̄(ϑ_t) = f̄(ϑ_0) e^{−t}  (linear dynamics).
Equivalently, d/dt ϑ_t = −[A(ϑ_t)]^{-1} f̄(ϑ_t).
SA translation: Zap Stochastic Approximation.
15 / 37 (header) 14 / 37
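A sketch of the Newton-Raphson flow on an illustrative 2-d vector field (an assumption, not from the lecture): Euler integration of dϑ/dt = −[A(ϑ)]^{-1} f̄(ϑ) drives ‖f̄(ϑ_t)‖ to zero like e^{−t}, even though the ordinary mean flow dϑ/dt = f̄(ϑ) is not stable for this f̄.

```python
# Minimal sketch (illustrative 2-d vector field): Euler integration of the
# Newton-Raphson flow dϑ/dt = -[A(ϑ)]^{-1} f̄(ϑ).
import numpy as np

def f_bar(v):                        # assumed mean field, unique root at (1, -1)
    x, y = v
    return np.array([(x - 1.0) ** 3 + (x - 1.0) + 0.3 * (y + 1.0),
                     (y + 1.0) ** 3 + (y + 1.0)])

def jac(v):                          # A(ϑ) = ∂f̄/∂ϑ, invertible everywhere
    x, y = v
    return np.array([[3.0 * (x - 1.0) ** 2 + 1.0, 0.3],
                     [0.0, 3.0 * (y + 1.0) ** 2 + 1.0]])

dt, T = 0.001, 5.0
v0 = np.array([4.0, 2.0])
v = v0.copy()
for _ in range(int(T / dt)):
    v = v - dt * np.linalg.solve(jac(v), f_bar(v))    # Euler step of the NR flow

print("||f_bar(v_T)||          =", np.linalg.norm(f_bar(v)))
print("e^{-T} * ||f_bar(v_0)|| =", np.exp(-T) * np.linalg.norm(f_bar(v0)))
```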
  • 60–64. Introduction to Zap | Newton-Raphson Flow: Zap Algorithm
Designed to emulate the Newton-Raphson flow  d/dt ϑ_t = −[A(ϑ_t)]^{-1} f̄(ϑ_t),   A(θ) = ∂f̄(θ)/∂θ
Zap-SA (designed to emulate deterministic Newton-Raphson):
  θ_{n+1} = θ_n + α_{n+1} [−Â_{n+1}]^{-1} f(θ_n, ξ_{n+1})
  Â_{n+1} = Â_n + β_{n+1}(A_{n+1} − Â_n),   A_{n+1} = ∂_θ f(θ_n, ξ_{n+1})
Â_{n+1} ≈ A(θ_n) := ∂_θ f̄(θ_n) requires high gain: β_n/α_n → ∞ as n → ∞.
Can use α_n = 1/n (without averaging). Numerics to come use this choice, with β_n = (1/n)^ρ, ρ ∈ (0.5, 1).
Stability? Virtually universal. Optimal variance, too! Based on ancient theory from Ruppert & Polyak [80, 81, 79]
15 / 37
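A minimal Zap-SA sketch, under the same illustrative assumptions as the Newton-Raphson sketch above (the vector field, noise level, and gains are all hypothetical). Since the noise enters additively here, the sampled Jacobian ∂_θ f(θ_n, ξ_{n+1}) coincides with ∂_θ f̄(θ_n).

```python
# Minimal Zap-SA sketch (illustrative assumptions): α_n = 1/n, β_n = (1/n)^ρ,
# so β_n/α_n → ∞ as required for the matrix-gain estimate.
import numpy as np

rng = np.random.default_rng(5)

def f_bar(v):                        # assumed mean field, unique root at (1, -1)
    x, y = v
    return np.array([(x - 1.0) ** 3 + (x - 1.0) + 0.3 * (y + 1.0),
                     (y + 1.0) ** 3 + (y + 1.0)])

def jac(v):                          # ∂_θ f(θ, ξ) = ∂_θ f̄(θ) with additive noise
    x, y = v
    return np.array([[3.0 * (x - 1.0) ** 2 + 1.0, 0.3],
                     [0.0, 3.0 * (y + 1.0) ** 2 + 1.0]])

rho, N = 0.85, 50_000
theta, A_hat = np.array([4.0, 2.0]), np.eye(2)
for n in range(1, N + 1):
    alpha, beta = 1.0 / n, (1.0 / n) ** rho
    A_hat = A_hat + beta * (jac(theta) - A_hat)           # fast time-scale: Â_{n+1}
    noisy_f = f_bar(theta) + 0.5 * rng.normal(size=2)     # f(θ_n, ξ_{n+1})
    theta = theta + alpha * np.linalg.solve(-A_hat, noisy_f)   # slow time-scale
print("Zap-SA estimate:", theta, "  (true root at (1, -1))")
```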
  • 65–71. Conclusions | Thank you Lyapunov and Polyak!
[figure: "Crank up gain: tame transients" (θ_k vs ϑ_{τ_k}, ρ = 0.8, τ_k up to 100); "Average: tame variance" (histogram of √(N − N_0) θ̃^PR_N, outcomes from M repeated runs)]
Steps to a Successful Design
1. Design f̄ so that the mean flow d/dt ϑ = f̄(ϑ) is GAS, with Hurwitz A*
2. Design the step-size: α_n = g/(1 + n/n_e)^ρ, 1/2 < ρ < 1
3. Perform PR averaging: θ^PR_N = (1/(N − N_0)) ∑_{n=N_0+1}^{N} θ_n
4. Repeat! Obtain the histogram { √(N − N_0) θ̃^PR,(m)_N : 1 ≤ m ≤ M } with θ_0^(m) widely dispersed and N relatively small
What you will learn: how big N needs to be for a meaningful estimate; approximate confidence bounds.
Heuristic: θ^PR_N ≈ θ* + (1/√(N − N_0)) Z,   Z ∼ N(0, Σ^PR)
Next steps: a flash-crash control course! Applications to RL.
16 / 37
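The four-step recipe can be sketched end to end on the illustrative 2-d linear model used earlier (A*, θ*, gains, and run lengths below are assumptions): run M short experiments from widely dispersed initial conditions, PR-average each, and inspect the spread of √(N − N_0) θ̃^PR_N.

```python
# Sketch of the four-step recipe on an illustrative 2-d linear model.
import numpy as np

rng = np.random.default_rng(6)
A_star = np.array([[-1.0, 0.5], [0.0, -2.0]])        # Step 1: GAS mean flow, Hurwitz A*
theta_star = np.array([1.0, -1.0])
g, n_e, rho = 1.0, 50.0, 0.8                         # Step 2: α_n = g/(1 + n/n_e)^ρ, ρ ∈ (1/2, 1)
N, N0, M = 10_000, 2_000, 200                        # relatively small N, burn-in N0, M runs

scaled_errors = np.zeros((M, 2))
for m in range(M):
    theta = rng.uniform(-50.0, 50.0, size=2)         # widely dispersed initial condition
    tail_sum = np.zeros(2)
    for n in range(1, N + 1):
        alpha = g / (1.0 + n / n_e) ** rho
        theta = theta + alpha * (A_star @ (theta - theta_star) + rng.normal(size=2))
        if n > N0:
            tail_sum += theta
    theta_pr = tail_sum / (N - N0)                   # Step 3: PR averaging over n > N0
    scaled_errors[m] = np.sqrt(N - N0) * (theta_pr - theta_star)

# Step 4: the histogram spread gives approximate confidence bounds, via the
# heuristic θ^PR_N ≈ θ* + Z/sqrt(N - N0), Z ~ N(0, Σ^PR).
print("empirical covariance of sqrt(N - N0)*θ̃^PR_N:\n", np.cov(scaled_errors.T))
```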
  • 72. References
[book covers: Control Techniques for Complex Networks (Sean Meyn) and Markov Chains and Stochastic Stability (S. P. Meyn and R. L. Tweedie)]
17 / 37
  • 73. References Control Background I [1] K. J. Åström and R. M. Murray. Feedback Systems: An Introduction for Scientists and Engineers. Princeton University Press, USA, 2008 (recent edition on-line). [2] K. J. Åström and B. Wittenmark. Adaptive Control. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2nd edition, 1994. [3] A. Fradkov and B. T. Polyak. Adaptive and robust control in the USSR. IFAC–PapersOnLine, 53(2):1373–1378, 2020. 21st IFAC World Congress. [4] M. Krstic, P. V. Kokotovic, and I. Kanellakopoulos. Nonlinear and adaptive control design. John Wiley & Sons, Inc., 1995. [5] K. J. Åström. Theory and applications of adaptive control—a survey. Automatica, 19(5):471–486, 1983. [6] K. J. Åström. Adaptive control around 1960. IEEE Control Systems Magazine, 16(3):44–49, 1996. [7] L. Ljung. Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, 22(4):551–575, 1977. 18 / 37
  • 74. References Control Background II [8] N. Matni, A. Proutiere, A. Rantzer, and S. Tu. From self-tuning regulators to reinforcement learning and back again. In Proc. of the IEEE Conf. on Dec. and Control, pages 3724–3740, 2019. 19 / 37
  • 75. References RL Background I [9] S. Meyn. Control Systems and Reinforcement Learning. Cambridge University Press, Cambridge, 2021. [10] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press. On-line edition at http://www.cs.ualberta.ca/~sutton/book/the-book.html, Cambridge, MA, 2nd edition, 2018. [11] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010. [12] D. P. Bertsekas. Reinforcement learning and optimal control. Athena Scientific, Belmont, MA, 2019. [13] T. Lattimore and C. Szepesvari. Bandit Algorithms. Cambridge University Press, 2020. [14] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn., 3(1):9–44, 1988. [15] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992. [16] J. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994. 20 / 37
  • 76. References RL Background II [17] T. Jaakkola, M. Jordan, and S. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185–1201, 1994. [18] B. Van Roy. Learning and Value Function Approximation in Complex Decision Processes. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, 1998. [19] J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Mach. Learn., 22(1-3):59–94, 1996. [20] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997. [21] J. N. Tsitsiklis and B. V. Roy. Average cost temporal-difference learning. Automatica, 35(11):1799–1808, 1999. [22] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999. [23] D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning. Discrete Event Dynamic Systems: Theory and Applications, 16(2):207–239, 2006. 21 / 37
  • 77. References RL Background III [24] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Mach. Learn., 22(1-3):33–57, 1996. [25] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn., 49(2-3):233–246, 2002. [26] A. Nedic and D. Bertsekas. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dyn. Systems: Theory and Appl., 13(1-2):79–110, 2003. [27] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th Internat. Conf. on Neural Info. Proc. Systems, 1064–1070. MIT Press, 1997. [28] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25, 2003. [29] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In Advances in Neural Information Processing Systems, 2011. [30] D. Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana. Feature selection for neuro-dynamic programming. In F. Lewis, editor, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. Wiley, 2011. 22 / 37
  • 78. References RL Background IV [31] A. M. Devraj and S. P. Meyn. Zap Q-learning. In Proc. of the Intl. Conference on Neural Information Processing Systems, pages 2232–2241, 2017. [32] S. Chen, A. M. Devraj, F. Lu, A. Busic, and S. Meyn. Zap Q-Learning with nonlinear function approximation. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, and arXiv e-prints 1910.05405, volume 33, pages 16879–16890, 2020. [33] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007. See last chapter on simulation and average-cost TD learning DQN: [34] M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge, and L. Torgo, editors, Machine Learning: ECML 2005, pages 317–328, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg. [35] S. Lange, T. Gabel, and M. Riedmiller. Batch reinforcement learning. In Reinforcement learning, pages 45–73. Springer, 2012. [36] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller. Playing Atari with deep reinforcement learning. ArXiv, abs/1312.5602, 2013. 23 / 37
  • 79. References RL Background V [37] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015. [38] O. Anschel, N. Baram, and N. Shimkin. Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In Proc. of ICML, pages 176–185. JMLR.org, 2017. Actor Critic / Policy Gradient [39] P. J. Schweitzer. Perturbation theory and finite Markov chains. J. Appl. Prob., 5:401–403, 1968. [40] C. D. Meyer, Jr. The role of the group generalized inverse in the theory of finite Markov chains. SIAM Review, 17(3):443–464, 1975. [41] P. W. Glynn. Stochastic approximation for Monte Carlo optimization. In Proceedings of the 18th conference on Winter simulation, pages 356–365, 1986. [42] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992. 24 / 37
  • 80. References RL Background VI [43] T. Jaakkola, S. P. Singh, and M. I. Jordan. Reinforcement learning algorithm for partially observable Markov decision problems. In Advances in neural information processing systems, pages 345–352, 1995. [44] X.-R. Cao and H.-F. Chen. Perturbation realization, potentials, and sensitivity analysis of Markov processes. IEEE Transactions on Automatic Control, 42(10):1382–1393, Oct 1997. [45] P. Marbach and J. N. Tsitsiklis. Simulation-based optimization of Markov reward processes. IEEE Trans. Automat. Control, 46(2):191–209, 2001. [46] V. Konda. Actor-critic algorithms. PhD thesis, Massachusetts Institute of Technology, 2002. [47] V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000. [48] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000. [49] P. Marbach and J. N. Tsitsiklis. Simulation-based optimization of Markov reward processes. IEEE Trans. Automat. Control, 46(2):191–209, 2001. 25 / 37
  • 81. References RL Background VII [50] S. M. Kakade. A natural policy gradient. In Advances in neural information processing systems, pages 1531–1538, 2002. [51] H. Mania, A. Guy, and B. Recht. Simple random search provides a competitive approach to reinforcement learning. In Advances in Neural Information Processing Systems, pages 1800–1809, 2018. MDPs, LPs and Convex Q: [52] A. S. Manne. Linear programming and sequential decisions. Management Sci., 6(3):259–267, 1960. [53] C. Derman. Finite State Markovian Decision Processes, volume 67 of Mathematics in Science and Engineering. Academic Press, Inc., 1970. [54] V. S. Borkar. Convex analytic methods in Markov decision processes. In Handbook of Markov decision processes, volume 40 of Internat. Ser. Oper. Res. Management Sci., pages 347–375. Kluwer Acad. Publ., Boston, MA, 2002. [55] D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Res., 51(6):850–865, 2003. 26 / 37
  • 82. References RL Background VIII [56] D. P. de Farias and B. Van Roy. A cost-shaping linear program for average-cost approximate dynamic programming with performance guarantees. Math. Oper. Res., 31(3):597–620, 2006. [57] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In Proc. of the IEEE Conf. on Dec. and Control, pages 3598–3605, Dec. 2009. [58] J. Bas Serrano, S. Curi, A. Krause, and G. Neu. Logistic Q-learning. In A. Banerjee and K. Fukumizu, editors, Proc. of The Intl. Conference on Artificial Intelligence and Statistics, volume 130, pages 3610–3618, 13–15 Apr 2021. [59] F. Lu, P. G. Mehta, S. P. Meyn, and G. Neu. Convex Q-learning. In American Control Conf., pages 4749–4756. IEEE, 2021. [60] F. Lu, P. G. Mehta, S. P. Meyn, and G. Neu. Convex analytic theory for convex Q-learning. In Conference on Decision and Control–to appear. IEEE, 2022. Gator Nation: [61] A. M. Devraj, A. Bušić, and S. Meyn. Fundamental design principles for reinforcement learning algorithms. In K. G. Vamvoudakis, Y. Wan, F. L. Lewis, and D. Cansever, editors, Handbook on Reinforcement Learning and Control, Studies in Systems, Decision and Control series (SSDC, volume 325). Springer, 2021. 27 / 37
  • 83. References RL Background IX [62] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv , July 2017 (extended version of NIPS 2017). [63] A. M. Devraj. Reinforcement Learning Design with Optimal Learning Rate. PhD thesis, University of Florida, 2019. [64] A. M. Devraj and S. P. Meyn. Q-learning with uniformly bounded variance: Large discounting is not a barrier to fast learning. IEEE Trans Auto Control (and arXiv:2002.10301), 2021. [65] A. M. Devraj, A. Bušić, and S. Meyn. On matrix momentum stochastic approximation and applications to Q-learning. In Allerton Conference on Communication, Control, and Computing, pages 749–756, Sep 2019. 28 / 37
  • 84. References Stochastic Miscellanea I [66] S. Asmussen and P. W. Glynn. Stochastic Simulation: Algorithms and Analysis, volume 57 of Stochastic Modelling and Applied Probability. Springer-Verlag, New York, 2007. [67] P. W. Glynn and S. P. Meyn. A Liapounov bound for solutions of the Poisson equation. Ann. Probab., 24(2):916–931, 1996. [68] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge University Press, Cambridge, second edition, 2009. Published in the Cambridge Mathematical Library. [69] R. Douc, E. Moulines, P. Priouret, and P. Soulier. Markov Chains. Springer, 2018. 29 / 37
  • 85. References Stochastic Approximation I [70] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency and Cambridge University Press, Delhi, India Cambridge, UK, 2008. [71] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1990. Translated from the French by Stephen S. Wilson. [72] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000. [73] V. Borkar, S. Chen, A. Devraj, I. Kontoyiannis, and S. Meyn. The ODE method for asymptotic statistics in stochastic approximation and reinforcement learning. arXiv e-prints:2110.14427, pages 1–50, 2021. [74] M. Benaı̈m. Dynamics of stochastic approximation algorithms. In Séminaire de Probabilités, XXXIII, pages 1–68. Springer, Berlin, 1999. [75] V. Borkar and S. P. Meyn. Oja’s algorithm for graph clustering, Markov spectral decomposition, and risk sensitive control. Automatica, 48(10):2512–2519, 2012. [76] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. Ann. Math. Statist., 23(3):462–466, 09 1952. 30 / 37
  • 86. References Stochastic Approximation II [77] J. C. Spall. Introduction to stochastic search and optimization: estimation, simulation, and control. John Wiley & Sons, 2003. [78] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure. The Annals of Statistics, 13(1):236–245, 1985. [79] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro processes. Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research and Industrial Engineering, Ithaca, NY, 1988. [80] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i telemekhanika, pages 98–107, 1990 (in Russian). Translated in Automat. Remote Control, 51 1991. [81] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, 1992. [82] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. Ann. Appl. Probab., 14(2):796–819, 2004. [83] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems 24, pages 451–459. Curran Associates, Inc., 2011. 31 / 37
  • 87. References Stochastic Approximation III [84] W. Mou, C. Junchi Li, M. J. Wainwright, P. L. Bartlett, and M. I. Jordan. On Linear Stochastic Approximation: Fine-grained Polyak-Ruppert and Non-Asymptotic Concentration. arXiv e-prints, page arXiv:2004.04719, Apr. 2020. 32 / 37
  • 88. References Optimization and ODEs I [85] W. Su, S. Boyd, and E. Candes. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in neural information processing systems, pages 2510–2518, 2014. [86] B. Shi, S. S. Du, W. Su, and M. I. Jordan. Acceleration via symplectic discretization of high-resolution differential equations. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 5744–5752. Curran Associates, Inc., 2019. [87] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964. [88] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, 1983. 33 / 37
  • 89. References QSA and Extremum Seeking Control I [89] C. K. Lauand and S. Meyn. Extremely fast convergence rates for extremum seeking control with Polyak-Ruppert averaging. arXiv 2206.00814, 2022. [90] C. K. Lauand and S. Meyn. Markovian foundations for quasi stochastic approximation with applications to extremum seeking control, 2022. [91] B. Lapeyre, G. Pages, and K. Sab. Sequences with low discrepancy generalisation and application to Robbins-Monro algorithm. Statistics, 21(2):251–272, 1990. [92] S. Laruelle and G. Pagès. Stochastic approximation with averaging innovation applied to finance. Monte Carlo Methods and Applications, 18(1):1–51, 2012. [93] S. Shirodkar and S. Meyn. Quasi stochastic approximation. In Proc. of the 2011 American Control Conference (ACC), pages 2429–2435, July 2011. [94] S. Chen, A. Devraj, A. Bernstein, and S. Meyn. Revisiting the ODE method for recursive algorithms: Fast convergence using quasi stochastic approximation. Journal of Systems Science and Complexity, 34(5):1681–1702, 2021. [95] Y. Chen, A. Bernstein, A. Devraj, and S. Meyn. Model-Free Primal-Dual Methods for Network Optimization with Application to Real-Time Optimal Power Flow. In Proc. of the American Control Conf., pages 3140–3147, Sept. 2019. 34 / 37
  • 90. References QSA and Extremum Seeking Control II [96] S. Bhatnagar and V. S. Borkar. Multiscale chaotic spsa and smoothed functional algorithms for simulation optimization. Simulation, 79(10):568–580, 2003. [97] S. Bhatnagar, M. C. Fu, S. I. Marcus, and I.-J. Wang. Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences. ACM Transactions on Modeling and Computer Simulation (TOMACS), 13(2):180–209, 2003. [98] M. Le Blanc. Sur l’electrification des chemins de fer au moyen de courants alternatifs de frequence elevee [On the electrification of railways by means of alternating currents of high frequency]. Revue Generale de l’Electricite, 12(8):275–277, 1922. [99] Y. Tan, W. H. Moase, C. Manzie, D. Nešić, and I. M. Y. Mareels. Extremum seeking from 1922 to 2010. In Proceedings of the 29th Chinese Control Conference, pages 14–26, July 2010. [100] P. F. Blackman. Extremum-seeking regulators. In An Exposition of Adaptive Control. Macmillan, 1962. [101] J. Sternby. Adaptive control of extremum systems. In H. Unbehauen, editor, Methods and Applications in Adaptive Control, pages 151–160, Berlin, Heidelberg, 1980. Springer Berlin Heidelberg. 35 / 37
  • 91. References QSA and Extremum Seeking Control III [102] J. Sternby. Extremum control systems–an area for adaptive control? In Joint Automatic Control Conference, number 17, page 8, 1980. [103] K. B. Ariyur and M. Krstić. Real Time Optimization by Extremum Seeking Control. John Wiley & Sons, Inc., New York, NY, USA, 2003. [104] M. Krstić and H.-H. Wang. Stability of extremum seeking feedback for general nonlinear dynamic systems. Automatica, 36(4):595 – 601, 2000. [105] S. Liu and M. Krstic. Introduction to extremum seeking. In Stochastic Averaging and Stochastic Extremum Seeking, Communications and Control Engineering. Springer, London, 2012. [106] O. Trollberg and E. W. Jacobsen. On the convergence rate of extremum seeking control. In European Control Conference (ECC), pages 2115–2120. 2014. [107] Y. Bugeaud. Linear forms in logarithms and applications, volume 28 of IRMA Lectures in Mathematics and Theoretical Physics. EMS Press, 2018. [108] G. Wüstholz, editor. A Panorama of Number Theory—Or—The View from Baker’s Garden. Cambridge University Press, 2002. 36 / 37
  • 92. References Selected Applications I [109] N. S. Raman, A. M. Devraj, P. Barooah, and S. P. Meyn. Reinforcement learning for control of building HVAC systems. In American Control Conference, July 2020. [110] K. Mason and S. Grijalva. A review of reinforcement learning for autonomous building energy management. arXiv.org, 2019. arXiv:1903.05196. 37 / 37