Part 2: Variance Matters
Sean Meyn
Department of Electrical and Computer Engineering University of Florida
Inria International Chair Inria, Paris
Thanks to our sponsors: NSF and ARO
Part 2: Variance Matters
Outline
1 Stochastic Approximation
2 Understanding Variance
3 Unveiling Dynamics
4 Introduction to Zap
5 Conclusions
6 References
Part 2: Variance Matters Resources
ODE Method (used here with a different meaning than in the 1970s)
CS&RL, Chapter 8
The ODE Method for Asymptotic Statistics in Stochastic
Approximation and Reinforcement Learning [73]
Control Techniques for Complex Networks, Sean Meyn. Pre-publication version for on-line viewing; monograph available for purchase at your favorite retailer. More information available at http://www.cambridge.org/us/catalogue/catalogue.asp?isbn=9780521884419
Markov Chains and Stochastic Stability, S. P. Meyn and R. L. Tweedie. August 2008 pre-publication version for on-line viewing; monograph to appear February 2009.
[Cover formulas: π(f) < ∞,   ∆V(x) ≤ −f(x) + b I_C(x),   ‖P^n(x, ·) − π‖_f → 0,   sup_C E_x[S_{τ_C}(f)] < ∞]
1 / 37
Special Thanks
My interest in RL and SA began during my first sabbatical, at IISc with Vivek Borkar
Many other heroes along the way since
Metivier
Van Roy Tsitsiklis
Bertsekas
Konda
Priouret
Kushner
Surana
Huang
Ruppert Polyak
Szepesvari
Benveniste
Yin
Nedic Yu
Colombino
Dall’Anese
Bernstein
Chen & Chen
Barooah
Raman
Max
P.E. Caines
Ioannis Kontoyiannis Ana Busic Eric Moulines Adithya Devraj Many Others!
[Histograms of outcomes over (−100, 100) for Zap with gains 2, 5, 10]
Block diagram: θn+1 = θn + an+1 f(θn, ξn+1), with excitation ξ driving the estimates θn
Stochastic Approximation
Stochastic Approximation   ODE Method
What is Stochastic Approximation?   f̄(θ) = E[f(θ, ξ)]
No different than the last lecture: find a solution to f̄(θ∗) = 0
Example: f̄(θ) = −∇Γ(θ) for optimization. In RL the function f̄ is typically not of this form.
ODE algorithm:   d/dt ϑt = f̄(ϑt)
If stable: ϑt → θ∗ and f̄(ϑt) → f̄(θ∗) = 0.
Euler approximation:   θn+1 = θn + αn+1 f̄(θn)     Terminology: αn is usually called the step-size
Stochastic Approximation
θn+1 = θn + αn+1 f(θn, ξn+1), with ξn → ξ in dist.
     = θn + αn+1 { f̄(θn) + “NOISE” }
Under very general conditions: the ODE, the Euler approximation, and SA are all convergent to θ∗
[Robbins and Monro, 1951]; see Borkar’s monograph [70]
3 / 37
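Not on the slides: a minimal numerical sketch comparing the deterministic Euler recursion for the mean flow with the SA recursion, for a toy scalar problem with f̄(θ) = −(θ − 1). The noise model, step-size, and all constants are illustrative assumptions, not the lecture's example.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(theta, xi):
    # Noisy observation with mean field fbar(theta) = -(theta - 1), root theta* = 1
    return -(theta - 1.0) + xi

N = 5000
theta_sa = 10.0      # stochastic-approximation iterate
theta_euler = 10.0   # Euler approximation of the mean flow d/dt = fbar
for n in range(1, N + 1):
    a = 1.0 / n                          # step-size alpha_n = 1/n (illustrative)
    xi = rng.normal(scale=2.0)           # i.i.d. stand-in for the "NOISE" xi_{n+1}
    theta_sa += a * f(theta_sa, xi)      # theta_{n+1} = theta_n + alpha_{n+1} f(theta_n, xi_{n+1})
    theta_euler += a * (-(theta_euler - 1.0))

print(theta_sa, theta_euler)             # both approach theta* = 1
```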
Stochastic Approximation   ODE Method
Questions for Algorithm Design   f̄(θ) = E[f(θ, ξ)]
Stochastic Approximation
θn+1 = θn + αn+1 f(θn, ξn+1) = θn + αn+1 { f̄(θn) + Ξ̃n }   (NOISE)
Questions
Can we duplicate our success with QSA?
Is there a Perturbative Mean Flow?
Implications for transient and asymptotic behavior?
What are the implications for design?
Go beyond the last lecture: what if the mean flow d/dt ϑ = f̄(ϑ) is not stable, or “ill conditioned”?
4 / 37
Control Techniques for Complex Networks, page 528
Example 11.5.1. LSTD for the M/M/1 queue (minimum-variance algorithm)
Nonsense after One Million Samples
[Plots of the estimates θ1 and θ2 over 5 × 10^6 samples, for loads ρ = 0.8 and ρ = 0.9; h^θ(x) = θ1 x + θ2 x², with θ∗_1 = θ∗_2]
Understanding Variance
Understanding Variance   Disturbance decompositions
Perturbative Mean Flow   θn+1 = θn + αn+1 f(θn, Wn+1)
Prerequisites as before
Mean flow d/dt ϑ = f̄(ϑ) globally asymptotically stable, so ϑt → θ∗ from each initial condition
A tiny bit more is needed to ensure {θn} is bounded – see CSRL for details!
A∗ = ∂θ f̄(θ∗) Hurwitz: eigenvalues in the left half-plane of C
And a replacement for our almost periodic exploration:
ξn+1 = G0(Φn+1), with Φ a “nice Markov chain”
Assume here: finite state space and irreducible
Please don’t rule out periodicity! Remember our luck with QSA!
5 / 37
Understanding Variance   Disturbance decompositions
P-Mean Flow   θn+1 = θn + αn+1 f(θn, Wn+1) = θn + αn+1 { f̄(θn) + Ξ̃n }
Perturbative Mean Flow (details for fixed step-size)
Two distinctions: 1. We remain in discrete time, and 2. We obtain an expression for the sum of the NOISE:
∑_{n=0}^{N−1} Ξ̃n = ∑_{n=2}^{N+1} Wn + ∆f̂_N − α ∑_{n=1}^{N} Υn
{Wn} is white—beautiful theory waiting for us
∆f̂_N is uniformly bounded—adds O(1/N) in convergence rate
Bad news for bias: Υn = −(1/α) [ f̂_{n+1}(θn) − f̂_{n+1}(θ_{n−1}) ]
Foundation of all is Poisson’s equation: Ξ̃_{n−1} = { f̂_n(θ) − E[ f̂_{n+1}(θ) | F_n ] } evaluated at θ = θ_{n−1}
6 / 37
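Since the whole decomposition rests on Poisson's equation, here is a small sketch of how it can be solved numerically for a finite, irreducible Markov chain, as the prerequisites assume. The 3-state transition matrix and the function f below are purely illustrative, and the fundamental-matrix formula is one standard way to obtain a solution.

```python
import numpy as np

# Illustrative 3-state irreducible transition matrix and function (assumptions)
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])
f = np.array([1.0, 4.0, 9.0])

# Invariant distribution pi: left eigenvector of P for eigenvalue 1
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()

fbar = pi @ f                         # steady-state mean pi(f)
# Poisson's equation: fhat - P fhat = f - pi(f); solve via the fundamental matrix
Z = np.linalg.inv(np.eye(3) - P + np.outer(np.ones(3), pi))
fhat = Z @ (f - fbar)

# Check: fhat(x) - E[fhat(X_{n+1}) | X_n = x] = f(x) - pi(f) for every state x
print(np.allclose(fhat - P @ fhat, f - fbar))
```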
Understanding Variance   P-Mean Flow ⇒ Polyak-Ruppert Averaging
Design Implications   θn+1 − θn = α { f̄(θn) + Ξ̃n }   Constant Step-size
∑_{n=0}^{N−1} Ξ̃n = ∑_{n=2}^{N+1} Wn + ∆f̂_N − α ∑_{n=1}^{N} Υn
E[‖θn‖²] uniformly bounded for 0 < α ≤ α0, for some α0 > 0
Justifies θn = Xn + O(α²) (approximation in mean-square), where
Xn+1 − Xn = α [ A∗(Xn − θ∗) + Ξ̃n ]
Summing both sides: ∑_{n=0}^{N−1} (Xn+1 − Xn) = ∑_{n=0}^{N−1} α [ A∗(Xn − θ∗) + Ξ̃n ]
See a Polyak-Ruppert filter pop out?
7 / 37
Understanding Variance   P-Mean Flow ⇒ Polyak-Ruppert Averaging
Design Implications   θn+1 − θn = α { f̄(θn) + Ξ̃n }   Constant Step-size
∑_{n=0}^{N−1} Ξ̃n = ∑_{n=2}^{N+1} Wn + ∆f̂_N − α ∑_{n=1}^{N} Υn
(1/N) ∑_{n=0}^{N−1} (θn+1 − θn) = α { (1/N) ∑_{n=0}^{N−1} [ A∗(θn − θ∗) + Ξ̃n ] + O(α²) }
Polyak-Ruppert Representation   θ^PR_N = (1/N) ∑_{n=0}^{N−1} θn
(1/α)(1/N) ∆θN = A∗ [θ^PR_N − θ∗] + (1/N) ∑_{n=0}^{N−1} Ξ̃n + O(α²)
P-R representation = the statistically optimal + the statistically annoying:
θ^PR_N = θ∗ + [A∗]^{−1} (1/N) ∑_{n=1}^{N} Wn − [A∗]^{−1} (α/N) ∑_{n=1}^{N} Υn + O((αN)^{−1} + α²)
Annoyance Υn: introduces a bias of order O(α) (and variance not understood)
Υn = −(1/α) [ f̂_{n+1}(θn) − f̂_{n+1}(θ_{n−1}) ] ≈ −∂θ f̂_{n+1}(θn) · f(θ_{n−1}, Wn)
8 / 37
Understanding Variance   P-Mean Flow ⇒ Polyak-Ruppert Averaging
Design Implications   θn+1 − θn = αn+1 { f̄(θn) + Ξ̃n }   Vanishing Step-size
Slightly different starting point:
∑_{n=0}^{N−1} (1/αn+1)[θn+1 − θn] = ∑_{n=0}^{N−1} { f̄(θn) + Ξ̃n } = ∑_{n=0}^{N−1} { A∗(θn − θ∗) + Ξ̃n } + O( ∑_{n=0}^{N−1} α²_{n+1} )   [the final sum is assumed bounded]
Statistical memory ⇒ try αn = g/(1 + n/ne)^ρ, with 1/2 < ρ < 1
Ignoring transients (and ignoring the summation-by-parts calculation),
θ^PR_N = (1/(N − N0)) ∑_{n=N0}^{N−1} θn = θ∗ + (1/(N − N0)) [A∗]^{−1} ∑_{n=N0+1}^{N} Wn + O(1/N)
Optimality of PR-Averaging
Cov(θ^PR_N − θ∗) = (1/(N − N0)) Σ^PR_N,   Σ^PR_N → Σ^PR = [A∗]^{−1} ΣW [A∗ᵀ]^{−1}   minimal in the strongest sense
9 / 37
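A minimal sketch of this recipe, step-size αn = g/(1 + n/ne)^ρ plus Polyak-Ruppert averaging, on a linear scalar example with A∗ = −1. The gains, burn-in N0, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

theta_star, A_star = 1.0, -1.0
g, ne, rho = 1.0, 100, 0.8            # illustrative step-size parameters
N, N0 = 200_000, 20_000               # run length and burn-in for the average

theta = 10.0
history = np.empty(N)
for n in range(N):
    alpha = g / (1.0 + n / ne) ** rho            # alpha_n = g/(1 + n/ne)^rho
    W = rng.normal()                             # white noise term W_n
    theta += alpha * (A_star * (theta - theta_star) + W)
    history[n] = theta

theta_pr = history[N0:].mean()                   # Polyak-Ruppert average
print(theta_pr)   # close to theta* = 1, with covariance ~ Sigma_PR / (N - N0)
```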
Understanding Variance   P-Mean Flow ⇒ Polyak-Ruppert Averaging
Asymptotic Statistics
Optimality of PR-Averaging
Cov(θ^PR_N − θ∗) = (1/(N − N0)) Σ^PR_N,   Σ^PR_N → Σ^PR = [A∗]^{−1} ΣW [A∗ᵀ]^{−1}   minimal in the strongest sense
The Central Limit Theorem holds: a tremendous tool for validation, ...
[Histogram of √N θ̃N from 10³ independent runs]
10 / 37
αn = g/(1 + n/ne)^ρ
[Plots of θk and ϑτk against τk for ρ = 1.0, 0.9, 0.8, with τN = 2.8, 5.3, 11]
Unveiling Dynamics
Unveiling Dynamics   Two Sources of Error
SA Error   θn+1 = θn + αn+1 { f̄(θn) + “NOISE” }     d/dt ϑt = f̄(ϑt)
1  Asymptotic Covariance: with αn = g/(1 + n/ne)^ρ,
   (1/αn)[θn − ϑτn] ≈ N(0, Σ) without averaging, where τn = ∑_{k=1}^{n} αk
2  ϑt → θ∗ exponentially fast, but τn is increasing slowly, and nonlinear dynamics can complicate gain selection
What can happen in applications, using αn+1 = g/(1 + n/ne)^ρ:
θn far from θ∗: the dynamics are slow, need large g
θn ≈ θ∗: the best gain is far smaller
11 / 37
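To see how slowly the "ODE time" τn accumulates, the sketch below sums τN = ∑_{k≤N} αk for one million samples and a few values of ρ. The values g = 1/4 and ne = 1 are illustrative; the exact totals depend on these choices.

```python
import numpy as np

g, ne = 0.25, 1                     # illustrative gain and offset
N = 1_000_000
n = np.arange(1, N + 1)

for rho in (1.0, 0.9, 0.8):
    alpha = g / (1.0 + n / ne) ** rho
    tau_N = alpha.sum()             # tau_N: total ODE time covered by N SA steps
    print(rho, round(tau_N, 1))     # for rho = 1 this grows only like g*log(N)
```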
Unveiling Dynamics   Two Sources of Error
Two Sources of Error. Example: SGD   αn = g/(1 + n/ne)^ρ
Stochastic Gradient Descent:   L̄(θ) = E[L(θ, Φn)],   f̄(θ) = −∇L̄(θ),   θn+1 = θn − (g/n^ρ) ∇L(θn, Φn+1)
[Plot of L̄(θ) and f̄(θ) for −20 ≤ θ ≤ 20, with slopes −4, −1, and −0.34 marked]
ODE bound using ρ = 1 and setting ne = 1: since d/dt ϑt = f̄(ϑt) gives |ϑt − θ∗| ≤ |ϑ0 − θ∗| e^{−0.34 t},
|ϑτn − θ∗| ≤ |ϑ0 − θ∗| e^{0.34 g} n^{−0.34 g}
g ≥ 3 to kill the deterministic behavior, but g∗ = −[A∗]^{−1} = 1/4 gives optimal variance
Dynamics for g∗ = 1/4:
[Plots of θk and ϑτk against τk for ρ = 1.0, 0.9, 0.8, with τN = 2.8, 5.3, 11];   τN ≈ 3 for N = one million
CLT approximation: rapid for θ0 = 0, slow for θ0 = 100
Polyak-Ruppert to the rescue:
[Histograms of √N θ̃N from Polyak-Ruppert averaging, for small and big gains g = 1/4 and g = 1/0.34, with ρ = 1.0, 0.9, 0.8 and σθ = 5/2]
12 / 37
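A stripped-down SGD sketch of the gain dilemma. The slide's loss has two slopes (−4 near θ∗ and −0.34 far away); here, as a simplifying assumption, the gradient has slope 0.34 everywhere, which is enough to show how a small gain leaves a transient that decays only like n^{−0.34g}, while a large gain removes it.

```python
import numpy as np

rng = np.random.default_rng(2)

def grad_L(theta, noise):
    # Illustrative noisy gradient with mean 0.34 * theta (so theta* = 0)
    return 0.34 * theta + noise

def sgd(theta0, g, rho=1.0, N=100_000):
    theta = theta0
    for n in range(1, N + 1):
        alpha = g / n ** rho
        theta -= alpha * grad_L(theta, rng.normal())
    return theta

# Small g: transient from theta0 = 100 decays only like n^{-0.34 g};
# g around 1/0.34 or larger kills the transient within the run.
for g in (1 / 4, 1 / 0.34, 3.0):
    print(g, sgd(100.0, g))
```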
Unveiling Dynamics   Two Sources of Error
Two Sources of Error. Example: Tabular Q-Learning
g ≥ 1/(1 − γ) required, if using ρ = 1*   Stay tuned!
[Plot of the maximal Bellman error MaxBEQ_n over one million samples (n × 10^5), for Q-learning with g = gAD and with g = 1/(1 − γ)]   αn = g/n(x, u)
Awesome performance! (?)
Generic tabular Q-learning example. Discount factor γ.
*See Devraj & M 2017 (extension in Wainwright 2019); see also Szepesvári 1997
13 / 37
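A minimal sketch of tabular Q-learning with the per-pair step-size αn = g/n(x, u) and gain g = 1/(1 − γ). The random MDP, reward, exploration policy, and run length are all placeholder assumptions; this is not the lecture's benchmark.

```python
import numpy as np

rng = np.random.default_rng(3)

nX, nU, gamma = 6, 2, 0.99
P = rng.dirichlet(np.ones(nX), size=(nX, nU))   # illustrative random transition kernel
r = rng.uniform(size=(nX, nU))                  # illustrative reward r(x, u)

g = 1.0 / (1.0 - gamma)                         # gain g >= 1/(1 - gamma) for rho = 1
Q = np.zeros((nX, nU))
counts = np.zeros((nX, nU))                     # n(x, u): visits to each state-action pair

x = 0
for n in range(500_000):
    u = rng.integers(nU)                        # random exploration policy
    x_next = rng.choice(nX, p=P[x, u])
    counts[x, u] += 1
    alpha = g / counts[x, u]                    # alpha_n = g / n(x, u)
    td = r[x, u] + gamma * Q[x_next].max() - Q[x, u]
    Q[x, u] += alpha * td
    x = x_next

print(Q.max(axis=1))                            # approximate optimal value function
```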
Zap
Introduction to Zap   Newton-Raphson flow
Taming Nonlinear Dynamics
What if stability of d/dt ϑ = f̄(ϑ) is unknown? [typically the case in RL]
Newton-Raphson Flow [Smale 1976]
Idea: Interpret f̄ as the “parameter”:   d/dt f̄t = V(f̄t), and design V
Choosing V(f̄) = −f̄:   d/dt f̄(ϑt) = −f̄(ϑt), giving f̄(ϑt) = f̄(ϑ0) e^{−t}   Linear dynamics
Equivalently,   d/dt ϑt = −[A(ϑt)]^{−1} f̄(ϑt)
SA translation: Zap Stochastic Approximation
14 / 37
Introduction to Zap   Newton-Raphson flow
Zap Algorithm   Designed to emulate the Newton-Raphson flow
d/dt ϑt = −[A(ϑt)]^{−1} f̄(ϑt),   A(θ) = ∂/∂θ f̄(θ)
Zap-SA (designed to emulate deterministic Newton-Raphson)
θn+1 = θn + αn+1 [−Ân+1]^{−1} f(θn, ξn+1)
Ân+1 = Ân + βn+1 (An+1 − Ân),   An+1 = ∂θ f(θn, ξn+1)
Requires Ân+1 ≈ A(θn) := ∂θ f̄(θn), which requires high gain:   βn/αn → ∞ as n → ∞
Can use αn = 1/n (without averaging).
Numerics to come: use this choice, and βn = (1/n)^ρ, ρ ∈ (0.5, 1)
Stability? Virtually universal. Optimal variance, too!
Based on ancient theory from Ruppert & Polyak [80, 81, 79]
15 / 37
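A minimal sketch of the two-time-scale Zap-SA recursion on a linear root-finding problem whose mean flow is deliberately unstable (A is not Hurwitz). The matrix A, noise level, and gains are illustrative assumptions; in this linear case the observed Jacobian ∂θ f(θ, ξ) is simply A.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative problem: fbar(theta) = A (theta - theta_star), with A NOT Hurwitz,
# so the plain mean flow d/dt = fbar would be unstable.
A = np.array([[1.0, 2.0],
              [0.0, -3.0]])
theta_star = np.array([1.0, -1.0])

def f(theta, xi):
    return A @ (theta - theta_star) + xi

N = 200_000
theta = np.array([5.0, 5.0])
A_hat = -np.eye(2)                      # initial matrix-gain estimate
for n in range(1, N + 1):
    alpha = 1.0 / n                     # alpha_n = 1/n
    beta = (1.0 / n) ** 0.85            # beta_n = n^{-rho}, rho in (0.5, 1): high gain
    xi = rng.normal(size=2, scale=0.5)
    A_n = A                             # Jacobian of f(., xi); here it is exactly A
    A_hat += beta * (A_n - A_hat)       # fast estimate of A(theta_n)
    theta += alpha * np.linalg.solve(-A_hat, f(theta, xi))   # Zap-SA update

print(theta)   # converges to theta_star despite the unstable mean flow
```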
Conclusions
Thank you Lyapunov and Polyak!
[Figure: θk and ϑτk against τk for ρ = 0.8 (“Crank up gain: tame transients”), and a histogram of √(N − N0) θ̃^PR_N outcomes from M repeated runs (“Average: tame variance”)]
Steps to a Successful Design
1  Design f̄ so that the mean flow d/dt ϑ = f̄(ϑ) is GAS, with A∗ Hurwitz
2  Design the step-size: αn = g/(1 + n/ne)^ρ, with 1/2 < ρ < 1
3  Perform PR averaging:   θ^PR_N = (1/(N − N0)) ∑_{n=N0+1}^{N} θn
4  Repeat! Obtain the histogram { √(N − N0) θ̃^{PR (m)}_N : 1 ≤ m ≤ M } with θ0^{(m)} widely dispersed and N relatively small
What you will learn:
How big N needs to be for a meaningful estimate
Approximate confidence bounds
Heuristic: θ^PR_N ≈ θ∗ + (1/√(N − N0)) Z,   Z ∼ N(0, Σ^PR)
Next Steps:
A flash-crash control course!
Applications to RL
16 / 37
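Step 4 of the recipe, in miniature: a sketch that repeats the PR-averaged experiment M times from widely dispersed initial conditions and inspects √(N − N0)·(θ^PR_N − θ∗), to be compared with the CLT heuristic above. The linear model, gains, and run sizes are illustrative assumptions chosen to keep the sketch fast.

```python
import numpy as np

rng = np.random.default_rng(5)

theta_star, A_star = 1.0, -1.0
g, ne, rho = 1.0, 100, 0.8
N, N0, M = 20_000, 2_000, 100

def one_run(theta0):
    theta, hist = theta0, np.empty(N)
    for n in range(N):
        alpha = g / (1.0 + n / ne) ** rho
        theta += alpha * (A_star * (theta - theta_star) + rng.normal())
        hist[n] = theta
    return hist[N0:].mean()                     # PR average for one run

errors = np.array([np.sqrt(N - N0) * (one_run(rng.uniform(-50, 50)) - theta_star)
                   for _ in range(M)])
# The empirical spread approximates Sigma_PR; a histogram of `errors`
# reveals whether N is large enough for a meaningful estimate.
print(errors.mean(), errors.std())
```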
References
Control Techniques for Complex Networks, Sean Meyn. Pre-publication version for on-line viewing; monograph available for purchase at your favorite retailer. More information available at http://www.cambridge.org/us/catalogue/catalogue.asp?isbn=9780521884419
Markov Chains and Stochastic Stability, S. P. Meyn and R. L. Tweedie. August 2008 pre-publication version for on-line viewing; monograph to appear February 2009.
[Cover formulas: π(f) < ∞,   ∆V(x) ≤ −f(x) + b I_C(x),   ‖P^n(x, ·) − π‖_f → 0,   sup_C E_x[S_{τ_C}(f)] < ∞]
References
17 / 37
References
Control Background I
[1] K. J. Åström and R. M. Murray. Feedback Systems: An Introduction for Scientists and
Engineers. Princeton University Press, USA, 2008 (recent edition on-line).
[2] K. J. Åström and B. Wittenmark. Adaptive Control. Addison-Wesley Longman Publishing
Co., Inc., Boston, MA, USA, 2nd edition, 1994.
[3] A. Fradkov and B. T. Polyak. Adaptive and robust control in the USSR.
IFAC–PapersOnLine, 53(2):1373–1378, 2020. 21st IFAC World Congress.
[4] M. Krstic, P. V. Kokotovic, and I. Kanellakopoulos. Nonlinear and adaptive control
design. John Wiley & Sons, Inc., 1995.
[5] K. J. Åström. Theory and applications of adaptive control—a survey. Automatica,
19(5):471–486, 1983.
[6] K. J. Åström. Adaptive control around 1960. IEEE Control Systems Magazine,
16(3):44–49, 1996.
[7] L. Ljung. Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic
Control, 22(4):551–575, 1977.
18 / 37
References
Control Background II
[8] N. Matni, A. Proutiere, A. Rantzer, and S. Tu. From self-tuning regulators to
reinforcement learning and back again. In Proc. of the IEEE Conf. on Dec. and Control,
pages 3724–3740, 2019.
19 / 37
References
RL Background I
[9] S. Meyn. Control Systems and Reinforcement Learning. Cambridge University Press,
Cambridge, 2021.
[10] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press. On-line
edition at http://www.cs.ualberta.ca/~sutton/book/the-book.html, Cambridge,
MA, 2nd edition, 2018.
[11] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial
Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
[12] D. P. Bertsekas. Reinforcement learning and optimal control. Athena Scientific, Belmont,
MA, 2019.
[13] T. Lattimore and C. Szepesvari. Bandit Algorithms. Cambridge University Press, 2020.
[14] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn.,
3(1):9–44, 1988.
[15] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
[16] J. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning,
16:185–202, 1994.
20 / 37
References
RL Background II
[17] T. Jaakola, M. Jordan, and S. Singh. On the convergence of stochastic iterative dynamic
programming algorithms. Neural Computation, 6:1185–1201, 1994.
[18] B. Van Roy. Learning and Value Function Approximation in Complex Decision Processes.
PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, 1998.
[19] J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic
programming. Mach. Learn., 22(1-3):59–94, 1996.
[20] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function
approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.
[21] J. N. Tsitsiklis and B. V. Roy. Average cost temporal-difference learning. Automatica,
35(11):1799–1808, 1999.
[22] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space
theory, approximation algorithms, and an application to pricing high-dimensional financial
derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999.
[23] D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and
efficient temporal-difference learning. Discrete Event Dynamic Systems: Theory and
Applications, 16(2):207–239, 2006.
21 / 37
References
RL Background III
[24] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference
learning. Mach. Learn., 22(1-3):33–57, 1996.
[25] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn.,
49(2-3):233–246, 2002.
[26] A. Nedic and D. Bertsekas. Least squares policy evaluation algorithms with linear function
approximation. Discrete Event Dyn. Systems: Theory and Appl., 13(1-2):79–110, 2003.
[27] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th
Internat. Conf. on Neural Info. Proc. Systems, 1064–1070. MIT Press, 1997.
[28] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning
Research, 5(Dec):1–25, 2003.
[29] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In
Advances in Neural Information Processing Systems, 2011.
[30] D. Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana. Feature selection for
neuro-dynamic programming. In F. Lewis, editor, Reinforcement Learning and
Approximate Dynamic Programming for Feedback Control. Wiley, 2011.
22 / 37
References
RL Background IV
[31] A. M. Devraj and S. P. Meyn. Zap Q-learning. In Proc. of the Intl. Conference on Neural
Information Processing Systems, pages 2232–2241, 2017.
[32] S. Chen, A. M. Devraj, F. Lu, A. Busic, and S. Meyn. Zap Q-Learning with nonlinear
function approximation. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and
H. Lin, editors, Advances in Neural Information Processing Systems, and arXiv e-prints
1910.05405, volume 33, pages 16879–16890, 2020.
[33] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007.
See last chapter on simulation and average-cost TD learning
DQN:
[34] M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural
reinforcement learning method. In J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge, and
L. Torgo, editors, Machine Learning: ECML 2005, pages 317–328, Berlin, Heidelberg,
2005. Springer Berlin Heidelberg.
[35] S. Lange, T. Gabel, and M. Riedmiller. Batch reinforcement learning. In Reinforcement
learning, pages 45–73. Springer, 2012.
[36] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A.
Riedmiller. Playing Atari with deep reinforcement learning. ArXiv, abs/1312.5602, 2013.
23 / 37
References
RL Background V
[37] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves,
M. A. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik,
I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level
control through deep reinforcement learning. Nature, 518:529–533, 2015.
[38] O. Anschel, N. Baram, and N. Shimkin. Averaged-DQN: Variance reduction and
stabilization for deep reinforcement learning. In Proc. of ICML, pages 176–185.
JMLR.org, 2017.
Actor Critic / Policy Gradient
[39] P. J. Schweitzer. Perturbation theory and finite Markov chains. J. Appl. Prob., 5:401–403,
1968.
[40] C. D. Meyer, Jr. The role of the group generalized inverse in the theory of finite Markov
chains. SIAM Review, 17(3):443–464, 1975.
[41] P. W. Glynn. Stochastic approximation for Monte Carlo optimization. In Proceedings of
the 18th conference on Winter simulation, pages 356–365, 1986.
[42] R. J. Williams. Simple statistical gradient-following algorithms for connectionist
reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
24 / 37
References
RL Background VI
[43] T. Jaakkola, S. P. Singh, and M. I. Jordan. Reinforcement learning algorithm for partially
observable Markov decision problems. In Advances in neural information processing
systems, pages 345–352, 1995.
[44] X.-R. Cao and H.-F. Chen. Perturbation realization, potentials, and sensitivity analysis of
Markov processes. IEEE Transactions on Automatic Control, 42(10):1382–1393, Oct
1997.
[45] P. Marbach and J. N. Tsitsiklis. Simulation-based optimization of Markov reward
processes. IEEE Trans. Automat. Control, 46(2):191–209, 2001.
[46] V. Konda. Actor-critic algorithms. PhD thesis, Massachusetts Institute of Technology,
2002.
[47] V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In Advances in neural
information processing systems, pages 1008–1014, 2000.
[48] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for
reinforcement learning with function approximation. In Advances in neural information
processing systems, pages 1057–1063, 2000.
[49] P. Marbach and J. N. Tsitsiklis. Simulation-based optimization of Markov reward
processes. IEEE Trans. Automat. Control, 46(2):191–209, 2001.
25 / 37
References
RL Background VII
[50] S. M. Kakade. A natural policy gradient. In Advances in neural information processing
systems, pages 1531–1538, 2002.
[51] H. Mania, A. Guy, and B. Recht. Simple random search provides a competitive approach
to reinforcement learning. In Advances in Neural Information Processing Systems, pages
1800–1809, 2018.
MDPs, LPs and Convex Q:
[52] A. S. Manne. Linear programming and sequential decisions. Management Sci.,
6(3):259–267, 1960.
[53] C. Derman. Finite State Markovian Decision Processes, volume 67 of Mathematics in
Science and Engineering. Academic Press, Inc., 1970.
[54] V. S. Borkar. Convex analytic methods in Markov decision processes. In Handbook of
Markov decision processes, volume 40 of Internat. Ser. Oper. Res. Management Sci.,
pages 347–375. Kluwer Acad. Publ., Boston, MA, 2002.
[55] D. P. de Farias and B. Van Roy. The linear programming approach to approximate
dynamic programming. Operations Res., 51(6):850–865, 2003.
26 / 37
References
RL Background VIII
[56] D. P. de Farias and B. Van Roy. A cost-shaping linear program for average-cost
approximate dynamic programming with performance guarantees. Math. Oper. Res.,
31(3):597–620, 2006.
[57] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In Proc. of
the IEEE Conf. on Dec. and Control, pages 3598–3605, Dec. 2009.
[58] J. Bas Serrano, S. Curi, A. Krause, and G. Neu. Logistic Q-learning. In A. Banerjee and
K. Fukumizu, editors, Proc. of The Intl. Conference on Artificial Intelligence and
Statistics, volume 130, pages 3610–3618, 13–15 Apr 2021.
[59] F. Lu, P. G. Mehta, S. P. Meyn, and G. Neu. Convex Q-learning. In American Control
Conf., pages 4749–4756. IEEE, 2021.
[60] F. Lu, P. G. Mehta, S. P. Meyn, and G. Neu. Convex analytic theory for convex
Q-learning. In Conference on Decision and Control–to appear. IEEE, 2022.
Gator Nation:
[61] A. M. Devraj, A. Bušić, and S. Meyn. Fundamental design principles for reinforcement
learning algorithms. In K. G. Vamvoudakis, Y. Wan, F. L. Lewis, and D. Cansever,
editors, Handbook on Reinforcement Learning and Control, Studies in Systems, Decision
and Control series (SSDC, volume 325). Springer, 2021.
27 / 37
References
RL Background IX
[62] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv, July 2017
(extended version of NIPS 2017).
[63] A. M. Devraj. Reinforcement Learning Design with Optimal Learning Rate. PhD thesis,
University of Florida, 2019.
[64] A. M. Devraj and S. P. Meyn. Q-learning with uniformly bounded variance: Large
discounting is not a barrier to fast learning. IEEE Trans Auto Control (and
arXiv:2002.10301), 2021.
[65] A. M. Devraj, A. Bušić, and S. Meyn. On matrix momentum stochastic approximation
and applications to Q-learning. In Allerton Conference on Communication, Control, and
Computing, pages 749–756, Sep 2019.
28 / 37
References
Stochastic Miscellanea I
[66] S. Asmussen and P. W. Glynn. Stochastic Simulation: Algorithms and Analysis, volume 57
of Stochastic Modelling and Applied Probability. Springer-Verlag, New York, 2007.
[67] P. W. Glynn and S. P. Meyn. A Liapounov bound for solutions of the Poisson equation.
Ann. Probab., 24(2):916–931, 1996.
[68] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge
University Press, Cambridge, second edition, 2009. Published in the Cambridge
Mathematical Library.
[69] R. Douc, E. Moulines, P. Priouret, and P. Soulier. Markov Chains. Springer, 2018.
29 / 37
More Related Content

PDF
Introducing Zap Q-Learning
PDF
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
PDF
Zap Q-Learning - ISMP 2018
PDF
Pinning and facetting in multiphase LBMs
PDF
Convergence of ABC methods
PDF
Crib Sheet AP Calculus AB and BC exams
PPT
07 periodic functions and fourier series
PPTX
Signal Processing Homework Help
Introducing Zap Q-Learning
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
Zap Q-Learning - ISMP 2018
Pinning and facetting in multiphase LBMs
Convergence of ABC methods
Crib Sheet AP Calculus AB and BC exams
07 periodic functions and fourier series
Signal Processing Homework Help

Similar to DeepLearn2022 2. Variance Matters (20)

PDF
2.1 Calculus 2.formulas.pdf.pdf
PDF
THE CHORD GAP DIVERGENCE AND A GENERALIZATION OF THE BHATTACHARYYA DISTANCE
PDF
A sharp nonlinear Hausdorff-Young inequality for small potentials
PDF
IVR - Chapter 1 - Introduction
PDF
DeepLearn2022 1. Goals & AlgorithmDesign.pdf
PDF
Scattering theory analogues of several classical estimates in Fourier analysis
PPTX
Digital signal processing on arm new
PDF
The lattice Boltzmann equation: background, boundary conditions, and Burnett-...
PDF
QMC: Operator Splitting Workshop, Thresholdings, Robustness, and Generalized ...
PDF
02 CS316_algorithms_Asymptotic_Notations(2).pdf
PPTX
stochastic processes assignment help
PDF
ENFPC 2010
PDF
EC8553 Discrete time signal processing
PDF
Trajectory Alignment: Understanding the Edge of Stability Phenomenon via Bifu...
PDF
Introduction to the theory of optimization
PDF
A Fibonacci-like universe expansion on time-scale
PDF
NCE, GANs & VAEs (and maybe BAC)
PDF
CDT 22 slides.pdf
PDF
01Introduction_Lecture8signalmoddiscr.pdf
2.1 Calculus 2.formulas.pdf.pdf
THE CHORD GAP DIVERGENCE AND A GENERALIZATION OF THE BHATTACHARYYA DISTANCE
A sharp nonlinear Hausdorff-Young inequality for small potentials
IVR - Chapter 1 - Introduction
DeepLearn2022 1. Goals & AlgorithmDesign.pdf
Scattering theory analogues of several classical estimates in Fourier analysis
Digital signal processing on arm new
The lattice Boltzmann equation: background, boundary conditions, and Burnett-...
QMC: Operator Splitting Workshop, Thresholdings, Robustness, and Generalized ...
02 CS316_algorithms_Asymptotic_Notations(2).pdf
stochastic processes assignment help
ENFPC 2010
EC8553 Discrete time signal processing
Trajectory Alignment: Understanding the Edge of Stability Phenomenon via Bifu...
Introduction to the theory of optimization
A Fibonacci-like universe expansion on time-scale
NCE, GANs & VAEs (and maybe BAC)
CDT 22 slides.pdf
01Introduction_Lecture8signalmoddiscr.pdf

More from Sean Meyn (20)

PDF
DeepLearn2022 3. TD and Q Learning
PDF
Smart Grid Tutorial - January 2019
PDF
State Space Collapse in Resource Allocation for Demand Dispatch - May 2019
PDF
Irrational Agents and the Power Grid
PDF
Reinforcement Learning: Hidden Theory and New Super-Fast Algorithms
PDF
State estimation and Mean-Field Control with application to demand dispatch
PDF
Demand-Side Flexibility for Reliable Ancillary Services
PDF
Spectral Decomposition of Demand-Side Flexibility for Reliable Ancillary Serv...
PDF
Demand-Side Flexibility for Reliable Ancillary Services in a Smart Grid: Elim...
PDF
Why Do We Ignore Risk in Power Economics?
PDF
Distributed Randomized Control for Ancillary Service to the Power Grid
PDF
Ancillary service to the grid from deferrable loads: the case for intelligent...
PDF
2012 Tutorial: Markets for Differentiated Electric Power Products
PDF
Control Techniques for Complex Systems
PDF
Tutorial for Energy Systems Week - Cambridge 2010
PDF
Panel Lecture for Energy Systems Week
PDF
The Value of Volatile Resources... Caltech, May 6 2010
PDF
Approximate dynamic programming using fluid and diffusion approximations with...
PDF
Anomaly Detection Using Projective Markov Models
PDF
Markov Tutorial CDC Shanghai 2009
DeepLearn2022 3. TD and Q Learning
Smart Grid Tutorial - January 2019
State Space Collapse in Resource Allocation for Demand Dispatch - May 2019
Irrational Agents and the Power Grid
Reinforcement Learning: Hidden Theory and New Super-Fast Algorithms
State estimation and Mean-Field Control with application to demand dispatch
Demand-Side Flexibility for Reliable Ancillary Services
Spectral Decomposition of Demand-Side Flexibility for Reliable Ancillary Serv...
Demand-Side Flexibility for Reliable Ancillary Services in a Smart Grid: Elim...
Why Do We Ignore Risk in Power Economics?
Distributed Randomized Control for Ancillary Service to the Power Grid
Ancillary service to the grid from deferrable loads: the case for intelligent...
2012 Tutorial: Markets for Differentiated Electric Power Products
Control Techniques for Complex Systems
Tutorial for Energy Systems Week - Cambridge 2010
Panel Lecture for Energy Systems Week
The Value of Volatile Resources... Caltech, May 6 2010
Approximate dynamic programming using fluid and diffusion approximations with...
Anomaly Detection Using Projective Markov Models
Markov Tutorial CDC Shanghai 2009

Recently uploaded (20)

PPTX
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PDF
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
PPTX
web development for engineering and engineering
DOCX
573137875-Attendance-Management-System-original
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
PPT
Project quality management in manufacturing
PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PPTX
Welding lecture in detail for understanding
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PDF
Digital Logic Computer Design lecture notes
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PPTX
Internet of Things (IOT) - A guide to understanding
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
web development for engineering and engineering
573137875-Attendance-Management-System-original
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
Project quality management in manufacturing
Automation-in-Manufacturing-Chapter-Introduction.pdf
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
Welding lecture in detail for understanding
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
Digital Logic Computer Design lecture notes
Model Code of Practice - Construction Work - 21102022 .pdf
CYBER-CRIMES AND SECURITY A guide to understanding
UNIT-1 - COAL BASED THERMAL POWER PLANTS
Internet of Things (IOT) - A guide to understanding

DeepLearn2022 2. Variance Matters

  • 1. Part 2: Variance Matters Sean Meyn Department of Electrical and Computer Engineering University of Florida Inria International Chair Inria, Paris Thanks to to our sponsors: NSF and ARO
  • 2. Part 2: Variance Matters Outline 1 Stochastic Approximation 2 Understanding Variance 3 Unveiling Dynamics 4 Introduction to Zap 5 Conclusions 6 References
  • 3. Part 2: Variance Matters Resources ODE Method (using different meaning than in the 1970s) CS&RL, Chapter 8 The ODE Method for Asymptotic Statistics in Stochastic Approximation and Reinforcement Learning [73] Control Techniques FOR Complex Networks Sean Meyn Pre-publication version for on-line viewing. Monograph available for purchase at your favorite retailer More information available at http://www.cambridge.org/us/catalogue/catalogue.asp?isbn=9780521884419 Markov Chains and Stochastic Stability S. P. Meyn and R. L. Tweedie August 2008 Pre-publication version for on-line viewing. Monograph to appear Februrary 2009 π(f ) < ∞ ∆V (x) ≤ −f(x) + bIC(x) Pn (x, · ) − πf → 0 sup C E x [S τ C (f )] ∞ 1 / 37
  • 4. Special Thanks My interests in RLSA began during my first sabbatical—at IISc with Vivek Borkar Many other heroes along the way since Metivier Van Roy Tsitsiklis Bertsekas Konda Priouret Kushner Surana Huang Ruppert Polyak Szepesvari Benveniste Yin Nedic Yu Colombino Dall’Anese Bernstein Chen Chen Barooah Raman Max P.E. Caines Ioannis Kontoyiannis Ana Busic Eric Moulines Adithya Devraj Many Others!
  • 5. -100 0 100 -100 0 100 -100 0 100 Zap 2 Zap 5 Zap 10 θn+1 = θn + an+1f(θn, ξn+1) θn Excitation Estimates Stochastic Approximation
  • 6. Stochastic Approximation ODE Method What is Stochastic Approximation? ¯ f(θ) = E[f(θ, ξ)] No different than last lecture: find solution to ¯ f(θ∗ ) = 0 Example: ¯ f(θ) = −∇Γ (θ) for optimization. In RL the function ¯ f is typically not of this form. 3 / 37
  • 7. Stochastic Approximation ODE Method What is Stochastic Approximation? ¯ f(θ) = E[f(θ, ξ)] No different than last lecture: find solution to ¯ f(θ∗ ) = 0 ODE algorithm: d dt ϑt = ¯ f(ϑt) If stable: ϑt → θ∗ and ¯ f(ϑt) → ¯ f(θ∗ ) = 0. 3 / 37
  • 8. Stochastic Approximation ODE Method What is Stochastic Approximation? ¯ f(θ) = E[f(θ, ξ)] No different than last lecture: find solution to ¯ f(θ∗ ) = 0 ODE algorithm: d dt ϑt = ¯ f(ϑt) If stable: ϑt → θ∗ and ¯ f(ϑt) → ¯ f(θ∗ ) = 0. Euler approximation: θn+1 = θn + αn+1 ¯ f(θn) Terminology: αn usually called the step-size 3 / 37
  • 9. Stochastic Approximation ODE Method What is Stochastic Approximation? ¯ f(θ) = E[f(θ, ξ)] No different than last lecture: find solution to ¯ f(θ∗ ) = 0 ODE algorithm: d dt ϑt = ¯ f(ϑt) If stable: ϑt → θ∗ and ¯ f(ϑt) → ¯ f(θ∗ ) = 0. Euler approximation: θn+1 = θn + αn+1 ¯ f(θn) Stochastic Approximation θn+1 = θn + αn+1f(θn, ξn+1) with ξn → ξ in dist. 3 / 37
  • 10. Stochastic Approximation ODE Method What is Stochastic Approximation? ¯ f(θ) = E[f(θ, ξ)] No different than last lecture: find solution to ¯ f(θ∗ ) = 0 ODE algorithm: d dt ϑt = ¯ f(ϑt) If stable: ϑt → θ∗ and ¯ f(ϑt) → ¯ f(θ∗ ) = 0. Euler approximation: θn+1 = θn + αn+1 ¯ f(θn) Stochastic Approximation θn+1 = θn + αn+1f(θn, ξn+1) = θn + αn+1 ¯ f(θn) + “NOISE” Under very general conditions: the ODE, the Euler approximation, and SA are all convergent to θ∗ 3 / 37
  • 11. Stochastic Approximation ODE Method What is Stochastic Approximation? ¯ f(θ) = E[f(θ, ξ)] No different than last lecture: find solution to ¯ f(θ∗ ) = 0 ODE algorithm: d dt ϑt = ¯ f(ϑt) If stable: ϑt → θ∗ and ¯ f(ϑt) → ¯ f(θ∗ ) = 0. Euler approximation: θn+1 = θn + αn+1 ¯ f(θn) Stochastic Approximation θn+1 = θn + αn+1f(θn, ξn+1) = θn + αn+1 ¯ f(θn) + “NOISE” Under very general conditions: the ODE, the Euler approximation, and SA are all convergent to θ∗ [Robbins and Monro, 1951] see Borkar’s monograph [70] 3 / 37
  • 12. Stochastic Approximation ODE Method Questions for Algorithm Design ¯ f(θ) = E[f(θ, ξ)] Stochastic Approximation θn+1 = θn + αn+1f(θn, ξn+1) = θn + αn+1 ¯ f(θn) + e Ξn (NOISE) Questions Can we duplicate our success with QSA? Is there a Perturbative Mean Flow? Implications to transient and asymptotic behavior? What are the implications to design? 4 / 37
  • 13. Stochastic Approximation ODE Method Questions for Algorithm Design ¯ f(θ) = E[f(θ, ξ)] Stochastic Approximation θn+1 = θn + αn+1f(θn, ξn+1) = θn + αn+1 ¯ f(θn) + e Ξn Questions Can we duplicate our success with QSA? Is there a Perturbative Mean Flow? Implications to transient and asymptotic behavior? What are the implications to design? Go beyond the last lecture: What if the mean flow d dt ϑ = ¯ f(ϑ) is not stable, or “ill conditioned”? 4 / 37
  • 14. Control Techniques for Complex Networks page 528 Example 11.5.1. LSTD for the M/M/1 queue (minimum variance algorithm) Nonsense after One Million Samples 0 10 20 30 0 5 10 15 ρ = 0.8 ρ = 0.9 hθ (x) = θ1 θ1 x + θ2x2 θ2 θ∗ 1 = θ∗ 2 0 1 2 3 4 5 x 10 6 0 1 2 3 4 5 x 10 6 Understanding Variance
  • 15. Understanding Variance Disturbance decompositions Perturbative Mean Flow θn+1 = θn + αn+1f(θn, Wn+1) Prerequisites as before Mean flow d dt ϑ = ¯ f(ϑ) globally asymptotically stable So ϑt → θ∗ from each initial condition A tiny bit more is needed to ensure {θn} is bounded – see CSRL for details! A∗ = ∂θ ¯ f (θ∗) Hurwitz eigenvalues in left half plane of C 5 / 37
  • 16. Understanding Variance Disturbance decompositions Perturbative Mean Flow θn+1 = θn + αn+1f(θn, Wn+1) Prerequisites as before Mean flow d dt ϑ = ¯ f(ϑ) globally asymptotically stable So ϑt → θ∗ from each initial condition A tiny bit more is needed to ensure {θn} is bounded – see CSRL for details! A∗ = ∂θ ¯ f (θ∗) Hurwitz eigenvalues in left half plane of C And a replacement for our almost periodic exploration: ξn+1 = G0(Φn+1), with Φ a “nice Markov chain” Assume here: finite state space and irreducible Please don’t rule out periodicity! Remember our luck with QSA! 5 / 37
  • 17. Understanding Variance Disturbance decompositions P- Mean Flow θn+1 = θn + αn+1f(θn, Wn+1) = θn + αn+1 ¯ f(θn) + e Ξn Perturbative Mean Flow (details for fixed step-size) Two distinctions: 1. We remain in discrete time, and 2. We obtain an expression for the sum of the NOISE: N−1 X n=0 e Ξn = Foundation of all is Poisson’s equation: b fn(θ) − E[ b fn+1(θ) | Fn]
  • 20. Understanding Variance Disturbance decompositions P- Mean Flow θn+1 = θn + αn+1f(θn, Wn+1) = θn + αn+1 ¯ f(θn) + e Ξn Perturbative Mean Flow (details for fixed step-size) Two distinctions: 1. We remain in discrete time, and 2. We obtain an expression for the sum of the NOISE: N−1 X n=0 e Ξn = N+1 X n=2 Wn {Wn} is white—beautiful theory waiting for us Foundation of all is Poisson’s equation: b fn(θ) − E[ b fn+1(θ) | Fn]
  • 23. Understanding Variance Disturbance decompositions P- Mean Flow θn+1 = θn + αn+1f(θn, Wn+1) = θn + αn+1 ¯ f(θn) + e Ξn Perturbative Mean Flow (details for fixed step-size) Two distinctions: 1. We remain in discrete time, and 2. We obtain an expression for the sum of the NOISE: N−1 X n=0 e Ξn = N+1 X n=2 Wn + ∆ b fN {Wn} is white—beautiful theory waiting for us Uniformly bounded—adds O(1/N) in convergence rate. Foundation of all is Poisson’s equation: b fn(θ) − E[ b fn+1(θ) | Fn]
  • 26. Understanding Variance Disturbance decompositions P- Mean Flow θn+1 = θn + αn+1f(θn, Wn+1) = θn + αn+1 ¯ f(θn) + e Ξn Perturbative Mean Flow (details for fixed step-size) Two distinctions: 1. We remain in discrete time, and 2. We obtain an expression for the sum of the NOISE: N−1 X n=0 e Ξn = N+1 X n=2 Wn + ∆ b fN − α N X n=1 Υn {Wn} is white—beautiful theory waiting for us Uniformly bounded—adds O(1/N) in convergence rate. Bad news for bias: Υn = − 1 α b fn+1(θn) − b fn+1(θn−1) Foundation of all is Poisson’s equation: b fn(θ) − E[ b fn+1(θ) | Fn]
  • 29. Understanding Variance P- Mean Flow = ⇒ Polyak-Ruppert Averaging Design Implications θn+1 − θn = α ¯ f(θn) + e Ξn Constant Step-size N−1 X n=0 e Ξn = N+1 X n=2 Wn + ∆ b fN − α N X n=1 Υn E[kθnk2] uniformly bounded for 0 α ≤ α0 some α0 0 7 / 37
  • 30. Understanding Variance P- Mean Flow = ⇒ Polyak-Ruppert Averaging Design Implications θn+1 − θn = α ¯ f(θn) + e Ξn Constant Step-size N−1 X n=0 e Ξn = N+1 X n=2 Wn + ∆ b fN − α N X n=1 Υn E[kθnk2] uniformly bounded for 0 α ≤ α0 some α0 0 Justifies θn = Xn + O(α2) (approximation in mean-square) Xn+1 − Xn = α[A∗ [Xn − θ∗ ] + e Ξn] 7 / 37
  • 31. Understanding Variance P- Mean Flow = ⇒ Polyak-Ruppert Averaging Design Implications θn+1 − θn = α ¯ f(θn) + e Ξn Constant Step-size N−1 X n=0 e Ξn = N+1 X n=2 Wn + ∆ b fN − α N X n=1 Υn E[kθnk2] uniformly bounded for 0 α ≤ α0 some α0 0 Justifies θn = Xn + O(α2) (approximation in mean-square) N−1 X n=0 Xn+1 − Xn = N−1 X n=0 α[A∗ [Xn − θ∗ ] + e Ξn] 7 / 37
  • 32. Understanding Variance P- Mean Flow = ⇒ Polyak-Ruppert Averaging Design Implications θn+1 − θn = α ¯ f(θn) + e Ξn Constant Step-size N−1 X n=0 e Ξn = N+1 X n=2 Wn + ∆ b fN − α N X n=1 Υn E[kθnk2] uniformly bounded for 0 α ≤ α0 some α0 0 Justifies θn = Xn + O(α2) (approximation in mean-square) N−1 X n=0 Xn+1 − Xn = N−1 X n=0 α[A∗ [Xn − θ∗ ] + e Ξn] See a Polyak-Ruppert filter pop out? 7 / 37
  • 33. Understanding Variance P- Mean Flow = ⇒ Polyak-Ruppert Averaging Design Implications θn+1 − θn = α ¯ f(θn) + e Ξn Constant Step-size N−1 X n=0 e Ξn = N+1 X n=2 Wn + ∆ b fN − α N X n=1 Υn 1 N N−1 X n=0 θn+1 − θn = α 1 N N−1 X n=0 A∗ [θn − θ∗ ] + e Ξn + O(α2 ) 8 / 37
  • 34. Understanding Variance P- Mean Flow = ⇒ Polyak-Ruppert Averaging Design Implications θn+1 − θn = α ¯ f(θn) + e Ξn Constant Step-size N−1 X n=0 e Ξn = N+1 X n=2 Wn + ∆ b fN − α N X n=1 Υn 1 N N−1 X n=0 θn+1 − θn = α 1 N N−1 X n=0 A∗ [θn − θ∗ ] + e Ξn + O(α2 ) Polyak-Ruppert Representation θPR N = 1 N N−1 X n=0 θn 1 α 1 N ∆θN = A∗ [θPR N − θ∗ ] + 1 N N−1 X n=0 e Ξn + O(α2 ) 8 / 37
  • 35. Understanding Variance P- Mean Flow = ⇒ Polyak-Ruppert Averaging Design Implications θn+1 − θn = α ¯ f(θn) + e Ξn Constant Step-size Polyak-Ruppert Representation θPR N = 1 N N−1 X n=0 θn 1 α 1 N ∆θN = A∗ [θPR N − θ∗ ] + 1 N N−1 X n=0 e Ξn + O(α2 ) P-R representation = the statistical optimal + statistical annoying: θPR N = θ∗ + − + O((αN)−1 + α2 ) = [A∗ ]−1 1 N N X n=1 Wn = [A∗ ]−1 α N N X n=1 Υn 8 / 37
  • 36. Understanding Variance P- Mean Flow = ⇒ Polyak-Ruppert Averaging Design Implications θn+1 − θn = α ¯ f(θn) + e Ξn Constant Step-size Polyak-Ruppert Representation θPR N = 1 N N−1 X n=0 θn 1 α 1 N ∆θN = A∗ [θPR N − θ∗ ] + 1 N N−1 X n=0 e Ξn + O(α2 ) P-R representation = the statistical optimal + statistical annoying: θPR N = θ∗ + − + O((αN)−1 + α2 ) = [A∗ ]−1 1 N N X n=1 Wn = [A∗ ]−1 α N N X n=1 Υn Annoyance Υn: introduces bias of order O(α) (and variance not understood) Υn = − 1 α b fn+1(θn) − b fn+1(θn−1) ≈ −∂θ b fn+1(θn) · f(θn−1, Wn) 8 / 37
  • 37. Understanding Variance P- Mean Flow = ⇒ Polyak-Ruppert Averaging Design Implications θn+1 − θn = αn+1 ¯ f(θn) + e Ξn Vanishing Step-size Slightly different starting point: N−1 X n=0 1 αn+1 [θn+1 − θn] = N−1 X n=0 ¯ f(θn) + e Ξn = N−1 X n=0 A∗ [θn − θ∗ ] + e Ξn + O N−1 X n=0 α2 n+1 | {z } Assumed bounded 9 / 37
  • 38. Understanding Variance P- Mean Flow = ⇒ Polyak-Ruppert Averaging Design Implications θn+1 − θn = αn+1 ¯ f(θn) + e Ξn Vanishing Step-size Statistical memory =⇒ try αn = g/(1 + n/ne)ρ 1 2 ρ 1 Ignoring transients (and ignoring summation by parts calculation), θPR N = 1 N − N0 N−1 X n=N0 θn = θ∗ + + O(1/N) = 1 N − N0 [A∗ ]−1 N X n=N0+1 Wn Optimality of PR-Averaging Cov(θPR N − θ∗ ) = 1 N − N0 ΣPR N ΣPR N → ΣPR = [A∗]−1ΣW[A∗T ]−1 minimal in strongest sense 9 / 37
  • 39. Understanding Variance P- Mean Flow = ⇒ Polyak-Ruppert Averaging Asymptotic Statistics Optimality of PR-Averaging Cov(θPR N − θ∗ ) = 1 N − N0 ΣPR N ΣPR N → ΣPR = [A∗]−1ΣW[A∗T ]−1 minimal in strongest sense The Central Limit Theorem holds tremendous tool for validation, ... √ Nθ̃N Histogram from 103 independent runs 10 / 37
  • 40. αn = g/(1 + n/ne)ρ 0 5 10 5 10 5 10 0 20 40 60 80 100 0 0 20 40 60 80 100 0 0 20 40 60 80 100 ρ = 1.0 ρ = 0.9 ρ = 0.8 = 2.8 = 5.3 = 11 θk ϑτk τk τk τk τN τN τN Unveiling Dynamics
  • 41–46. Unveiling Dynamics | Two Sources of Error
SA error vs. mean flow: θ_{n+1} = θ_n + α_{n+1}[ f̄(θ_n) + "NOISE" ],   d/dt ϑ_t = f̄(ϑ_t)
1. Asymptotic covariance: with α_n = g/(1 + n/n_e)^ρ,
   (1/√α_n) [θ_n − ϑ_{τ_n}] ≈ N(0, Σ)  without averaging,  where τ_n = ∑_{k=1}^{n} α_k
2. ϑ_t → θ* exponentially fast, but τ_n is increasing slowly, and nonlinear dynamics can complicate gain selection.
What can happen in applications, using α_{n+1} = g/(1 + n/n_e)^ρ:
  θ_n far from θ*: the dynamics are slow, need large g
  θ_n ≈ θ*: best gain is far smaller
11 / 37
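The two error sources can be seen side by side by running SA and the Euler approximation of the mean flow on the same time grid τ_n. The scalar mean field below is an illustrative assumption, not the example from the slides.

```python
# Minimal sketch (illustrative scalar example): compare SA iterates θ_n with the
# mean flow ϑ evaluated at τ_n = Σ_{k≤n} α_k, and look at the gap scaled by √α_n.
import numpy as np

rng = np.random.default_rng(2)
f_bar = lambda theta: -theta                 # assumed mean field, θ* = 0, A* = -1
g, n_e, rho, N = 1.0, 100.0, 0.8, 5_000

theta, vartheta, tau = 10.0, 10.0, 0.0       # same initial condition
gap = np.zeros(N)
for n in range(N):
    alpha = g / (1.0 + n / n_e) ** rho
    theta += alpha * (f_bar(theta) + rng.normal())   # SA with additive noise
    vartheta += alpha * f_bar(vartheta)              # Euler step of dϑ/dt = f̄(ϑ) on the τ-grid
    tau += alpha
    gap[n] = theta - vartheta

alphas = g / (1.0 + np.arange(N // 2, N) / n_e) ** rho
print("final τ_N:", round(tau, 2))
print("std of (θ_n - ϑ_{τ_n})/√α_n over the last half:", np.std(gap[N // 2:] / np.sqrt(alphas)))
```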
  • 47–52. Unveiling Dynamics | Two Sources of Error. Example: SGD, α_n = g/(1 + n/n_e)^ρ
Stochastic gradient descent: L̄(θ) = E[L(θ, Φ_n)],  f̄(θ) = −∇L̄(θ),
  θ_{n+1} = θ_n − (g/n^ρ) ∇L(θ_n, Φ_{n+1}),   mean flow  d/dt ϑ_t = f̄(ϑ_t)
[figure: the scalar objective L̄(θ) and f̄(θ) on −20 ≤ θ ≤ 20, with labeled slopes −1, −4, and −0.34]
ODE bound, using ρ = 1 and setting n_e = 1:  |ϑ_t − θ*| ≤ |ϑ_0 − θ*| e^{−0.34 t}, so
  |ϑ_{τ_n} − θ*| ≤ |ϑ_0 − θ*| e^{0.34 g} n^{−0.34 g}
Need g ≥ 3 to kill the deterministic behavior, but g* = −[A*]^{-1} = 1/4 gives optimal variance.
[figure: dynamics for g* = 1/4, θ_k vs ϑ_{τ_k} for ρ = 1.0, 0.9, 0.8 (τ_N = 2.8, 5.3, 11); τ_N ≈ 3 for N = one million]
CLT approximation: rapid for θ_0 = 0, slow for θ_0 = 100.
12 / 37
  • 53. Unveiling Dynamics | Two Sources of Error. Example: SGD, α_n = g/(1 + n/n_e)^ρ
ODE bound, using ρ = 1 and setting n_e = 1:  |ϑ_{τ_n} − θ*| ≤ |ϑ_0 − θ*| e^{0.34 g} n^{−0.34 g}
Polyak-Ruppert to the rescue:
[figure: histograms of √N θ̃_N from Polyak-Ruppert averaging for ρ = 1.0, 0.9, 0.8, with big gain g = 1/0.34 and small gain g = 1/4 (σ_θ = 5/2)]
12 / 37
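The gain-selection tension can be seen in a toy scalar SGD experiment. The piecewise gradient below is a hypothetical stand-in (slope 4 near θ* = 0 and slope 0.34 far away), not the objective plotted on the slide, and the gains are illustrative. With g = 1/4 the deterministic transient from θ_0 = 100 is still large after 10⁵ samples, so both the last iterate and the PR average are poor; with g = 3 the transient is gone and PR averaging handles the remaining variance.

```python
# Hypothetical scalar SGD sketch in the spirit of slides 47–53 (illustrative
# piecewise gradient, not the loss on the slide).
import numpy as np

rng = np.random.default_rng(3)

def grad_Lbar(theta):
    # ∇L̄(θ): slope 4 for |θ| ≤ 1 (so A* = -4, g* = 1/4), slope 0.34 beyond
    if abs(theta) <= 1.0:
        return 4.0 * theta
    return np.sign(theta) * (4.0 + 0.34 * (abs(theta) - 1.0))

N = 100_000
for g in (3.0, 0.25):
    theta, running_sum = 100.0, 0.0          # poor initial condition θ_0 = 100
    for n in range(1, N + 1):
        alpha = g / n                         # ρ = 1, n_e = 1
        theta -= alpha * (grad_Lbar(theta) + rng.normal())   # noisy gradient
        running_sum += theta
    print(f"g = {g:4.2f}: last iterate = {theta:9.4f},  PR average = {running_sum / N:9.4f}")
```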
  • 54–55. Unveiling Dynamics | Two Sources of Error. Example: Tabular Q-Learning
g ≥ 1/(1 − γ) required, if using ρ = 1*.  Stay tuned!
Step-size: α_n = g / n(x, u)  (per state-action counts)
[figure: maximum Bellman error of Q_n over one million samples, for Q-learning with g = g_AD ("Awesome performance! (?)") and with g = 1/(1 − γ); generic tabular Q-learning example, discount factor γ]
*See Devraj & M 2017 (extension in Wainwright 2019); see also Szepesvári 1997
13 / 37
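For concreteness, here is a minimal asynchronous tabular Q-learning sketch using the per-state-action step-size α_n(x, u) = g/n(x, u) from the slide. The random MDP, the exploratory policy, the gain g = 1/(1 − γ), and the discount factor are all illustrative assumptions, not the example used to generate the plot above.

```python
# Minimal tabular Q-learning sketch on a hypothetical random MDP.
import numpy as np

rng = np.random.default_rng(4)
n_states, n_actions, gamma = 6, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # transition kernel
R = rng.normal(size=(n_states, n_actions))                          # rewards

g = 1.0 / (1.0 - gamma)              # gain suggested on the slide for ρ = 1
Q = np.zeros((n_states, n_actions))
counts = np.zeros((n_states, n_actions))

x = 0
for _ in range(200_000):
    u = rng.integers(n_actions)                      # exploratory (uniform) policy
    x_next = rng.choice(n_states, p=P[x, u])
    counts[x, u] += 1
    alpha = g / counts[x, u]                         # α_n(x, u) = g / n(x, u)
    td = R[x, u] + gamma * Q[x_next].max() - Q[x, u]
    Q[x, u] += alpha * td
    x = x_next

# Bellman error as a sanity check (cf. the max Bellman error curve on the slide)
BE = R + gamma * P @ Q.max(axis=1) - Q
print("max Bellman error:", np.abs(BE).max())
```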
  • 56. Zap
  • 57–59. Introduction to Zap | Newton-Raphson Flow: Taming Nonlinear Dynamics
What if stability of d/dt ϑ = f̄(ϑ) is unknown? [typically the case in RL]
Newton-Raphson flow [Smale 1976]. Idea: interpret f̄ as the "parameter": d/dt f̄_t = V(f̄_t), and design V.
Simplest choice: d/dt f̄_t = −f̄_t, i.e. d/dt f̄(ϑ_t) = −f̄(ϑ_t), giving f̄(ϑ_t) = f̄(ϑ_0) e^{−t}  (linear dynamics).
Equivalently, d/dt ϑ_t = −[A(ϑ_t)]^{-1} f̄(ϑ_t).
SA translation: Zap Stochastic Approximation.
15 / 37 (header) 14 / 37
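A sketch of the Newton-Raphson flow on an illustrative 2-d vector field (an assumption, not from the lecture): Euler integration of dϑ/dt = −[A(ϑ)]^{-1} f̄(ϑ) drives ‖f̄(ϑ_t)‖ to zero like e^{−t}, even though the ordinary mean flow dϑ/dt = f̄(ϑ) is not stable for this f̄.

```python
# Minimal sketch (illustrative 2-d vector field): Euler integration of the
# Newton-Raphson flow dϑ/dt = -[A(ϑ)]^{-1} f̄(ϑ).
import numpy as np

def f_bar(v):                        # assumed mean field, unique root at (1, -1)
    x, y = v
    return np.array([(x - 1.0) ** 3 + (x - 1.0) + 0.3 * (y + 1.0),
                     (y + 1.0) ** 3 + (y + 1.0)])

def jac(v):                          # A(ϑ) = ∂f̄/∂ϑ, invertible everywhere
    x, y = v
    return np.array([[3.0 * (x - 1.0) ** 2 + 1.0, 0.3],
                     [0.0, 3.0 * (y + 1.0) ** 2 + 1.0]])

dt, T = 0.001, 5.0
v0 = np.array([4.0, 2.0])
v = v0.copy()
for _ in range(int(T / dt)):
    v = v - dt * np.linalg.solve(jac(v), f_bar(v))    # Euler step of the NR flow

print("||f_bar(v_T)||          =", np.linalg.norm(f_bar(v)))
print("e^{-T} * ||f_bar(v_0)|| =", np.exp(-T) * np.linalg.norm(f_bar(v0)))
```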
  • 60–64. Introduction to Zap | Newton-Raphson Flow: Zap Algorithm
Designed to emulate the Newton-Raphson flow  d/dt ϑ_t = −[A(ϑ_t)]^{-1} f̄(ϑ_t),   A(θ) = ∂f̄(θ)/∂θ
Zap-SA (designed to emulate deterministic Newton-Raphson):
  θ_{n+1} = θ_n + α_{n+1} [−Â_{n+1}]^{-1} f(θ_n, ξ_{n+1})
  Â_{n+1} = Â_n + β_{n+1}(A_{n+1} − Â_n),   A_{n+1} = ∂_θ f(θ_n, ξ_{n+1})
Â_{n+1} ≈ A(θ_n) := ∂_θ f̄(θ_n) requires high gain: β_n/α_n → ∞ as n → ∞.
Can use α_n = 1/n (without averaging). Numerics to come use this choice, with β_n = (1/n)^ρ, ρ ∈ (0.5, 1).
Stability? Virtually universal. Optimal variance, too! Based on ancient theory from Ruppert & Polyak [80, 81, 79]
15 / 37
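A minimal Zap-SA sketch, under the same illustrative assumptions as the Newton-Raphson sketch above (the vector field, noise level, and gains are all hypothetical). Since the noise enters additively here, the sampled Jacobian ∂_θ f(θ_n, ξ_{n+1}) coincides with ∂_θ f̄(θ_n).

```python
# Minimal Zap-SA sketch (illustrative assumptions): α_n = 1/n, β_n = (1/n)^ρ,
# so β_n/α_n → ∞ as required for the matrix-gain estimate.
import numpy as np

rng = np.random.default_rng(5)

def f_bar(v):                        # assumed mean field, unique root at (1, -1)
    x, y = v
    return np.array([(x - 1.0) ** 3 + (x - 1.0) + 0.3 * (y + 1.0),
                     (y + 1.0) ** 3 + (y + 1.0)])

def jac(v):                          # ∂_θ f(θ, ξ) = ∂_θ f̄(θ) with additive noise
    x, y = v
    return np.array([[3.0 * (x - 1.0) ** 2 + 1.0, 0.3],
                     [0.0, 3.0 * (y + 1.0) ** 2 + 1.0]])

rho, N = 0.85, 50_000
theta, A_hat = np.array([4.0, 2.0]), np.eye(2)
for n in range(1, N + 1):
    alpha, beta = 1.0 / n, (1.0 / n) ** rho
    A_hat = A_hat + beta * (jac(theta) - A_hat)           # fast time-scale: Â_{n+1}
    noisy_f = f_bar(theta) + 0.5 * rng.normal(size=2)     # f(θ_n, ξ_{n+1})
    theta = theta + alpha * np.linalg.solve(-A_hat, noisy_f)   # slow time-scale
print("Zap-SA estimate:", theta, "  (true root at (1, -1))")
```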
  • 65–71. Conclusions | Thank you Lyapunov and Polyak!
[figure: "Crank up gain: tame transients" (θ_k vs ϑ_{τ_k}, ρ = 0.8, τ_k up to 100); "Average: tame variance" (histogram of √(N − N_0) θ̃^PR_N, outcomes from M repeated runs)]
Steps to a Successful Design
1. Design f̄ so that the mean flow d/dt ϑ = f̄(ϑ) is GAS, with Hurwitz A*
2. Design the step-size: α_n = g/(1 + n/n_e)^ρ, 1/2 < ρ < 1
3. Perform PR averaging: θ^PR_N = (1/(N − N_0)) ∑_{n=N_0+1}^{N} θ_n
4. Repeat! Obtain the histogram { √(N − N_0) θ̃^PR,(m)_N : 1 ≤ m ≤ M } with θ_0^(m) widely dispersed and N relatively small
What you will learn: how big N needs to be for a meaningful estimate; approximate confidence bounds.
Heuristic: θ^PR_N ≈ θ* + (1/√(N − N_0)) Z,   Z ∼ N(0, Σ^PR)
Next steps: a flash-crash control course! Applications to RL.
16 / 37
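The four-step recipe can be sketched end to end on the illustrative 2-d linear model used earlier (A*, θ*, gains, and run lengths below are assumptions): run M short experiments from widely dispersed initial conditions, PR-average each, and inspect the spread of √(N − N_0) θ̃^PR_N.

```python
# Sketch of the four-step recipe on an illustrative 2-d linear model.
import numpy as np

rng = np.random.default_rng(6)
A_star = np.array([[-1.0, 0.5], [0.0, -2.0]])        # Step 1: GAS mean flow, Hurwitz A*
theta_star = np.array([1.0, -1.0])
g, n_e, rho = 1.0, 50.0, 0.8                         # Step 2: α_n = g/(1 + n/n_e)^ρ, ρ ∈ (1/2, 1)
N, N0, M = 10_000, 2_000, 200                        # relatively small N, burn-in N0, M runs

scaled_errors = np.zeros((M, 2))
for m in range(M):
    theta = rng.uniform(-50.0, 50.0, size=2)         # widely dispersed initial condition
    tail_sum = np.zeros(2)
    for n in range(1, N + 1):
        alpha = g / (1.0 + n / n_e) ** rho
        theta = theta + alpha * (A_star @ (theta - theta_star) + rng.normal(size=2))
        if n > N0:
            tail_sum += theta
    theta_pr = tail_sum / (N - N0)                   # Step 3: PR averaging over n > N0
    scaled_errors[m] = np.sqrt(N - N0) * (theta_pr - theta_star)

# Step 4: the histogram spread gives approximate confidence bounds, via the
# heuristic θ^PR_N ≈ θ* + Z/sqrt(N - N0), Z ~ N(0, Σ^PR).
print("empirical covariance of sqrt(N - N0)*θ̃^PR_N:\n", np.cov(scaled_errors.T))
```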
  • 72. References
[book covers: Control Techniques for Complex Networks (Sean Meyn) and Markov Chains and Stochastic Stability (S. P. Meyn and R. L. Tweedie)]
17 / 37
  • 73. References Control Background I [1] K. J. Åström and R. M. Murray. Feedback Systems: An Introduction for Scientists and Engineers. Princeton University Press, USA, 2008 (recent edition on-line). [2] K. J. Åström and B. Wittenmark. Adaptive Control. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2nd edition, 1994. [3] A. Fradkov and B. T. Polyak. Adaptive and robust control in the USSR. IFAC–PapersOnLine, 53(2):1373–1378, 2020. 21st IFAC World Congress. [4] M. Krstic, P. V. Kokotovic, and I. Kanellakopoulos. Nonlinear and adaptive control design. John Wiley & Sons, Inc., 1995. [5] K. J. Åström. Theory and applications of adaptive control—a survey. Automatica, 19(5):471–486, 1983. [6] K. J. Åström. Adaptive control around 1960. IEEE Control Systems Magazine, 16(3):44–49, 1996. [7] L. Ljung. Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, 22(4):551–575, 1977. 18 / 37
  • 74. References Control Background II [8] N. Matni, A. Proutiere, A. Rantzer, and S. Tu. From self-tuning regulators to reinforcement learning and back again. In Proc. of the IEEE Conf. on Dec. and Control, pages 3724–3740, 2019. 19 / 37
  • 75. References RL Background I [9] S. Meyn. Control Systems and Reinforcement Learning. Cambridge University Press, Cambridge, 2021. [10] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press. On-line edition at http://www.cs.ualberta.ca/~sutton/book/the-book.html, Cambridge, MA, 2nd edition, 2018. [11] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010. [12] D. P. Bertsekas. Reinforcement learning and optimal control. Athena Scientific, Belmont, MA, 2019. [13] T. Lattimore and C. Szepesvari. Bandit Algorithms. Cambridge University Press, 2020. [14] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn., 3(1):9–44, 1988. [15] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992. [16] J. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994. 20 / 37
  • 76. References RL Background II [17] T. Jaakkola, M. Jordan, and S. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185–1201, 1994. [18] B. Van Roy. Learning and Value Function Approximation in Complex Decision Processes. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, 1998. [19] J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Mach. Learn., 22(1-3):59–94, 1996. [20] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997. [21] J. N. Tsitsiklis and B. V. Roy. Average cost temporal-difference learning. Automatica, 35(11):1799–1808, 1999. [22] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999. [23] D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning. Discrete Event Dynamic Systems: Theory and Applications, 16(2):207–239, 2006. 21 / 37
  • 77. References RL Background III [24] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Mach. Learn., 22(1-3):33–57, 1996. [25] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn., 49(2-3):233–246, 2002. [26] A. Nedic and D. Bertsekas. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dyn. Systems: Theory and Appl., 13(1-2):79–110, 2003. [27] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th Internat. Conf. on Neural Info. Proc. Systems, 1064–1070. MIT Press, 1997. [28] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25, 2003. [29] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In Advances in Neural Information Processing Systems, 2011. [30] D. Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana. Feature selection for neuro-dynamic programming. In F. Lewis, editor, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. Wiley, 2011. 22 / 37
  • 78. References RL Background IV [31] A. M. Devraj and S. P. Meyn. Zap Q-learning. In Proc. of the Intl. Conference on Neural Information Processing Systems, pages 2232–2241, 2017. [32] S. Chen, A. M. Devraj, F. Lu, A. Busic, and S. Meyn. Zap Q-Learning with nonlinear function approximation. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, and arXiv e-prints 1910.05405, volume 33, pages 16879–16890, 2020. [33] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007. See last chapter on simulation and average-cost TD learning DQN: [34] M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge, and L. Torgo, editors, Machine Learning: ECML 2005, pages 317–328, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg. [35] S. Lange, T. Gabel, and M. Riedmiller. Batch reinforcement learning. In Reinforcement learning, pages 45–73. Springer, 2012. [36] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller. Playing Atari with deep reinforcement learning. ArXiv, abs/1312.5602, 2013. 23 / 37
  • 79. References RL Background V [37] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015. [38] O. Anschel, N. Baram, and N. Shimkin. Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In Proc. of ICML, pages 176–185. JMLR.org, 2017. Actor Critic / Policy Gradient [39] P. J. Schweitzer. Perturbation theory and finite Markov chains. J. Appl. Prob., 5:401–403, 1968. [40] C. D. Meyer, Jr. The role of the group generalized inverse in the theory of finite Markov chains. SIAM Review, 17(3):443–464, 1975. [41] P. W. Glynn. Stochastic approximation for Monte Carlo optimization. In Proceedings of the 18th conference on Winter simulation, pages 356–365, 1986. [42] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992. 24 / 37
  • 80. References RL Background VI [43] T. Jaakkola, S. P. Singh, and M. I. Jordan. Reinforcement learning algorithm for partially observable Markov decision problems. In Advances in neural information processing systems, pages 345–352, 1995. [44] X.-R. Cao and H.-F. Chen. Perturbation realization, potentials, and sensitivity analysis of Markov processes. IEEE Transactions on Automatic Control, 42(10):1382–1393, Oct 1997. [45] P. Marbach and J. N. Tsitsiklis. Simulation-based optimization of Markov reward processes. IEEE Trans. Automat. Control, 46(2):191–209, 2001. [46] V. Konda. Actor-critic algorithms. PhD thesis, Massachusetts Institute of Technology, 2002. [47] V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000. [48] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000. [49] P. Marbach and J. N. Tsitsiklis. Simulation-based optimization of Markov reward processes. IEEE Trans. Automat. Control, 46(2):191–209, 2001. 25 / 37
  • 81. References RL Background VII [50] S. M. Kakade. A natural policy gradient. In Advances in neural information processing systems, pages 1531–1538, 2002. [51] H. Mania, A. Guy, and B. Recht. Simple random search provides a competitive approach to reinforcement learning. In Advances in Neural Information Processing Systems, pages 1800–1809, 2018. MDPs, LPs and Convex Q: [52] A. S. Manne. Linear programming and sequential decisions. Management Sci., 6(3):259–267, 1960. [53] C. Derman. Finite State Markovian Decision Processes, volume 67 of Mathematics in Science and Engineering. Academic Press, Inc., 1970. [54] V. S. Borkar. Convex analytic methods in Markov decision processes. In Handbook of Markov decision processes, volume 40 of Internat. Ser. Oper. Res. Management Sci., pages 347–375. Kluwer Acad. Publ., Boston, MA, 2002. [55] D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Res., 51(6):850–865, 2003. 26 / 37
  • 82. References RL Background VIII [56] D. P. de Farias and B. Van Roy. A cost-shaping linear program for average-cost approximate dynamic programming with performance guarantees. Math. Oper. Res., 31(3):597–620, 2006. [57] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In Proc. of the IEEE Conf. on Dec. and Control, pages 3598–3605, Dec. 2009. [58] J. Bas Serrano, S. Curi, A. Krause, and G. Neu. Logistic Q-learning. In A. Banerjee and K. Fukumizu, editors, Proc. of The Intl. Conference on Artificial Intelligence and Statistics, volume 130, pages 3610–3618, 13–15 Apr 2021. [59] F. Lu, P. G. Mehta, S. P. Meyn, and G. Neu. Convex Q-learning. In American Control Conf., pages 4749–4756. IEEE, 2021. [60] F. Lu, P. G. Mehta, S. P. Meyn, and G. Neu. Convex analytic theory for convex Q-learning. In Conference on Decision and Control–to appear. IEEE, 2022. Gator Nation: [61] A. M. Devraj, A. Bušić, and S. Meyn. Fundamental design principles for reinforcement learning algorithms. In K. G. Vamvoudakis, Y. Wan, F. L. Lewis, and D. Cansever, editors, Handbook on Reinforcement Learning and Control, Studies in Systems, Decision and Control series (SSDC, volume 325). Springer, 2021. 27 / 37
  • 83. References RL Background IX [62] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv , July 2017 (extended version of NIPS 2017). [63] A. M. Devraj. Reinforcement Learning Design with Optimal Learning Rate. PhD thesis, University of Florida, 2019. [64] A. M. Devraj and S. P. Meyn. Q-learning with uniformly bounded variance: Large discounting is not a barrier to fast learning. IEEE Trans Auto Control (and arXiv:2002.10301), 2021. [65] A. M. Devraj, A. Bušić, and S. Meyn. On matrix momentum stochastic approximation and applications to Q-learning. In Allerton Conference on Communication, Control, and Computing, pages 749–756, Sep 2019. 28 / 37
  • 84. References Stochastic Miscellanea I [66] S. Asmussen and P. W. Glynn. Stochastic Simulation: Algorithms and Analysis, volume 57 of Stochastic Modelling and Applied Probability. Springer-Verlag, New York, 2007. [67] P. W. Glynn and S. P. Meyn. A Liapounov bound for solutions of the Poisson equation. Ann. Probab., 24(2):916–931, 1996. [68] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge University Press, Cambridge, second edition, 2009. Published in the Cambridge Mathematical Library. [69] R. Douc, E. Moulines, P. Priouret, and P. Soulier. Markov Chains. Springer, 2018. 29 / 37
  • 85. References Stochastic Approximation I [70] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency and Cambridge University Press, Delhi, India Cambridge, UK, 2008. [71] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1990. Translated from the French by Stephen S. Wilson. [72] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000. [73] V. Borkar, S. Chen, A. Devraj, I. Kontoyiannis, and S. Meyn. The ODE method for asymptotic statistics in stochastic approximation and reinforcement learning. arXiv e-prints:2110.14427, pages 1–50, 2021. [74] M. Benaı̈m. Dynamics of stochastic approximation algorithms. In Séminaire de Probabilités, XXXIII, pages 1–68. Springer, Berlin, 1999. [75] V. Borkar and S. P. Meyn. Oja’s algorithm for graph clustering, Markov spectral decomposition, and risk sensitive control. Automatica, 48(10):2512–2519, 2012. [76] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. Ann. Math. Statist., 23(3):462–466, 09 1952. 30 / 37
  • 86. References Stochastic Approximation II [77] J. C. Spall. Introduction to stochastic search and optimization: estimation, simulation, and control. John Wiley & Sons, 2003. [78] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure. The Annals of Statistics, 13(1):236–245, 1985. [79] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro processes. Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research and Industrial Engineering, Ithaca, NY, 1988. [80] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i telemekhanika, pages 98–107, 1990 (in Russian). Translated in Automat. Remote Control, 51 1991. [81] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, 1992. [82] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. Ann. Appl. Probab., 14(2):796–819, 2004. [83] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems 24, pages 451–459. Curran Associates, Inc., 2011. 31 / 37
  • 87. References Stochastic Approximation III [84] W. Mou, C. Junchi Li, M. J. Wainwright, P. L. Bartlett, and M. I. Jordan. On Linear Stochastic Approximation: Fine-grained Polyak-Ruppert and Non-Asymptotic Concentration. arXiv e-prints, page arXiv:2004.04719, Apr. 2020. 32 / 37
  • 88. References Optimization and ODEs I [85] W. Su, S. Boyd, and E. Candes. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in neural information processing systems, pages 2510–2518, 2014. [86] B. Shi, S. S. Du, W. Su, and M. I. Jordan. Acceleration via symplectic discretization of high-resolution differential equations. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 5744–5752. Curran Associates, Inc., 2019. [87] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964. [88] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, 1983. 33 / 37
  • 89. References QSA and Extremum Seeking Control I [89] C. K. Lauand and S. Meyn. Extremely fast convergence rates for extremum seeking control with Polyak-Ruppert averaging. arXiv 2206.00814, 2022. [90] C. K. Lauand and S. Meyn. Markovian foundations for quasi stochastic approximation with applications to extremum seeking control, 2022. [91] B. Lapeyre, G. Pages, and K. Sab. Sequences with low discrepancy generalisation and application to Robbins-Monro algorithm. Statistics, 21(2):251–272, 1990. [92] S. Laruelle and G. Pagès. Stochastic approximation with averaging innovation applied to finance. Monte Carlo Methods and Applications, 18(1):1–51, 2012. [93] S. Shirodkar and S. Meyn. Quasi stochastic approximation. In Proc. of the 2011 American Control Conference (ACC), pages 2429–2435, July 2011. [94] S. Chen, A. Devraj, A. Bernstein, and S. Meyn. Revisiting the ODE method for recursive algorithms: Fast convergence using quasi stochastic approximation. Journal of Systems Science and Complexity, 34(5):1681–1702, 2021. [95] Y. Chen, A. Bernstein, A. Devraj, and S. Meyn. Model-Free Primal-Dual Methods for Network Optimization with Application to Real-Time Optimal Power Flow. In Proc. of the American Control Conf., pages 3140–3147, Sept. 2019. 34 / 37
  • 90. References QSA and Extremum Seeking Control II [96] S. Bhatnagar and V. S. Borkar. Multiscale chaotic spsa and smoothed functional algorithms for simulation optimization. Simulation, 79(10):568–580, 2003. [97] S. Bhatnagar, M. C. Fu, S. I. Marcus, and I.-J. Wang. Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences. ACM Transactions on Modeling and Computer Simulation (TOMACS), 13(2):180–209, 2003. [98] M. Le Blanc. Sur l’electrification des chemins de fer au moyen de courants alternatifs de frequence elevee [On the electrification of railways by means of alternating currents of high frequency]. Revue Generale de l’Electricite, 12(8):275–277, 1922. [99] Y. Tan, W. H. Moase, C. Manzie, D. Nešić, and I. M. Y. Mareels. Extremum seeking from 1922 to 2010. In Proceedings of the 29th Chinese Control Conference, pages 14–26, July 2010. [100] P. F. Blackman. Extremum-seeking regulators. In An Exposition of Adaptive Control. Macmillan, 1962. [101] J. Sternby. Adaptive control of extremum systems. In H. Unbehauen, editor, Methods and Applications in Adaptive Control, pages 151–160, Berlin, Heidelberg, 1980. Springer Berlin Heidelberg. 35 / 37
  • 91. References QSA and Extremum Seeking Control III [102] J. Sternby. Extremum control systems–an area for adaptive control? In Joint Automatic Control Conference, number 17, page 8, 1980. [103] K. B. Ariyur and M. Krstić. Real Time Optimization by Extremum Seeking Control. John Wiley & Sons, Inc., New York, NY, USA, 2003. [104] M. Krstić and H.-H. Wang. Stability of extremum seeking feedback for general nonlinear dynamic systems. Automatica, 36(4):595 – 601, 2000. [105] S. Liu and M. Krstic. Introduction to extremum seeking. In Stochastic Averaging and Stochastic Extremum Seeking, Communications and Control Engineering. Springer, London, 2012. [106] O. Trollberg and E. W. Jacobsen. On the convergence rate of extremum seeking control. In European Control Conference (ECC), pages 2115–2120. 2014. [107] Y. Bugeaud. Linear forms in logarithms and applications, volume 28 of IRMA Lectures in Mathematics and Theoretical Physics. EMS Press, 2018. [108] G. Wüstholz, editor. A Panorama of Number Theory—Or—The View from Baker’s Garden. Cambridge University Press, 2002. 36 / 37
  • 92. References Selected Applications I [109] N. S. Raman, A. M. Devraj, P. Barooah, and S. P. Meyn. Reinforcement learning for control of building HVAC systems. In American Control Conference, July 2020. [110] K. Mason and S. Grijalva. A review of reinforcement learning for autonomous building energy management. arXiv.org, 2019. arXiv:1903.05196. 37 / 37