Estimation of the score vector and observed
information matrix in intractable models
Arnaud Doucet (University of Oxford)
Pierre E. Jacob (University of Oxford)
Sylvain Rubenthaler (Université Nice Sophia Antipolis)
October 30th, 2014
Outline
1 Context
2 General results and connections
3 Posterior concentration when the prior concentrates
4 Hidden Markov models
Motivation
Derivatives of the likelihood can help with optimization and sampling.
For many complex models, the likelihood isn’t available, let alone its
derivatives.
One can resort to approximation techniques, and plug the estimates
of the derivatives into optimization / sampling methods.
Using derivatives in sampling algorithms
Metropolis-Adjusted Langevin Algorithm (MALA)
At step t, given a point θt ∈ Θ, do:
propose
    θ⋆ ∼ q(dθ | θt) ≡ N( θt + (σ²/2) ∇θ log π(θt), σ² ),
then, with probability
    1 ∧ [ π(θ⋆) q(θt | θ⋆) ] / [ π(θt) q(θ⋆ | θt) ],
set θt+1 = θ⋆; otherwise set θt+1 = θt.
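As an illustration of how such derivative estimates get used, here is a minimal sketch of one MALA step in Python (log_pi and grad_log_pi are hypothetical user-supplied functions returning log π(θ) and ∇θ log π(θ); in the intractable setting below they would be replaced by estimates):

```python
import numpy as np

def mala_step(theta, log_pi, grad_log_pi, sigma2, rng):
    # Langevin proposal: shift the current point along the gradient, add Gaussian noise.
    mean_fwd = theta + 0.5 * sigma2 * grad_log_pi(theta)
    proposal = mean_fwd + np.sqrt(sigma2) * rng.standard_normal(theta.shape)
    # Reverse proposal mean, needed for the Metropolis-Hastings ratio.
    mean_bwd = proposal + 0.5 * sigma2 * grad_log_pi(proposal)
    log_q_fwd = -0.5 * np.sum((proposal - mean_fwd) ** 2) / sigma2
    log_q_bwd = -0.5 * np.sum((theta - mean_bwd) ** 2) / sigma2
    log_alpha = log_pi(proposal) + log_q_bwd - log_pi(theta) - log_q_fwd
    if np.log(rng.uniform()) < log_alpha:
        return proposal          # accept
    return theta                 # reject

# Toy usage on a standard normal target.
rng = np.random.default_rng(0)
theta = np.zeros(2)
for _ in range(1000):
    theta = mala_step(theta, lambda t: -0.5 * np.sum(t ** 2), lambda t: -t, 0.5, rng)
```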
Using derivatives in sampling algorithms
Figure : Proposal mechanism for random walk Metropolis–Hastings.
Using derivatives in sampling algorithms
Figure : Proposal mechanism for MALA.
Using derivatives in sampling algorithms
In what sense is this algorithm better?
Scaling with the dimension of the state space
For Metropolis–Hastings, optimal scaling leads to
    σ² = O(d^{-1});
for MALA, optimal scaling leads to
    σ² = O(d^{-1/3}).
Roberts & Rosenthal, Optimal Scaling for Various Metropolis-Hastings
Algorithms, 2001.
Hidden Markov models
Figure : Graph representation of a general hidden Markov model (hidden states X0, X1, . . . , XT, observations y1, . . . , yT, parameter θ).
Hidden process: initial distribution µθ, transition fθ.
Observations conditional upon the hidden process, from gθ.
Assumptions
Input:
Parameter θ : unknown, prior distribution p.
Initial condition µθ(dx0) : can be sampled from.
Transition fθ(dxt|xt−1) : can be sampled from.
Measurement gθ(yt|xt) : can be evaluated point-wise.
Observations y1:T = (y1, . . . , yT ).
Goals:
score: ∇θ log L(θ; y1:T) for any θ,
observed information matrix: −∇²θ log L(θ; y1:T) for any θ.
Then we could apply any fancy sampling algorithm.
Why is it an intractable model?
The likelihood function does not admit a closed form expression:
L(θ; y1, . . . , yT) = ∫_{X^{T+1}} p(y1, . . . , yT | x0, . . . , xT, θ) p(dx0, . . . , dxT | θ)
                     = ∫_{X^{T+1}} ∏_{t=1}^{T} gθ(yt | xt) µθ(dx0) ∏_{t=1}^{T} fθ(dxt | xt−1).
Hence the likelihood can only be estimated, e.g. by standard Monte
Carlo, or by particle filters.
What about the derivatives of the likelihood?
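To make "estimated by particle filters" concrete, here is a minimal bootstrap particle filter sketch returning an estimate of log L(θ; y1:T) (the underlying likelihood estimate is unbiased); sample_init, sample_transition and log_obs_density are hypothetical placeholders standing in for µθ, fθ and gθ:

```python
import numpy as np

def pf_loglik(y, sample_init, sample_transition, log_obs_density, n_particles, rng):
    # Draw x0 particles from µ_θ.
    x = sample_init(n_particles, rng)
    loglik = 0.0
    for t in range(len(y)):
        # Propagate through f_θ and weight by g_θ(y_t | x_t).
        x = sample_transition(x, rng)
        logw = log_obs_density(y[t], x)
        maxw = np.max(logw)
        w = np.exp(logw - maxw)
        loglik += maxw + np.log(np.mean(w))
        # Multinomial resampling.
        idx = rng.choice(n_particles, size=n_particles, p=w / np.sum(w))
        x = x[idx]
    return loglik
```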
Fisher and Louis’ identities
Write the score as
    ∇ℓ(θ) = ∫ ∇ log p(x0:T, y1:T | θ) p(dx0:T | y1:T, θ),
which is an integral, with respect to the smoothing distribution p(dx0:T | y1:T, θ), of
    ∇ log p(x0:T, y1:T | θ) = ∇ log µθ(x0) + Σ_{t=1}^{T} ∇ log fθ(xt | xt−1) + Σ_{t=1}^{T} ∇ log gθ(yt | xt).
However pointwise evaluations of ∇ log µθ(x0) and ∇ log fθ(xt | xt−1) are
not always available.
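When those gradients are available, Fisher's identity turns smoothing draws into a score estimate; a sketch for a scalar parameter, assuming M smoothing trajectories and hypothetical gradient functions (none of these names come from the slides):

```python
import numpy as np

def score_fisher(x_paths, y, grad_log_mu, grad_log_f, grad_log_g):
    # x_paths: array of shape (M, T+1) holding M draws of x_{0:T} from p(dx_{0:T} | y_{1:T}, θ).
    M, T_plus_1 = x_paths.shape
    score = 0.0
    for m in range(M):
        s = grad_log_mu(x_paths[m, 0])
        for t in range(1, T_plus_1):
            s += grad_log_f(x_paths[m, t], x_paths[m, t - 1])   # ∇ log f_θ(x_t | x_{t-1})
            s += grad_log_g(y[t - 1], x_paths[m, t])            # ∇ log g_θ(y_t | x_t)
        score += s / M                                          # average over trajectories
    return score
```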
New kid on the block: Iterated Filtering
Perturbed model
Hidden states X̃t = (θ̃t, Xt):
    θ̃0 ∼ N(θ0, τ² Σ),   X0 ∼ µ_{θ̃0}(·),
and
    θ̃t ∼ N(θ̃t−1, σ² Σ),   Xt ∼ f_{θ̃t}(· | Xt−1 = xt−1).
Observations Ỹt ∼ g_{θ̃t}(· | Xt).
Score estimate
    | Σ_{t=1}^{T} VP,t⁻¹ ( θ̃F,t − θ̃F,t−1 ) − ∇ℓ(θ0) | ≤ C ( τ + σ²/τ² ),
with VP,t = Cov[θ̃t | y1:t−1] and θ̃F,t = E[θ̃t | y1:t].
Ionides, Breto, King, PNAS, 2006.
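A sketch of the score estimate above, assuming the filtering means θ̃F,t and prediction variances VP,t have already been computed for a scalar parameter (for instance by running a particle filter on the perturbed model); this is only an illustration of the formula, not the authors' implementation:

```python
import numpy as np

def if_score_estimate(theta_F, V_P, theta0):
    # theta_F[t] ≈ E[θ̃_t | y_{1:t}], V_P[t] ≈ Cov[θ̃_t | y_{1:t-1}], both of length T.
    estimate = 0.0
    previous = theta0
    for t in range(len(theta_F)):
        estimate += (theta_F[t] - previous) / V_P[t]
        previous = theta_F[t]
    return estimate
```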
Iterated Filtering: the mystery
Why is it valid?
Is it related to any known techniques for derivative estimation?
How does it compare to other methods such as finite difference?
Can it be extended to estimate the observed information matrix?
Outline
1 Context
2 General results and connections
3 Posterior concentration when the prior concentrates
4 Hidden Markov models
Proximity mapping
Given a real function f and a point θ0, consider, for any σ² > 0,
    θ → f(θ) exp{ − (θ − θ0)² / (2σ²) }.
Figure : Example for f : θ → exp(−|θ|) and three values of σ².
Proximity mapping
Proximity mapping
The σ²-proximity mapping is defined by
    prox_f : θ0 → argmax_{θ∈R} f(θ) exp{ − (θ − θ0)² / (2σ²) }.
Moreau approximation
The σ²-Moreau approximation is defined by
    f_{σ²} : θ0 → C sup_{θ∈R} f(θ) exp{ − (θ − θ0)² / (2σ²) },
where C is a normalizing constant.
Proximity mapping
Figure : θ → f(θ) and θ → f_{σ²}(θ) for three values of σ².
Proximity mapping
Property
Those objects are such that
    ( prox_f(θ0) − θ0 ) / σ² = ∇ log f_{σ²}(θ0) → ∇ log f(θ0)   as σ² → 0.
Moreau (1962), Fonctions convexes duales et points proximaux dans un
espace Hilbertien.
Pereyra (2013), Proximal Markov chain Monte Carlo algorithms.
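A quick numerical sketch of the proximity mapping and this limit on the slides' example f(θ) = exp(−|θ|), maximizing over a grid (the grid bounds, resolution and σ² values are arbitrary choices for illustration):

```python
import numpy as np

def prox(log_f, theta0, sigma2, grid):
    # argmax over the grid of log f(θ) - (θ - θ0)² / (2σ²).
    penalized = log_f(grid) - 0.5 * (grid - theta0) ** 2 / sigma2
    return grid[np.argmax(penalized)]

log_f = lambda t: -np.abs(t)                      # f(θ) = exp(-|θ|)
grid = np.linspace(-5.0, 5.0, 200001)
theta0 = 1.5
for sigma2 in [1.0, 0.1, 0.01]:
    approx_score = (prox(log_f, theta0, sigma2, grid) - theta0) / sigma2
    print(sigma2, approx_score)                   # approaches ∇ log f(θ0) = -1
```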
Proximity mapping
Bayesian interpretation
If f is seen as a likelihood function, then
    θ → f(θ) exp{ − (θ − θ0)² / (2σ²) }
is an unnormalized posterior density function based on a Normal prior with mean θ0 and variance σ².
Hence
    ( prox_f(θ0) − θ0 ) / σ² → ∇ log f(θ0)   as σ² → 0
can be read as
    (posterior mode − prior mode) / prior variance ≈ score.
Iterated Filtering
Posterior expectation instead of mode
Based on a prior θ ∼ N(θ0, σ²),
    | σ⁻² ( E[θ | Y] − θ0 ) − ∇ log f(θ0) | ≤ C σ².
Phrased simply,
    (posterior mean − prior mean) / prior variance ≈ score.
Result from Ionides, Bhadra, Atchadé, King, Iterated filtering, 2011.
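A Monte Carlo sketch of this relation on a toy one-dimensional log-likelihood whose score is known in closed form (the particular ℓ below is made up purely for illustration); the posterior moments are obtained by self-normalized importance sampling from the prior:

```python
import numpy as np

rng = np.random.default_rng(1)
loglik = lambda t: -0.5 * (t - 2.0) ** 2 + np.sin(t)   # toy ℓ(θ)
score = lambda t: -(t - 2.0) + np.cos(t)               # exact ∇ℓ(θ)

theta0, sigma2, n = 0.3, 0.01, 10**6
theta = theta0 + np.sqrt(sigma2) * rng.standard_normal(n)   # draws from the N(θ0, σ²) prior
ll = loglik(theta)
w = np.exp(ll - np.max(ll))                                 # self-normalized importance weights
post_mean = np.sum(w * theta) / np.sum(w)
print((post_mean - theta0) / sigma2, score(theta0))         # the two numbers should be close
```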
Extension of Iterated Filtering
Observed information matrix
Second-order moments give second-order derivatives:
    | σ⁻⁴ ( Cov[θ | Y] − σ² ) − ∇² log f(θ0) | ≤ C σ².
Phrased simply,
    (posterior variance − prior variance) / prior variance² ≈ −observed information matrix.
Result from Doucet, Jacob, Rubenthaler on arXiv, 2013.
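Continuing the toy sketch above, the second-order relation can be checked from the same weighted sample (for that toy ℓ, ∇²ℓ(θ) = −1 − sin(θ), so the observed information at θ0 is 1 + sin(θ0)):

```python
# Reuses theta, w, post_mean, theta0 and sigma2 from the previous sketch.
post_var = np.sum(w * (theta - post_mean) ** 2) / np.sum(w)
info_estimate = -(post_var - sigma2) / sigma2 ** 2
print(info_estimate, 1.0 + np.sin(theta0))   # roughly equal, up to O(σ²) bias and Monte Carlo noise
```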
A connection with Stein’s method
Stein’s lemma states that
    θ ∼ N(θ0, σ²)
if and only if, for any function g such that the following objects exist,
    E[(θ − θ0) g(θ)] = σ² E[∇g(θ)].
If we choose the function g : θ → exp ℓ(θ) / E[exp ℓ(θ)] and apply Stein’s lemma, we obtain
    E[(θ − θ0) g(θ)] = σ² E[∇g(θ)] = σ² E[∇ℓ(θ) exp(ℓ(θ))] / E[exp ℓ(θ)].
A connection with Stein’s method
Hence we obtain
    E[(θ − θ0) exp ℓ(θ)] / E[exp ℓ(θ)] = σ² E[∇ℓ(θ) exp(ℓ(θ))] / E[exp ℓ(θ)].
On the left we have E[θ | Y] − θ0. On the right we have σ² E[∇ℓ(θ) | Y].
When σ² → 0, E[∇ℓ(θ) | Y] should go to ∇ℓ(θ0).
The Iterated Filtering method indeed relies on the approximation
    E[θ | Y] − θ0 ≈ σ² ∇ℓ(θ0).
Outline
1 Context
2 General results and connections
3 Posterior concentration when the prior concentrates
4 Hidden Markov models
Core Idea
Let’s take an informal look at the proofs, in one-dimensional notation.
Introduce a normal prior distribution: N(θ0, σ²).
Posterior concentration induced by the prior
Under minimal assumptions, when σ → 0:
the posterior is going to look more and more like the prior,
the difference between the posterior and the prior moments is related to the derivatives of the log-likelihood:
    | ∇ℓ(θ0) − σ⁻² { E(θ | y) − θ0 } | ≤ C σ²,
    | ∇²ℓ(θ0) − σ⁻⁴ { Cov(θ | y) − σ² } | ≤ C′ σ².
Details
Assumptions
1 Prior p(θ) = σ^{-d} κ((θ − θ0)/σ), where κ is symmetric, has finite moments of all orders, and unit variance.
2 κ has tails that decrease at a faster rate than the likelihood increases.
3 The log-likelihood is four times continuously differentiable.
Introduce a test function h such that |h(u)| < c |u|^α for some c, α.
Details
We start by writing
    E{ h(θ − θ0) | y } = ∫ h(σu) exp{ ℓ(θ0 + σu) − ℓ(θ0) } κ(u) du / ∫ exp{ ℓ(θ0 + σu) − ℓ(θ0) } κ(u) du,
using u = (θ − θ0)/σ, and then focus on the numerator
    ∫ h(σu) exp{ ℓ(θ0 + σu) − ℓ(θ0) } κ(u) du,
since the denominator is a particular instance of this expression with h : u → 1.
Details
For the numerator:
    ∫ h(σu) exp{ ℓ(θ0 + σu) − ℓ(θ0) } κ(u) du,
we use a Taylor expansion of ℓ around θ0 and a Taylor expansion of exp around 0, and then take the integral with respect to κ.
Notation:
    ℓ^{(k)}(θ).u^{⊗k} = Σ_{1≤i1,...,ik≤d} [ ∂^k ℓ(θ) / (∂θ_{i1} · · · ∂θ_{ik}) ] u_{i1} · · · u_{ik},
which in one dimension becomes
    ℓ^{(k)}(θ).u^{⊗k} = (d^k ℓ(θ) / dθ^k) u^k.
Details
Main expansion:
    ∫ h(σu) exp{ ℓ(θ0 + σu) − ℓ(θ0) } κ(u) du
      = ∫ h(σu) κ(u) du + σ ∫ h(σu) ℓ^{(1)}(θ0).u κ(u) du
        + σ² ∫ h(σu) { ½ ℓ^{(2)}(θ0).u^{⊗2} + ½ (ℓ^{(1)}(θ0).u)² } κ(u) du
        + σ³ ∫ h(σu) { (1/3!) (ℓ^{(1)}(θ0).u)³ + ½ (ℓ^{(1)}(θ0).u)(ℓ^{(2)}(θ0).u^{⊗2}) + (1/3!) ℓ^{(3)}(θ0).u^{⊗3} } κ(u) du
        + O(σ^{4+α}).
The assumptions on the tails of the prior and the likelihood are used to control the remainder terms and to ensure they are O(σ^{4+α}).
Details
We cut the integral into two bits:
    ∫ h(σu) exp{ ℓ(θ0 + σu) − ℓ(θ0) } κ(u) du
      = ∫_{σ|u|≤ρ} h(σu) exp{ ℓ(θ0 + σu) − ℓ(θ0) } κ(u) du
      + ∫_{σ|u|>ρ} h(σu) exp{ ℓ(θ0 + σu) − ℓ(θ0) } κ(u) du.
The expansion stems from the first term, where σ|u| is small.
The second term ends up in the remainder in O(σ^{4+α}), using the assumptions.
Classic technique in Bayesian asymptotics theory, but here the likelihood
is fixed and the prior concentrates, instead of the other way around.
Details
To get the score from the expansion, choose
    h : u → u.
To get the observed information matrix from the expansion, choose
    h : u → u²,
and surprisingly (?) further assume that κ is mesokurtic, i.e.
    ∫ u⁴ κ(u) du = 3 ( ∫ u² κ(u) du )²
⇒ choose a Gaussian prior to obtain the observed information matrix.
Outline
1 Context
2 General results and connections
3 Posterior concentration when the prior concentrates
4 Hidden Markov models
Hidden Markov models
Figure : Graph representation of a general hidden Markov model (hidden states X0, X1, . . . , XT, observations y1, . . . , yT, parameter θ).
Hidden Markov models
Direct application of the previous results
1 Prior distribution N(θ0, σ²) on the parameter θ.
2 The derivative approximations involve E[θ | Y] and Cov[θ | Y].
3 Posterior moments for HMMs can be estimated by particle MCMC, SMC², ABC, or your favourite method (a rough sketch follows below).
Ionides et al. proposed another approach, more specific to HMMs.
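A rough sketch of this direct approach (not the authors' algorithm): draw parameters from the N(θ0, σ²) prior, weight them by particle-filter likelihood estimates as in the earlier pf_loglik sketch, and plug the resulting posterior moments into the score and observed-information approximations. All function and variable names are illustrative:

```python
import numpy as np

def derivative_estimates(pf_loglik, theta0, sigma2, n_draws, rng):
    # theta0: parameter vector of dimension d; pf_loglik(θ) returns a log-likelihood estimate.
    d = theta0.shape[0]
    thetas = theta0 + np.sqrt(sigma2) * rng.standard_normal((n_draws, d))
    logw = np.array([pf_loglik(t) for t in thetas])
    w = np.exp(logw - np.max(logw))
    w /= np.sum(w)                                   # self-normalized importance weights
    post_mean = w @ thetas
    centred = thetas - post_mean
    post_cov = (w[:, None] * centred).T @ centred
    score_estimate = (post_mean - theta0) / sigma2
    info_estimate = -(post_cov - sigma2 * np.eye(d)) / sigma2 ** 2
    return score_estimate, info_estimate
```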
Iterated Filtering
Modification of the model: θ is allowed to be different at each time.
The associated log-likelihood is
    ℓ̄(θ1:T) = log p(y1:T ; θ1:T)
             = log ∫_{X^{T+1}} ∏_{t=1}^{T} g(yt | xt, θt) µ(dx1 | θ1) ∏_{t=2}^{T} f(dxt | xt−1, θt).
Introducing θ → (θ, θ, . . . , θ) := θ^{[T]} ∈ R^T, we have
    ℓ̄(θ^{[T]}) = ℓ(θ),
and the chain rule yields
    dℓ(θ)/dθ = Σ_{t=1}^{T} ∂ℓ̄(θ^{[T]})/∂θt.
Iterated Filtering
Choice of prior on θ1:T :
    θ1 = θ0 + V1,   V1 ∼ τ⁻¹ κ{ τ⁻¹ (·) },
    θt+1 − θ0 = ρ (θt − θ0) + Vt+1,   Vt+1 ∼ σ⁻¹ κ{ σ⁻¹ (·) }.
Choose σ² such that τ² = σ²/(1 − ρ²). The covariance of the prior on θ1:T is then the T × T Toeplitz matrix
    ΣT = τ² [ ρ^{|s−t|} ]_{1≤s,t≤T},
with ones on the diagonal, ρ on the first off-diagonals, down to ρ^{T−1} in the corners.
Iterated Filtering
Applying the general results for this prior yields, with |x| = Σ_{t=1}^{T} |xt|,
    | ∇ℓ̄(θ0^{[T]}) − ΣT⁻¹ ( E[θ1:T | Y] − θ0^{[T]} ) | ≤ C τ².
Moreover we have
    | Σ_{t=1}^{T} ∂ℓ̄(θ^{[T]})/∂θt − Σ_{t=1}^{T} { ΣT⁻¹ ( E[θ1:T | Y] − θ0^{[T]} ) }t |
      ≤ Σ_{t=1}^{T} | ∂ℓ̄(θ^{[T]})/∂θt − { ΣT⁻¹ ( E[θ1:T | Y] − θ0^{[T]} ) }t |,
and
    dℓ(θ)/dθ = Σ_{t=1}^{T} ∂ℓ̄(θ^{[T]})/∂θt.
Iterated Filtering
The estimator of the score is thus given by
    Σ_{t=1}^{T} { ΣT⁻¹ ( E[θ1:T | Y] − θ0^{[T]} ) }t,
which can be reduced to
    Sτ,ρ,T(θ0) = [ τ⁻² / (1 + ρ) ] [ (1 − ρ) Σ_{t=2}^{T−1} E(θt | Y) − { (1 − ρ) T + 2ρ } θ0 + E(θ1 | Y) + E(θT | Y) ],
given the form of ΣT⁻¹. Note that in the quantities E(θt | Y), Y = Y1:T is the complete dataset, so those expectations are with respect to the smoothing distribution.
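A sketch of Sτ,ρ,T(θ0) as a function of the smoothing means E(θt | Y), which would themselves have to be estimated, e.g. by a particle smoother on the perturbed model (scalar parameter for simplicity):

```python
import numpy as np

def score_estimator(smoothing_means, theta0, tau2, rho):
    # smoothing_means[t-1] ≈ E[θ_t | Y] for t = 1, ..., T.
    m = np.asarray(smoothing_means, dtype=float)
    T = len(m)
    bracket = (1.0 - rho) * np.sum(m[1:T - 1])            # (1-ρ) Σ_{t=2}^{T-1} E(θ_t | Y)
    bracket += m[0] + m[T - 1]                            # E(θ_1 | Y) + E(θ_T | Y)
    bracket -= ((1.0 - rho) * T + 2.0 * rho) * theta0
    return bracket / (tau2 * (1.0 + rho))
```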
Iterated Filtering
If ρ = 1, then the parameters follow a random walk:
    θ1 = θ0 + N(0, τ²)  and  θt+1 = θt + N(0, σ²).
In this case Ionides et al. proposed the estimator
    Sτ,σ,T = τ⁻² ( E(θT | Y) − θ0 ),
as well as
    S^{(bis)}_{τ,σ,T} = Σ_{t=1}^{T} VP,t⁻¹ ( θ̃F,t − θ̃F,t−1 ),
with VP,t = Cov[θ̃t | y1:t−1] and θ̃F,t = E[θ̃t | y1:t].
Those expressions only involve expectations with respect to filtering distributions.
Iterated Filtering
If ρ = 0, then the parameters are i.i.d.:
    θ1 = θ0 + N(0, τ²)  and  θt+1 = θ0 + N(0, σ²).
In this case the expression of the score estimator reduces to
    Sτ,T = τ⁻² Σ_{t=1}^{T} ( E(θt | Y) − θ0 ),
which involves smoothing distributions.
There’s only one parameter, τ², to choose for the prior.
However, smoothing for general hidden Markov models is difficult, and typically resorts to “fixed-lag approximations”.
Iterated Smoothing
Only for the case ρ = 0 are we able to obtain simple expressions for the observed information matrix. We propose the following estimator:
    Iτ,T(θ0) = −τ⁻⁴ { Σ_{s=1}^{T} Σ_{t=1}^{T} Cov(θs, θt | Y) − τ² T },
for which we can show that
    | Iτ,T − (−∇²ℓ(θ0)) | ≤ C τ².
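A sketch of Iτ,T(θ0) given an estimated T × T matrix of pairwise smoothing covariances Cov(θs, θt | Y) (scalar parameter):

```python
import numpy as np

def information_estimator(smoothing_cov, tau2):
    # smoothing_cov[s-1, t-1] ≈ Cov(θ_s, θ_t | Y); the estimator sums all pairwise terms.
    C = np.asarray(smoothing_cov, dtype=float)
    T = C.shape[0]
    return -(np.sum(C) - tau2 * T) / tau2 ** 2
```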
Numerical results
Linear Gaussian state space model where the ground truth is available
through the Kalman filter.
X0 ∼ N(0, 1) and Xt = ρ Xt−1 + N(0, V), Yt = η Xt + N(0, W).
The parameters are (ρ, V, η, W). We generate T = 100 observations.
Easy set of parameters: ρ = 0.8, V = 0.7², η = 0.9, W = 1².
Hard set of parameters: ρ = 0.8, V = 1², η = 0.9, W = 0.1².
The gradient being four-dimensional, we plot only the first component
estimated over time.
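A sketch of how the ground truth for this model can be obtained: a scalar-state Kalman filter for the exact log-likelihood, with a reference score computed here by central finite differences of that exact log-likelihood (the step size is an arbitrary illustrative choice):

```python
import numpy as np

def kalman_loglik(y, rho, V, eta, W):
    # Kalman filter for X_t = ρ X_{t-1} + N(0,V), Y_t = η X_t + N(0,W), X_0 ~ N(0,1).
    m, P, ll = 0.0, 1.0, 0.0
    for yt in y:
        m_pred, P_pred = rho * m, rho ** 2 * P + V
        S = eta ** 2 * P_pred + W                    # predictive variance of y_t
        ll += -0.5 * (np.log(2 * np.pi * S) + (yt - eta * m_pred) ** 2 / S)
        K = P_pred * eta / S                         # Kalman gain
        m = m_pred + K * (yt - eta * m_pred)
        P = (1.0 - K * eta) * P_pred
    return ll

def reference_score(y, theta, h=1e-5):
    # Central finite differences of the exact log-likelihood in (ρ, V, η, W).
    grad = np.zeros(len(theta))
    for i in range(len(theta)):
        up, down = np.array(theta, dtype=float), np.array(theta, dtype=float)
        up[i] += h
        down[i] -= h
        grad[i] = (kalman_loglik(y, *up) - kalman_loglik(y, *down)) / (2 * h)
    return grad
```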
Numerical results
Figure : 250 runs for Iterated Smoothing (IS, left panel) and Finite Difference (FD, right panel), “easy” scenario; estimates plotted against time.
Numerical results
Figure : 250 runs for Iterated Smoothing (IS, left panel) and Finite Difference (FD, right panel), “hard” scenario; estimates plotted against time.
Bibliography
Main references:
Inference for nonlinear dynamical systems, Ionides, Breto, King,
PNAS, 2006.
Iterated filtering, Ionides, Bhadra, Atchadé, King, Annals of
Statistics, 2011.
Efficient iterated filtering, Lindström, Ionides, Frydendall, Madsen,
16th IFAC Symposium on System Identification.
Derivative-Free Estimation of the Score Vector
and Observed Information Matrix,
Doucet, Jacob, Rubenthaler, 2013 (on arXiv).
