Estimation of the score vector and observed
information matrix in intractable models
Arnaud Doucet (University of Oxford)
Pierre E. Jacob (University of Oxford)
Sylvain Rubenthaler (Université Nice Sophia Antipolis)
April 15th, 2015
Outline
1 Context
2 General results
3 Monte Carlo
4 Hidden Markov models
Motivation
Derivatives of the likelihood help with optimization and sampling.
For many models they are not available.
One can resort to approximation techniques.
Motivation
Let ℓ(θ) denote a “log-likelihood”, though it could be any function.
Let L(θ) = exp ℓ(θ), the “likelihood”.
Assume that we have access to estimators L̂(θ) of L(θ) such that
E[L̂(θ)] = L(θ) and V[L̂(θ)] = L(θ)² v(θ) / M.
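As a concrete illustration (not from the original slides), such an unbiased estimator arises when L(θ) is an integral over a latent variable and is estimated by averaging M Monte Carlo terms; the variance then decreases like 1/M as assumed above. A minimal Python sketch with a toy model and illustrative names:

import numpy as np

def likelihood_estimator(theta, y, M, rng):
    # Unbiased estimator of L(theta) = E_X[ p(y | X, theta) ] for a toy
    # latent-variable model: X ~ N(0, 1) and y | X, theta ~ N(theta + X, 1).
    # Averaging M i.i.d. terms gives E[L_hat] = L(theta) and V[L_hat] = O(1/M).
    x = rng.normal(0.0, 1.0, size=M)
    terms = np.exp(-0.5 * (y - theta - x) ** 2) / np.sqrt(2.0 * np.pi)
    return terms.mean()

rng = np.random.default_rng(1)
print(likelihood_estimator(0.0, y=0.3, M=10_000, rng=rng))
# For comparison: marginally y ~ N(theta, 2), so L(0) is the N(0, 2) density at 0.3.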
Finite difference
First derivative:
ℓ̂^(1)(θ⋆) = [ log L̂(θ⋆ + h) − log L̂(θ⋆ − h) ] / (2h),
which converges to ∇ℓ(θ⋆) when M → ∞ and h → 0.
Second derivative:
ℓ̂^(2)(θ⋆) = [ log L̂(θ⋆ + h) − 2 log L̂(θ⋆) + log L̂(θ⋆ − h) ] / h²,
which converges to ∇²ℓ(θ⋆) when M → ∞ and h → 0.
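A minimal sketch of these finite-difference estimators applied to noisy log-likelihood evaluations; the toy noisy log-likelihood below (same latent-variable model as in the previous sketch, observation fixed at 0.3) is an assumption made for illustration:

import numpy as np

rng = np.random.default_rng(2)

def loglik_hat(theta, M):
    # log of an unbiased likelihood estimator: X ~ N(0, 1), y | X, theta ~ N(theta + X, 1), y = 0.3
    x = rng.normal(0.0, 1.0, size=M)
    return np.log(np.mean(np.exp(-0.5 * (0.3 - theta - x) ** 2) / np.sqrt(2.0 * np.pi)))

def fd_score(theta_star, h, M):
    # central finite difference for the first derivative, from independent noisy evaluations
    return (loglik_hat(theta_star + h, M) - loglik_hat(theta_star - h, M)) / (2.0 * h)

def fd_hessian(theta_star, h, M):
    # second-order central difference for the second derivative
    return (loglik_hat(theta_star + h, M) - 2.0 * loglik_hat(theta_star, M)
            + loglik_hat(theta_star - h, M)) / h ** 2

M = 10_000
print(fd_score(0.0, h=M ** (-1.0 / 6.0), M=M))    # exact score here is (0.3 - 0)/2 = 0.15
print(fd_hessian(0.0, h=M ** (-1.0 / 8.0), M=M))  # exact second derivative here is -1/2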
Finite difference
Optimal rate of convergence for the first derivative:
h ∼ M^{−1/6}, leading to MSE ∼ M^{−2/3}.
For the second derivative:
h ∼ M^{−1/8}, leading to MSE ∼ M^{−1/2}.
Outline
1 Context
2 General results
3 Monte Carlo
4 Hidden Markov models
Iterated Filtering
Given a log-likelihood ℓ and a point θ⋆, consider a prior θ ∼ N(θ⋆, τ²).
Posterior expectation when the prior variance goes to zero
First-order moments give first-order derivatives:
|τ⁻² (E[θ|Y ] − θ⋆) − ∇ℓ(θ⋆)| ≤ Cτ.
Phrased simply,
(posterior mean − prior mean) / prior variance ≈ score.
Result from Ionides, Bhadra, Atchadé, King, Iterated filtering,
2011.
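For a Gaussian toy likelihood the posterior is available in closed form, so this identity can be checked exactly; a small sketch (the toy model and all names are chosen purely for illustration):

import numpy as np

# Toy model: one observation y ~ N(theta, sigma2), so the exact score at
# theta_star is (y - theta_star) / sigma2, and the posterior under the
# prior N(theta_star, tau2) is Gaussian in closed form.
y, sigma2, theta_star = 1.3, 0.5, 0.2
exact_score = (y - theta_star) / sigma2

for tau2 in [1e-1, 1e-2, 1e-3]:
    post_mean = (tau2 * y + sigma2 * theta_star) / (tau2 + sigma2)
    approx = (post_mean - theta_star) / tau2   # (posterior mean - prior mean) / prior variance
    print(tau2, approx, exact_score)           # approx -> exact_score as tau2 -> 0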
Stein’s lemma
Stein’s lemma states that
θ ∼ N(θ⋆, τ²)
if and only if for any function g such that E[|∇g(θ)|] < ∞,
E[(θ − θ⋆) g(θ)] = τ² E[∇g(θ)].
If we choose the function g : θ → exp ℓ(θ) / Z with
Z = E[exp ℓ(θ)] and apply Stein’s lemma, we obtain
(1/Z) E[θ exp ℓ(θ)] − θ⋆ = (τ²/Z) E[∇ℓ(θ) exp(ℓ(θ))]
⇔ τ⁻² (E[θ | Y ] − θ⋆) = E[∇ℓ(θ) | Y ].
Notation: E[φ(θ) | Y ] := E[φ(θ) exp ℓ(θ)] / Z.
Stein’s lemma
For the second derivative, we consider
h : θ → (θ − θ⋆) exp ℓ(θ) / Z.
Then
E[(θ − θ⋆)² | Y ] = τ² + τ⁴ E[∇²ℓ(θ) + ∇ℓ(θ)² | Y ].
Adding and subtracting terms also yields
τ⁻⁴ ( V[θ | Y ] − τ² ) = E[∇²ℓ(θ) | Y ] + { E[∇ℓ(θ)² | Y ] − (E[∇ℓ(θ) | Y ])² }.
. . . but what we really want is
∇ℓ(θ⋆), ∇²ℓ(θ⋆)
and not
E[∇ℓ(θ) | Y ], E[∇²ℓ(θ) | Y ].
Core Idea
The prior is a normal distribution N(θ⋆, τ²).
The prior moments behave like:
Eτ[φ(Θ)] = φ(θ⋆) + (τ²/2) ∇²φ(θ⋆) + O(τ⁴).
The posterior moments behave like:
Eτ[φ(Θ) | Y ] = φ(θ⋆) + (τ²/2) ( ∇²φ(θ⋆) + 2 ∇φ(θ⋆) ∇ℓ(θ⋆) ) + O(τ⁴).
Our arXived proof suffers from an overdose of Taylor
expansions.
Proof: prior moments
Let φ : R → R be a four times continuously differentiable
function. Assume that there exist a constant M < ∞ and a
δ > 0 such that |d⁴φ(θ)/dθ⁴| ≤ M for all θ ∈ B(θ⋆, δ).
Cut the expectation into two parts:
Eτ[φ(Θ)] = ∫_{B(θ⋆,δ)} φ(θ) pτ(θ) dθ + ∫_{Bᶜ(θ⋆,δ)} φ(θ) pτ(θ) dθ.
The second term is o(τ^k) for any power k.
Proof: prior moments
Let’s deal with the first term:
∫_{B(θ⋆,δ)} φ(θ) pτ(dθ).
Taylor expansion: for all θ ∈ B(θ⋆, δ),
φ(θ) = φ(θ⋆) + Σ_{k=1,2,3} (1/k!) (dᵏφ(θ⋆)/dθᵏ) (θ − θ⋆)^k + R₃(θ, θ⋆).
The Gaussian prior integrates any (θ − θ⋆)^k with odd k to zero
over B(θ⋆, δ), so
∫_{B(θ⋆,δ)} φ(θ) pτ(dθ) = φ(θ⋆) + (d²φ(θ⋆)/dθ²) ∫_{B(θ⋆,δ)} (1/2) (θ − θ⋆)² pτ(dθ)
+ ∫_{B(θ⋆,δ)} R₃(θ, θ⋆) pτ(dθ).
Proof: prior moments
For
(d²φ(θ⋆)/dθ²) ∫_{B(θ⋆,δ)} (1/2) (θ − θ⋆)² pτ(dθ)
we “complete the integral” over all of R and subtract an integral
over Bᶜ(θ⋆, δ), which is o(τ^k) for any k. We are left with
(d²φ(θ⋆)/dθ²) (τ²/2) + o(τ^k) for any k.
For
∫_{B(θ⋆,δ)} R₃(θ, θ⋆) pτ(dθ),
we use the assumption to say that for all θ there is a θ̃ in B(θ⋆, δ)
such that
R₃(θ, θ⋆) = (d⁴φ(θ̃)/dθ⁴) (θ − θ⋆)⁴ / 4!.
Proof: prior moments
Since d⁴φ(θ̃)/dθ⁴ is upper bounded by some M by assumption:
| ∫_{B(θ⋆,δ)} R₃(θ, θ⋆) pτ(dθ) |
≤ M ∫_{B(θ⋆,δ)} ((θ − θ⋆)⁴ / 4!) pτ(dθ)
≤ M ∫_R ((θ − θ⋆)⁴ / 4!) pτ(dθ)
= τ⁴ × C.
Combining all the terms, we obtain
Eτ[φ(Θ)] = φ(θ⋆) + (τ²/2) ∇²φ(θ⋆) + O(τ⁴).
Proof: posterior moments
We want to obtain the posterior moments:
Eτ[φ(Θ) | Y ] = φ(θ⋆) + (τ²/2) ( ∇²φ(θ⋆) + 2 ∇φ(θ⋆) ∇ℓ(θ⋆) ) + O(τ⁴).
We write:
Eτ[φ(Θ) | Y ] = Eτ[φ(Θ) L(Θ)] / Eτ[L(Θ)].
Then we apply the prior moment expansion to φ × L and to
L, and we simplify the ratio of the two expansions.
We need to assume that the likelihood is four times continuously
differentiable, with bounded fourth derivatives around θ⋆.
Main results
In general, with a prior N(θ⋆, τ²Σ), when Σ is fixed and τ goes
to zero,
τ⁻² Σ⁻¹ Epost[(Θ − θ⋆)] = ∇ℓ(θ⋆) + O(τ²),
τ⁻⁴ Σ⁻¹ ( Vpost[Θ] − τ²Σ ) Σ⁻¹ = ∇²ℓ(θ⋆) + O(τ²).
Extension of Iterated Filtering
Posterior variance when the prior variance goes to zero
Second-order moments give second-order derivatives:
|τ⁻⁴ ( Cov[θ|Y ] − τ² ) − ∇²ℓ(θ⋆)| ≤ Cτ².
Phrased simply,
(posterior variance − prior variance) / prior variance² ≈ hessian.
Result from Doucet, Jacob, Rubenthaler on arXiv, 2013.
Proximity mapping
Given a real function f and a point θ⋆, consider for any τ² > 0
θ → f(θ) exp{ −(θ − θ⋆)² / (2τ²) }.
Figure : Example for f : θ → exp(−|θ|) and three values of τ².
Proximity mapping
Proximity mapping
The τ²-proximity mapping is defined by
proxf : θ0 → argmax_{θ∈R} f(θ) exp{ −(θ − θ0)² / (2τ²) }.
Moreau approximation
The τ²-Moreau approximation is defined by
fτ² : θ0 → C sup_{θ∈R} f(θ) exp{ −(θ − θ0)² / (2τ²) },
where C is a normalizing constant.
Proximity mapping
Figure : θ → f(θ) and θ → fτ²(θ) for three values of τ².
Proximity mapping
Property
Those objects are such that
( proxf(θ0) − θ0 ) / τ² = ∇ log fτ²(θ0) −→ ∇ log f(θ0) as τ² → 0.
Moreau (1962), Fonctions convexes duales et points proximaux
dans un espace Hilbertien.
Pereyra (2013), Proximal Markov chain Monte Carlo
algorithms.
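A numerical sketch of this property for the example f : θ → exp(−|θ|) from the figure, using a crude grid search; the grid, the point θ0 and all names are assumptions made for this illustration:

import numpy as np

def prox_and_moreau(theta0, tau2, f, grid):
    # grid approximations of the tau^2-proximity mapping and Moreau approximation of f
    vals = f(grid) * np.exp(-(grid - theta0) ** 2 / (2.0 * tau2))
    return grid[np.argmax(vals)], vals.max()

f = lambda t: np.exp(-np.abs(t))          # example from the figure; grad log f = -1 for theta > 0
grid = np.linspace(-5.0, 5.0, 200001)
theta0 = 1.5
for tau2 in [1.0, 0.1, 0.01]:
    prox, _ = prox_and_moreau(theta0, tau2, f, grid)
    print(tau2, (prox - theta0) / tau2)   # -> grad log f(theta0) = -1 as tau2 -> 0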
Proximity mapping
Bayesian interpretation
If f is seen as a likelihood function, then
θ → f(θ) exp{ −(θ − θ0)² / (2τ²) }
is an unnormalized posterior density function based on a
Normal prior with mean θ0 and variance τ².
Hence
( proxf(θ0) − θ0 ) / τ² −→ ∇ log f(θ0) as τ² → 0
can be read as
(posterior mode − prior mode) / prior variance ≈ score.
Outline
1 Context
2 General results
3 Monte Carlo
4 Hidden Markov models
Moment shift estimator
How to estimate
S(θ⋆) = τ⁻² Σ⁻¹ Epost[(Θ − θ⋆)] = ∇ℓ(θ⋆) + O(τ²) ?
Importance Sampling estimator from the prior:
S_N(θ⋆) = τ⁻² Σ⁻¹ ( Σ_{i=1}^N Ŵi θi − θ⋆ ),
where
Ŵi = L̂(θi) / Σ_{j=1}^N L̂(θj).
It turns out to be better to use
S_N(θ⋆) = τ⁻² Σ⁻¹ ( Σ_{i=1}^N Ŵi θi − (1/N) Σ_{i=1}^N θi ).
Any idea why?
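A one-dimensional sketch of the recentred estimator, with the weights written as self-normalised; L_hat stands for any unbiased likelihood estimator (e.g. the toy one sketched earlier), and the function and argument names are illustrative rather than the authors' code:

import numpy as np

def score_estimator(theta_star, tau, L_hat, N, rng, Sigma=1.0):
    # moment-shift estimator of the score: draw theta_i ~ N(theta_star, tau^2 Sigma),
    # weight by (possibly noisy) likelihood estimates, compare weighted and unweighted means
    thetas = rng.normal(theta_star, tau * np.sqrt(Sigma), size=N)
    L_vals = np.array([L_hat(t) for t in thetas])
    W = L_vals / L_vals.sum()                         # self-normalised weights
    return (W @ thetas - thetas.mean()) / (tau ** 2 * Sigma)

# possible usage, reusing the earlier toy likelihood_estimator:
# rng = np.random.default_rng(3)
# print(score_estimator(0.0, tau=0.1,
#                       L_hat=lambda t: likelihood_estimator(t, 0.3, 1000, rng),
#                       N=1000, rng=rng))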
Moment shift estimator
For the second-order derivative, we want to estimate:
τ⁻⁴ Σ⁻¹ ( Vpost[Θ] − τ²Σ ) Σ⁻¹ = ∇²ℓ(θ⋆) + O(τ²).
We propose:
τ⁻⁴ Σ⁻¹ [ Σ_{i=1}^N Ŵi ( θi − Σ_{j=1}^N Ŵj θj )² − (1/N) Σ_{i=1}^N ( θi − (1/N) Σ_{j=1}^N θj )² ] Σ⁻¹.
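A matching one-dimensional sketch for the second-order estimator, again with self-normalised weights (illustrative, under the same assumptions as the previous sketch):

import numpy as np

def information_estimator(theta_star, tau, L_hat, N, rng, Sigma=1.0):
    # moment-shift estimator of the second derivative:
    # weighted variance of the theta_i minus their unweighted sample variance, rescaled
    thetas = rng.normal(theta_star, tau * np.sqrt(Sigma), size=N)
    L_vals = np.array([L_hat(t) for t in thetas])
    W = L_vals / L_vals.sum()
    weighted_var = W @ (thetas - W @ thetas) ** 2           # weighted variance
    sample_var = np.mean((thetas - thetas.mean()) ** 2)     # unweighted sample variance
    return (weighted_var - sample_var) / (tau ** 4 * Sigma ** 2)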
Moment shift estimator
We retrieve the same rates of convergence as in finite difference,
e.g. if τ ∼ N^{−1/6}, the MSE is in N^{−2/3}.
Then why bother? Why has better performance been observed
in practice in some scenarios?
Different behaviour in the non-asymptotic regime.
For any function υ and tuning parameter M,
V( S_N(θ⋆) ) ≤ τ⁻² C_N,
where C_N depends neither on υ nor on M.
Outline
1 Context
2 General results
3 Monte Carlo
4 Hidden Markov models
Hidden Markov models
Figure : Graph representation of a general hidden Markov model, with states X0, X1, . . . , XT, observations y1, . . . , yT and parameter θ.
Hidden Markov models
Direct application of the previous results
1 Prior distribution N(θ0, σ²) on the parameter θ.
2 The derivative approximations involve E[θ|Y ] and Cov[θ|Y ].
3 Posterior moments for HMMs can be estimated by particle MCMC, SMC², ABC, or your favourite method.
Ionides et al. proposed another approach.
Iterated Filtering
Modification of the model: θ is time-varying.
The associated log-likelihood is
ℓ̄(θ1:T) = log p(y1:T ; θ1:T)
= log ∫_{X^{T+1}} Π_{t=1}^T g(yt | xt, θt) µ(dx1 | θ1) Π_{t=2}^T f(dxt | xt−1, θt).
Introducing θ → (θ, θ, . . . , θ) := θ^[T] ∈ R^T, we have
ℓ̄(θ^[T]) = ℓ(θ)
and the chain rule yields
dℓ(θ)/dθ = Σ_{t=1}^T ∂ℓ̄(θ^[T]) / ∂θt.
Iterated Filtering
Choice of prior on θ1:T :
θ1 = θ0 + V1, V1 ∼ τ⁻¹ κ{τ⁻¹(·)}
θt+1 − θ0 = ρ (θt − θ0) + Vt+1, Vt+1 ∼ σ⁻¹ κ{σ⁻¹(·)}
Choose σ² such that τ² = σ²/(1 − ρ²). Covariance of the prior
on θ1:T :
Σ_T = τ² [ ρ^{|s−t|} ]_{s,t=1,...,T},
i.e. the T × T Toeplitz matrix with 1 on the diagonal, ρ on the first
off-diagonals, and so on down to ρ^{T−1} in the corners.
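This prior covariance is simply a stationary AR(1) covariance matrix; a small helper to build it (the function name is illustrative):

import numpy as np

def ar1_prior_cov(T, tau2, rho):
    # stationary AR(1) covariance: entry (s, t) equals tau^2 * rho^|s - t|
    idx = np.arange(T)
    return tau2 * rho ** np.abs(idx[:, None] - idx[None, :])

print(ar1_prior_cov(4, tau2=0.25, rho=0.9))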
Iterated Filtering
Applying the general results for this prior yields, with
|x| = Σ_{t=1}^T |xt| :
| ∇ℓ̄(θ0^[T]) − Σ_T⁻¹ ( E[θ1:T | Y ] − θ0^[T] ) | ≤ Cτ².
Moreover we have
| Σ_{t=1}^T ∂ℓ̄(θ^[T])/∂θt − Σ_{t=1}^T { Σ_T⁻¹ ( E[θ1:T | Y ] − θ0^[T] ) }t |
≤ Σ_{t=1}^T | ∂ℓ̄(θ^[T])/∂θt − { Σ_T⁻¹ ( E[θ1:T | Y ] − θ0^[T] ) }t |
and
dℓ(θ)/dθ = Σ_{t=1}^T ∂ℓ̄(θ^[T]) / ∂θt.
Iterated Filtering
The estimator of the score is thus given by
Σ_{t=1}^T { Σ_T⁻¹ ( E[θ1:T | Y ] − θ0^[T] ) }t,
which can be reduced to
Sτ,ρ,T(θ0) = τ⁻²/(1 + ρ) [ (1 − ρ) Σ_{t=2}^{T−1} E(θt | Y ) − {(1 − ρ) T + 2ρ} θ0 + E(θ1 | Y ) + E(θT | Y ) ],
given the form of Σ_T⁻¹. Note that in the quantities E(θt | Y ),
Y = Y1:T is the complete dataset, thus those expectations are
with respect to the smoothing distribution.
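A sketch of this reduced estimator, taking the smoothing means E(θt | Y ) as inputs; how they are computed (e.g. by a particle smoother) is left out, and the function name is illustrative:

import numpy as np

def score_estimate(theta0, tau2, rho, smoothed_means):
    # smoothed_means[t-1] holds E(theta_t | Y) for t = 1, ..., T
    m = np.asarray(smoothed_means, dtype=float)
    T = len(m)
    inner = ((1.0 - rho) * m[1:T - 1].sum()
             - ((1.0 - rho) * T + 2.0 * rho) * theta0
             + m[0] + m[-1])
    return inner / (tau2 * (1.0 + rho))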
Iterated Filtering
If ρ = 1, then the parameters follow a random walk:
θ1 = θ0 + N(0, τ²) and θt+1 = θt + N(0, σ²).
In this case Ionides et al. proposed the estimator
Sτ,σ,T = τ⁻² ( E(θT | Y ) − θ0 )
as well as
S(bis)τ,σ,T = Σ_{t=1}^T V_{P,t}⁻¹ ( θ̃F,t − θ̃F,t−1 ),
with V_{P,t} = Cov[θ̃t | y1:t−1] and θ̃F,t = E[θt | y1:t].
Those expressions only involve expectations with respect to
filtering distributions.
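A sketch of these two estimators given filtering outputs; note that taking θ̃F,0 = θ0 in the second one is an assumption of this sketch, not something stated on the slide, and the names are illustrative:

import numpy as np

def score_rw_smoothing(theta0, tau2, smoothed_last):
    # S_{tau, sigma, T}: only the smoothing mean E(theta_T | Y) is needed
    return (smoothed_last - theta0) / tau2

def score_rw_filtering(theta0, pred_vars, filt_means):
    # S^(bis)_{tau, sigma, T}: filtering means theta_F,t = E[theta_t | y_{1:t}] and
    # one-step predictive variances V_{P,t}; theta_F,0 is set to theta0 (assumption)
    m = np.concatenate(([theta0], np.asarray(filt_means, dtype=float)))
    return np.sum((m[1:] - m[:-1]) / np.asarray(pred_vars, dtype=float))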
Iterated Filtering
If ρ = 0, then the parameters are i.i.d.:
θ1 = θ0 + N(0, τ²) and θt+1 = θ0 + N(0, τ²).
In this case the expression of the score estimator reduces to
Sτ,T = τ⁻² Σ_{t=1}^T ( E(θt | Y ) − θ0 ),
which involves smoothing distributions.
There is only one parameter, τ², to choose for the prior.
However, smoothing for general hidden Markov models is
difficult, and typically resorts to “fixed-lag approximations”.
Numerical results
Linear Gaussian state space model where the ground truth is
available through the Kalman filter.
X0 ∼ N(0, 1) and Xt = ρXt−1 + N(0, V )
Yt = ηXt + N(0, W ).
Generate T = 100 observations and set
ρ = 0.9, V = 0.7, η = 0.9 and W = 0.1, 0.2, 0.4, 0.9.
240 independent runs, matching the computational costs
between methods in terms of number of calls to the transition
kernel.
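A sketch of the data-generating process used in these experiments (the Kalman-filter ground truth and the matched computational budgets are not reproduced here):

import numpy as np

def simulate_lgssm(T, rho, V, eta, W, rng):
    # X_0 ~ N(0, 1), X_t = rho X_{t-1} + N(0, V), Y_t = eta X_t + N(0, W)
    x = np.empty(T + 1)
    y = np.empty(T)
    x[0] = rng.normal(0.0, 1.0)
    for t in range(1, T + 1):
        x[t] = rho * x[t - 1] + rng.normal(0.0, np.sqrt(V))
        y[t - 1] = eta * x[t] + rng.normal(0.0, np.sqrt(W))
    return x, y

rng = np.random.default_rng(2015)
states, observations = simulate_lgssm(T=100, rho=0.9, V=0.7, eta=0.9, W=0.2)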
Numerical results
[Two panels of RMSE curves (log scale, 10 to 10000): RMSE against h for Finite Difference and RMSE against tau for Iterated Smoothing, one curve per parameter 1–4.]
Figure : 240 runs for Iterated Smoothing and Finite Difference.
Numerical results
[Three panels of RMSE curves (log scale) against tau: Iterated Smoothing, Iterated Filtering 1 and Iterated Filtering 2, one curve per parameter 1–4.]
Figure : 240 runs for Iterated Smoothing and Iterated Filtering.
Bibliography
Main references:
Inference for nonlinear dynamical systems, Ionides, Breto,
King, PNAS, 2006.
Iterated filtering, Ionides, Bhadra, Atchadé, King, Annals
of Statistics, 2011.
Efficient iterated filtering, Lindström, Ionides, Frydendall,
Madsen, 16th IFAC Symposium on System Identification.
Derivative-Free Estimation of the Score Vector
and Observed Information Matrix,
Doucet, Jacob, Rubenthaler, 2013 (on arXiv).