Estimation of the score vector and observed
information matrix in intractable models
Arnaud Doucet (University of Oxford)
Pierre E. Jacob (University of Oxford)
Sylvain Rubenthaler (Université Nice Sophia Antipolis)
June 15th, 2015
Outline
1 Context: intractable derivatives
2 Shift estimators
3 Monte Carlo shift estimators
Motivation
Derivatives of the likelihood help with optimization and sampling.
They become crucial when the parameter space is
high-dimensional.
For many models, these derivatives are intractable.
One can resort to approximation techniques.
Motivation: hidden Markov models
Figure: graph representation of a general hidden Markov model, with hidden states X0, X1, ..., XT, observations y1, ..., yT, and parameter θ.
Motivation: hidden Markov models
The associated log-likelihood is
$$\ell(\theta) = \log p(y_{1:T}; \theta) = \log \int_{\mathsf{X}^{T+1}} \prod_{t=1}^{T} g(y_t \mid x_t, \theta)\, \mu(dx_1 \mid \theta) \prod_{t=2}^{T} f(dx_t \mid x_{t-1}, \theta).$$
This high-dimensional integral can be approximated
efficiently using particle filters.
However, the approximation itself is “discontinuous” as a
function of the parameter θ, even when the random “seed” is
fixed, as the next two figures illustrate.
Motivation: hidden Markov models
Figure: particle filter approximations of the log-likelihood as a function of θ ∈ [0.5, 0.6], using a different seed for each θ, compared to the true log-likelihood.
Motivation: hidden Markov models
Figure: particle filter approximations of the log-likelihood as a function of θ ∈ [0.5, 0.6], using the same seed for each θ, compared to the true log-likelihood.
Motivation: general context
Let ℓ(θ) denote a “log-likelihood”, although it could be any
function.
Let L(θ) = exp ℓ(θ) denote the “likelihood”.
Assume that we have access to estimators L̂(θ) of L(θ),
computed with M Monte Carlo samples, such that
$$\mathbb{E}[\hat{L}(\theta)] = L(\theta) \quad \text{and} \quad \mathbb{V}\!\left[\frac{\hat{L}(\theta)}{L(\theta)}\right] = \frac{v(\theta)}{M} \approx \frac{v}{M} \quad (\forall \theta).$$
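As a concrete illustration (not from the slides), here is a minimal sketch of such an estimator, for a toy model X ∼ N(θ, 1), Y | X = x ∼ N(x, 1), where L(θ) = N(y; θ, 2) is known exactly; the model, sample sizes, and all names are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

def likelihood_estimator(theta, y, M, rng):
    """Unbiased estimator of L(theta) = N(y; theta, 2) for the toy model
    X ~ N(theta, 1), Y | X = x ~ N(x, 1), obtained by averaging g(y | X_i)
    over M prior draws X_i. Its relative variance behaves like v / M."""
    x = rng.normal(theta, 1.0, size=M)
    return norm.pdf(y, loc=x, scale=1.0).mean()

rng = np.random.default_rng(0)
y = 0.3
exact = norm.pdf(y, loc=0.0, scale=np.sqrt(2.0))
estimates = np.array([likelihood_estimator(0.0, y, M=100, rng=rng) for _ in range(2000)])
print(estimates.mean(), exact)            # unbiasedness, up to Monte Carlo error
print(estimates.var() / exact**2)         # relative variance, roughly v / M
```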
Finite difference
First derivative:
$$\widehat{\nabla \ell}(\theta^\star) = \frac{\log \hat{L}(\theta^\star + h) - \log \hat{L}(\theta^\star - h)}{2h}$$
converges to ∇ℓ(θ⋆) when M → ∞ and h → 0.
Second derivative:
$$\widehat{\nabla^2 \ell}(\theta^\star) = \frac{\log \hat{L}(\theta^\star + h) - 2 \log \hat{L}(\theta^\star) + \log \hat{L}(\theta^\star - h)}{h^2}$$
converges to ∇²ℓ(θ⋆) when M → ∞ and h → 0.
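A minimal sketch of these estimators (not from the slides), built on the toy Gaussian model of the previous sketch; the values of M, h and y are illustrative choices:

```python
import numpy as np
from scipy.stats import norm

def noisy_loglik(theta, y=0.3, M=100_000, rng=np.random.default_rng(1)):
    """Log of an unbiased likelihood estimator for the toy model above."""
    x = rng.normal(theta, 1.0, size=M)
    return np.log(norm.pdf(y, loc=x, scale=1.0).mean())

def finite_difference(theta, h, loglik=noisy_loglik):
    """Central finite-difference estimators of the first and second
    derivatives of the log-likelihood, from noisy evaluations."""
    lp, l0, lm = loglik(theta + h), loglik(theta), loglik(theta - h)
    return (lp - lm) / (2 * h), (lp - 2 * l0 + lm) / h**2

# Exact values at theta = 0: score = y/2 = 0.15, second derivative = -0.5.
print(finite_difference(0.0, h=0.2))  # close to (0.15, -0.5), up to noise
```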
Finite difference
Bias:
$$\mathbb{E}\big[\widehat{\nabla \ell}(\theta^\star)\big] - \nabla \ell(\theta^\star) = O(h^2).$$
Variance:
$$\mathbb{V}\big[\widehat{\nabla \ell}(\theta^\star)\big] = O\!\left(\frac{1}{M h^2}\right).$$
Optimal rate of convergence for the first derivative:
h ∼ M^{−1/6}, leading to MSE ∼ M^{−2/3}.
For the second derivative:
h ∼ M^{−1/8}, leading to MSE ∼ M^{−1/2}.
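These rates follow from balancing squared bias against variance; the variance order O(1/(Mh⁴)) used below for the second derivative is not stated on the slide, but is the standard order for central second differences of noisy evaluations:
$$\mathrm{MSE} = \underbrace{O(h^4)}_{\mathrm{bias}^2} + \underbrace{O\!\left(\frac{1}{M h^2}\right)}_{\mathrm{variance}} \;\Longrightarrow\; h^6 \propto M^{-1}, \text{ i.e. } h \sim M^{-1/6} \text{ and } \mathrm{MSE} \sim M^{-2/3};$$
for the second derivative, MSE = O(h⁴) + O(1/(Mh⁴)) gives h ∼ M^{−1/8} and MSE ∼ M^{−1/2}.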
Outline
1 Context: intractable derivatives
2 Shift estimators
3 Monte Carlo shift estimators
Iterated Filtering
Ionides, Bhadra, Atchadé, King, Iterated filtering, AoS, 2011.
Given a log-likelihood ℓ and a given point θ⋆, consider a
“prior” distribution centered on θ⋆, for instance Θ ∼ N(θ⋆, τ²Σ).
We can consider “posterior” expectations, when they exist:
$$\mathbb{E}_{\mathrm{post}}[\varphi(\Theta)] = \frac{\mathbb{E}[\varphi(\Theta) \exp \ell(\Theta)]}{\mathbb{E}[\exp \ell(\Theta)]},$$
where the expectations E are with respect to the “prior”.
Iterated Filtering
Under conditions on the prior distribution, and on the
log-likelihood, one gets the following result.
Posterior expectation when the prior variance goes to zero:
For any θ⋆, there exists a C such that, for τ small enough,
$$\left| \tau^{-2} \Sigma^{-1} \left( \mathbb{E}_{\mathrm{post}}[\Theta] - \theta^\star \right) - \nabla \ell(\theta^\star) \right| \le C \tau.$$
Phrased simply,
$$\frac{\text{posterior mean} - \text{prior mean}}{\text{prior variance}} \approx \text{score}.$$
Result from Ionides, Bhadra, Atchadé, King, Iterated filtering, 2011.
Normal priors
Let us focus on the normal case: Θ ∼ N(θ⋆, τ²Σ).
Stein’s Lemma says...
Assume that E[|∇_i L(Θ)|] < ∞ and E[|∇²_{ij} L(Θ)|] < ∞ for
i, j ∈ {1, . . . , d}. Then we have
$$\mathbb{E}_{\mathrm{post}}[\Theta - \theta^\star] = \tau^2 \Sigma\, \mathbb{E}_{\mathrm{post}}\!\left[\nabla \ell(\Theta)\right],$$
and
$$\mathbb{E}_{\mathrm{post}}\!\left[(\Theta - \theta^\star)(\Theta - \theta^\star)^T\right] = \tau^2 \Sigma + \tau^4\, \Sigma\, \mathbb{E}_{\mathrm{post}}\!\left[\nabla^2 \ell(\Theta) + \nabla \ell(\Theta)\, \nabla \ell(\Theta)^T\right] \Sigma.$$
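A quick numerical check of the first identity (not on the original slide), in one dimension with Σ = 1 and an exactly evaluated Gaussian likelihood; the posterior expectations are computed by self-normalized importance sampling from the prior, and all numerical values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star, tau, y, N = 0.2, 0.1, 1.0, 10**6

theta = rng.normal(theta_star, tau, size=N)   # prior draws Theta ~ N(theta*, tau^2)
logw = -0.5 * (y - theta)**2                  # log L(theta) for L(theta) = N(y; theta, 1), up to a constant
w = np.exp(logw - logw.max()); w /= w.sum()   # self-normalized posterior weights

lhs = np.sum(w * (theta - theta_star))        # E_post[Theta - theta*]
rhs = tau**2 * np.sum(w * (y - theta))        # tau^2 * E_post[grad log L(Theta)]
print(lhs, rhs)                               # the two sides agree up to Monte Carlo error
```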
Expected derivatives and derivatives
We do not want to estimate E_post[∇ℓ(Θ)] but rather ∇ℓ(θ⋆).
Posterior moment expansion:
When τ → 0, we have, for ϕ satisfying some conditions,
$$\mathbb{E}_{\mathrm{post}}[\varphi(\Theta)] = \varphi(\theta^\star) + \frac{\tau^2}{2} \sum_{i=1}^{d} \sum_{j=1}^{d} \left( \nabla^2_{ij} \varphi(\theta^\star) + 2\, \nabla_i \varphi(\theta^\star)\, \nabla_j \ell(\theta^\star) \right) \Sigma_{ij} + O(\tau^4).$$
Expected derivatives and derivatives
Conditions for this to hold:
Let ϕ : R^d → R be a four times continuously differentiable
function. Assume that there exist a constant M < ∞ and δ > 0
such that |∂⁴ϕ(θ) / ∂θ_i ∂θ_j ∂θ_k ∂θ_l| ≤ M for all θ ∈ B_Σ(θ⋆, δ)
and all (i, j, k, l) ∈ {1, . . . , d}⁴.
The likelihood function L is such that both θ ↦ L(θ) and
θ ↦ ϕ(θ) × L(θ) satisfy the same assumption as ϕ.
There exists τ₀ > 0 such that E_{τ₀}[|ϕ(Θ)|] < ∞,
E_{τ₀}[|L(Θ)|] < ∞ and E_{τ₀}[|ϕ(Θ) L(Θ)|] < ∞.
Shift estimators
$$S_\tau(\theta^\star) = \tau^{-2} \Sigma^{-1}\, \mathbb{E}_{\mathrm{post}}[\Theta - \theta^\star] = \nabla \ell(\theta^\star) + O(\tau^2),$$
$$I_\tau(\theta^\star) = \tau^{-4} \Sigma^{-1} \left( \mathbb{V}_{\mathrm{post}}[\Theta] - \tau^2 \Sigma \right) \Sigma^{-1} = \nabla^2 \ell(\theta^\star) + O(\tau^2).$$
Using Monte Carlo approximations of E_post[Θ] and V_post[Θ],
these formulas provide estimators of ∇ℓ(θ⋆) and ∇²ℓ(θ⋆).
Example: normal likelihood
The parameter θ represents a d-dimensional location, and
Y ∼ N(θ, Λ_y⁻¹).
Thus the derivatives of the log-likelihood are
$$\nabla \ell(\theta) = -\Lambda_y (\theta - y), \qquad \nabla^2 \ell(\theta) = -\Lambda_y.$$
Using a prior N(θ⋆, τ²Σ), the posterior is conjugate and the
shift estimators are given by
$$S_\tau(\theta^\star) = \frac{1}{1 + \tau^2 \Sigma \Lambda_y} \left( -\Lambda_y (\theta^\star - y) \right), \qquad I_\tau(\theta^\star) = \frac{1}{1 + \tau^2 \Sigma \Lambda_y} \left( -\Lambda_y \right).$$
Outline
1 Context: intractable derivatives
2 Shift estimators
3 Monte Carlo shift estimators
Monte Carlo shift estimators
Draw θᵢ ∼ N(θ⋆, τ²Σ), for each i ∈ {1, . . . , N}.
Draw ŵᵢ = L̂(θᵢ), for each i ∈ {1, . . . , N}.
Normalize the weights: for each i ∈ {1, . . . , N},
$$\hat{W}_i = \hat{w}_i \Big/ \sum_{j=1}^{N} \hat{w}_j.$$
Then we can estimate τ⁻²Σ⁻¹(E_post[Θ] − θ⋆) by
$$\tau^{-2} \Sigma^{-1} \left( \sum_{i=1}^{N} \hat{W}_i \theta_i - \theta^\star \right),$$
or, even better,
$$\tau^{-2} \Sigma^{-1} \left( \sum_{i=1}^{N} \hat{W}_i \theta_i - \frac{1}{N} \sum_{i=1}^{N} \theta_i \right).$$
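A minimal one-dimensional sketch of this score estimator (not part of the original slides), with Σ = 1; the lik_hat argument stands in for whatever unbiased likelihood estimator is available, here an exactly evaluated Gaussian likelihood so the output can be checked:

```python
import numpy as np

def mc_shift_score(theta_star, tau, lik_hat, N, rng):
    """Monte Carlo shift estimator of the score at theta_star (1-d, Sigma = 1):
    self-normalized importance sampling from the N(theta_star, tau^2) 'prior',
    with the empirical prior mean as a control variate."""
    theta = rng.normal(theta_star, tau, size=N)  # theta_i ~ N(theta*, tau^2)
    w = lik_hat(theta)                           # w_i = L_hat(theta_i); lik_hat assumed vectorized
    W = w / w.sum()                              # normalized weights
    return (np.sum(W * theta) - theta.mean()) / tau**2

rng = np.random.default_rng(0)
lik = lambda t: np.exp(-0.5 * (1.0 - t)**2)      # L(theta) = N(y = 1; theta, 1), up to a constant
print(mc_shift_score(0.0, tau=0.1, lik_hat=lik, N=10**5, rng=rng))  # close to the true score, 1.0
```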
Monte Carlo shift estimators
For the second order derivative, recall that
$$\tau^{-4} \Sigma^{-1} \left( \mathbb{V}_{\mathrm{post}}[\Theta] - \tau^2 \Sigma \right) \Sigma^{-1} = \nabla^2 \ell(\theta^\star) + O(\tau^2).$$
An importance sampling strategy (with control variates) leads,
in the univariate case, to
$$\tau^{-4} \Sigma^{-1} \left[ \sum_{i=1}^{N} \hat{W}_i \left( \theta_i - \sum_{j=1}^{N} \hat{W}_j \theta_j \right)^2 - \frac{1}{N} \sum_{i=1}^{N} \left( \theta_i - \frac{1}{N} \sum_{j=1}^{N} \theta_j \right)^2 \right] \Sigma^{-1},$$
with the obvious extension to the multivariate case.
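Extending the previous sketch to the second derivative (again one-dimensional with Σ = 1, and not from the slides); the sample sizes are illustrative:

```python
import numpy as np

def mc_shift_information(theta_star, tau, lik_hat, N, rng):
    """Monte Carlo shift estimator of the second derivative at theta_star
    (1-d, Sigma = 1): weighted posterior variance minus the empirical prior
    variance (control variate), rescaled by tau^{-4}."""
    theta = rng.normal(theta_star, tau, size=N)
    w = lik_hat(theta)                           # lik_hat assumed vectorized
    W = w / w.sum()
    post_var = np.sum(W * (theta - np.sum(W * theta))**2)
    prior_var = np.mean((theta - theta.mean())**2)
    return (post_var - prior_var) / tau**4

rng = np.random.default_rng(1)
lik = lambda t: np.exp(-0.5 * (1.0 - t)**2)      # L(theta) = N(y = 1; theta, 1), up to a constant
print(mc_shift_information(0.0, tau=0.1, lik_hat=lik, N=10**6, rng=rng))  # close to -1
```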
Monte Carlo shift estimators
Score estimator:
$$S_{N,\tau}(\theta^\star) = \tau^{-2} \Sigma^{-1} \left( \sum_{i=1}^{N} \hat{W}_i \theta_i - \frac{1}{N} \sum_{i=1}^{N} \theta_i \right).$$
Bias:
$$\mathbb{E}\left[S_{N,\tau}(\theta^\star)\right] - \nabla \ell(\theta^\star) = O(\tau^2).$$
Variance:
$$\mathbb{V}\left[S_{N,\tau}(\theta^\star)\right] = O\!\left(\frac{1}{N \tau^2}\right).$$
⇒ same rate of convergence as the finite difference method!
Similar results hold for the estimator of ∇²ℓ(θ⋆).
Monte Carlo shift estimators
The shift estimators are competitive with the gold
standard!
The shift estimators are not better than the gold standard!
So why bother?
Why is Iterated Filtering used in practice?
Let’s have a look at the non-asymptotic behaviour.
Robustness
Recall:
$$S_{N,\tau}(\theta^\star) = \tau^{-2} \Sigma^{-1} \left( \sum_{i=1}^{N} \hat{W}_i \theta_i - \frac{1}{N} \sum_{i=1}^{N} \theta_i \right), \quad \text{where} \quad \hat{W}_i = \frac{\hat{L}(\theta_i)}{\sum_{j=1}^{N} \hat{L}(\theta_j)},$$
and V(L̂(θᵢ)/L(θᵢ)) = v/M.
Scenario: v is very large. Then the normalized weights degenerate
onto a single draw, so that Σᵢ₌₁ᴺ Ŵᵢθᵢ ≈ θⱼ for some j.
We obtain, for all v,
$$\mathbb{V}\left(S_{N,\tau}(\theta^\star)\right) \le C \tau^{-2}.$$
On the other hand, the variance of finite difference estimators
increases to ∞ with v.
Example: normal likelihood
Latent variable model:
$$X \sim N\!\left(\theta,\; \Lambda_y^{-1} - \lambda \Lambda_{y|x}^{-1}\right), \qquad Y \mid X = x \sim N\!\left(x,\; \lambda \Lambda_{y|x}^{-1}\right),$$
for fixed matrices Λ_y and Λ_{y|x}, and λ ∈ (0, 1).
This still corresponds to Y ∼ N(θ, Λ_y⁻¹), but a naive Monte
Carlo approach has a relative variance that diverges as λ goes
to zero.
The following figure represents the behaviour of the MSE for
the Monte Carlo shift estimator and the finite difference
estimator as λ goes to zero, i.e. v → ∞.
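A sketch of this blow-up in one dimension, taking Λ_y = Λ_{y|x} = 1 (an assumed simplification; the slides give no numerical values):

```python
import numpy as np
from scipy.stats import norm

def naive_lik_hat(theta, y, lam, M, rng):
    """Naive unbiased estimator of L(theta) = N(y; theta, 1) for the latent
    variable model X ~ N(theta, 1 - lam), Y | X = x ~ N(x, lam), averaging
    g(y | X_i) over M prior draws. Its relative variance blows up as lam -> 0."""
    x = rng.normal(theta, np.sqrt(1.0 - lam), size=M)
    return norm.pdf(y, loc=x, scale=np.sqrt(lam)).mean()

rng = np.random.default_rng(2)
y, theta, M = 1.0, 0.0, 1000
for lam in [0.5, 0.1, 0.01]:
    est = np.array([naive_lik_hat(theta, y, lam, M, rng) for _ in range(500)])
    print(lam, est.var() / norm.pdf(y, loc=theta, scale=1.0)**2)  # grows as lam decreases
```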
Robustness
Figure: MSE of the Monte Carlo shift estimator and of the finite difference estimator as λ decreases from 10⁻¹ to 10⁻⁶, illustrating the robustness of the Monte Carlo shift estimator when the variance of the likelihood estimator increases.
Discussion
Monte Carlo shift estimators and finite difference
estimators: equivalent in mean squared error (N → ∞).
Monte Carlo shift estimators are robust to the noise of the
likelihood estimators (v → ∞). This might explain the
observed gain within optimization procedures.
There are specific forms of shift estimators for hidden
Markov models (e.g. Iterated Filtering).
Behaviour of the posterior distribution when the prior is a
concentrated normal distribution: interesting in its own right?
Bibliography
Main references:
Inference for nonlinear dynamical systems, Ionides, Breto,
King, PNAS, 2006.
Iterated filtering, Ionides, Bhadra, Atchadé, King, Annals
of Statistics, 2011.
Derivative-Free Estimation of the Score Vector
and Observed Information Matrix,
Doucet, Jacob, Rubenthaler, (on arXiv, to be updated).