Estimation of the score vector and observed
information matrix in intractable models
Arnaud Doucet (University of Oxford)
Pierre E. Jacob (University of Oxford)
Sylvain Rubenthaler (Université Nice Sophia Antipolis)
June 15th, 2015
Outline
1 Context: intractable derivatives
2 Shift estimators
3 Monte Carlo shift estimators
Motivation
Derivatives of the likelihood help with optimization and sampling.
They become crucial when the parameter space is
high-dimensional.
For many models, these derivatives are intractable.
One can resort to approximation techniques.
Motivation: hidden Markov models
Figure: graph representation of a general hidden Markov model, with hidden states X0, X1, ..., XT, observations y1, ..., yT, and parameter θ.
Motivation: hidden Markov models
The associated log-likelihood is
$$\ell(\theta) = \log p(y_{1:T}; \theta) = \log \int_{\mathsf{X}^{T+1}} \prod_{t=1}^{T} g(y_t \mid x_t, \theta)\, \mu(dx_1 \mid \theta) \prod_{t=2}^{T} f(dx_t \mid x_{t-1}, \theta).$$
This high-dimensional integral can be approximated
efficiently using particle filters.
However, the approximation itself is “discontinuous” as a
function of the parameter θ, even when the random “seed” is
fixed, as the next two figures illustrate.
Motivation: hidden Markov models
Figure: particle filter approximations of the log-likelihood as a function of θ ∈ [0.5, 0.6], using a different seed for each θ, compared to the true log-likelihood.
Motivation: hidden Markov models
Figure: particle filter approximations of the log-likelihood as a function of θ ∈ [0.5, 0.6], using the same seed for each θ, compared to the true log-likelihood.
Motivation: general context
Let ℓ(θ) denote a “log-likelihood”, although it could be any
function.
Let L(θ) = exp ℓ(θ) denote the “likelihood”.
Assume that we have access to estimators L̂(θ) of L(θ),
computed with M Monte Carlo samples, such that
$$\mathbb{E}[\hat{L}(\theta)] = L(\theta) \quad \text{and} \quad \mathbb{V}\!\left[\frac{\hat{L}(\theta)}{L(\theta)}\right] = \frac{v(\theta)}{M} \approx \frac{v}{M} \quad (\forall \theta).$$
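As a concrete illustration (not from the slides), here is a minimal sketch of such an estimator, for a toy model X ∼ N(θ, 1), Y | X = x ∼ N(x, 1), where L(θ) = N(y; θ, 2) is known exactly; the model, sample sizes, and all names are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

def likelihood_estimator(theta, y, M, rng):
    """Unbiased estimator of L(theta) = N(y; theta, 2) for the toy model
    X ~ N(theta, 1), Y | X = x ~ N(x, 1), obtained by averaging g(y | X_i)
    over M prior draws X_i. Its relative variance behaves like v / M."""
    x = rng.normal(theta, 1.0, size=M)
    return norm.pdf(y, loc=x, scale=1.0).mean()

rng = np.random.default_rng(0)
y = 0.3
exact = norm.pdf(y, loc=0.0, scale=np.sqrt(2.0))
estimates = np.array([likelihood_estimator(0.0, y, M=100, rng=rng) for _ in range(2000)])
print(estimates.mean(), exact)            # unbiasedness, up to Monte Carlo error
print(estimates.var() / exact**2)         # relative variance, roughly v / M
```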
Finite difference
First derivative:
$$\widehat{\nabla \ell}(\theta^\star) = \frac{\log \hat{L}(\theta^\star + h) - \log \hat{L}(\theta^\star - h)}{2h}$$
converges to ∇ℓ(θ⋆) when M → ∞ and h → 0.
Second derivative:
$$\widehat{\nabla^2 \ell}(\theta^\star) = \frac{\log \hat{L}(\theta^\star + h) - 2 \log \hat{L}(\theta^\star) + \log \hat{L}(\theta^\star - h)}{h^2}$$
converges to ∇²ℓ(θ⋆) when M → ∞ and h → 0.
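A minimal sketch of these estimators (not from the slides), built on the toy Gaussian model of the previous sketch; the values of M, h and y are illustrative choices:

```python
import numpy as np
from scipy.stats import norm

def noisy_loglik(theta, y=0.3, M=100_000, rng=np.random.default_rng(1)):
    """Log of an unbiased likelihood estimator for the toy model above."""
    x = rng.normal(theta, 1.0, size=M)
    return np.log(norm.pdf(y, loc=x, scale=1.0).mean())

def finite_difference(theta, h, loglik=noisy_loglik):
    """Central finite-difference estimators of the first and second
    derivatives of the log-likelihood, from noisy evaluations."""
    lp, l0, lm = loglik(theta + h), loglik(theta), loglik(theta - h)
    return (lp - lm) / (2 * h), (lp - 2 * l0 + lm) / h**2

# Exact values at theta = 0: score = y/2 = 0.15, second derivative = -0.5.
print(finite_difference(0.0, h=0.2))  # close to (0.15, -0.5), up to noise
```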
Finite difference
Bias:
$$\mathbb{E}\big[\widehat{\nabla \ell}(\theta^\star)\big] - \nabla \ell(\theta^\star) = O(h^2).$$
Variance:
$$\mathbb{V}\big[\widehat{\nabla \ell}(\theta^\star)\big] = O\!\left(\frac{1}{M h^2}\right).$$
Optimal rate of convergence for the first derivative:
h ∼ M^{−1/6}, leading to MSE ∼ M^{−2/3}.
For the second derivative:
h ∼ M^{−1/8}, leading to MSE ∼ M^{−1/2}.
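These rates follow from balancing squared bias against variance; the variance order O(1/(Mh⁴)) used below for the second derivative is not stated on the slide, but is the standard order for central second differences of noisy evaluations:
$$\mathrm{MSE} = \underbrace{O(h^4)}_{\mathrm{bias}^2} + \underbrace{O\!\left(\frac{1}{M h^2}\right)}_{\mathrm{variance}} \;\Longrightarrow\; h^6 \propto M^{-1}, \text{ i.e. } h \sim M^{-1/6} \text{ and } \mathrm{MSE} \sim M^{-2/3};$$
for the second derivative, MSE = O(h⁴) + O(1/(Mh⁴)) gives h ∼ M^{−1/8} and MSE ∼ M^{−1/2}.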
Outline
1 Context: intractable derivatives
2 Shift estimators
3 Monte Carlo shift estimators
Iterated Filtering
Ionides, Bhadra, Atchadé, King, Iterated filtering, AoS, 2011.
Given a log-likelihood ℓ and a given point θ⋆, consider a
“prior” distribution centered on θ⋆, for instance Θ ∼ N(θ⋆, τ²Σ).
We can consider “posterior” expectations, when they exist:
$$\mathbb{E}_{\mathrm{post}}[\varphi(\Theta)] = \frac{\mathbb{E}[\varphi(\Theta) \exp \ell(\Theta)]}{\mathbb{E}[\exp \ell(\Theta)]},$$
where the expectations E are with respect to the “prior”.
Iterated Filtering
Under conditions on the prior distribution, and on the
log-likelihood, one gets the following result.
Posterior expectation when the prior variance goes to zero:
For any θ⋆, there exists a C such that, for τ small enough,
$$\left| \tau^{-2} \Sigma^{-1} \left( \mathbb{E}_{\mathrm{post}}[\Theta] - \theta^\star \right) - \nabla \ell(\theta^\star) \right| \le C \tau.$$
Phrased simply,
$$\frac{\text{posterior mean} - \text{prior mean}}{\text{prior variance}} \approx \text{score}.$$
Result from Ionides, Bhadra, Atchadé, King, Iterated filtering, 2011.
Normal priors
Let us focus on the normal case: Θ ∼ N(θ⋆, τ²Σ).
Stein’s Lemma says...
Assume that E[|∇_i L(Θ)|] < ∞ and E[|∇²_{ij} L(Θ)|] < ∞ for
i, j ∈ {1, . . . , d}. Then we have
$$\mathbb{E}_{\mathrm{post}}[\Theta - \theta^\star] = \tau^2 \Sigma\, \mathbb{E}_{\mathrm{post}}\!\left[\nabla \ell(\Theta)\right],$$
and
$$\mathbb{E}_{\mathrm{post}}\!\left[(\Theta - \theta^\star)(\Theta - \theta^\star)^T\right] = \tau^2 \Sigma + \tau^4\, \Sigma\, \mathbb{E}_{\mathrm{post}}\!\left[\nabla^2 \ell(\Theta) + \nabla \ell(\Theta)\, \nabla \ell(\Theta)^T\right] \Sigma.$$
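A quick numerical check of the first identity (not on the original slide), in one dimension with Σ = 1 and an exactly evaluated Gaussian likelihood; the posterior expectations are computed by self-normalized importance sampling from the prior, and all numerical values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star, tau, y, N = 0.2, 0.1, 1.0, 10**6

theta = rng.normal(theta_star, tau, size=N)   # prior draws Theta ~ N(theta*, tau^2)
logw = -0.5 * (y - theta)**2                  # log L(theta) for L(theta) = N(y; theta, 1), up to a constant
w = np.exp(logw - logw.max()); w /= w.sum()   # self-normalized posterior weights

lhs = np.sum(w * (theta - theta_star))        # E_post[Theta - theta*]
rhs = tau**2 * np.sum(w * (y - theta))        # tau^2 * E_post[grad log L(Theta)]
print(lhs, rhs)                               # the two sides agree up to Monte Carlo error
```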
Expected derivatives and derivatives
We do not want to estimate E_post[∇ℓ(Θ)] but rather ∇ℓ(θ⋆).
Posterior moment expansion:
When τ → 0, we have, for ϕ satisfying some conditions,
$$\mathbb{E}_{\mathrm{post}}[\varphi(\Theta)] = \varphi(\theta^\star) + \frac{\tau^2}{2} \sum_{i=1}^{d} \sum_{j=1}^{d} \left( \nabla^2_{ij} \varphi(\theta^\star) + 2\, \nabla_i \varphi(\theta^\star)\, \nabla_j \ell(\theta^\star) \right) \Sigma_{ij} + O(\tau^4).$$
Expected derivatives and derivatives
Conditions for this to hold:
Let ϕ : R^d → R be a four times continuously differentiable
function. Assume that there exist a constant M < ∞ and δ > 0
such that |∂⁴ϕ(θ) / ∂θ_i ∂θ_j ∂θ_k ∂θ_l| ≤ M for all θ ∈ B_Σ(θ⋆, δ)
and all (i, j, k, l) ∈ {1, . . . , d}⁴.
The likelihood function L is such that both θ ↦ L(θ) and
θ ↦ ϕ(θ) × L(θ) satisfy the same assumption as ϕ.
There exists τ₀ > 0 such that E_{τ₀}[|ϕ(Θ)|] < ∞,
E_{τ₀}[|L(Θ)|] < ∞ and E_{τ₀}[|ϕ(Θ) L(Θ)|] < ∞.
Shift estimators
$$S_\tau(\theta^\star) = \tau^{-2} \Sigma^{-1}\, \mathbb{E}_{\mathrm{post}}[\Theta - \theta^\star] = \nabla \ell(\theta^\star) + O(\tau^2),$$
$$I_\tau(\theta^\star) = \tau^{-4} \Sigma^{-1} \left( \mathbb{V}_{\mathrm{post}}[\Theta] - \tau^2 \Sigma \right) \Sigma^{-1} = \nabla^2 \ell(\theta^\star) + O(\tau^2).$$
Using Monte Carlo approximations of E_post[Θ] and V_post[Θ],
these formulas provide estimators of ∇ℓ(θ⋆) and ∇²ℓ(θ⋆).
Example: normal likelihood
The parameter θ represents a d-dimensional location, and
Y ∼ N(θ, Λ_y⁻¹).
Thus the derivatives of the log-likelihood are
$$\nabla \ell(\theta) = -\Lambda_y (\theta - y), \qquad \nabla^2 \ell(\theta) = -\Lambda_y.$$
Using a prior N(θ⋆, τ²Σ), the posterior is conjugate and the
shift estimators are given by
$$S_\tau(\theta^\star) = \frac{1}{1 + \tau^2 \Sigma \Lambda_y} \left( -\Lambda_y (\theta^\star - y) \right), \qquad I_\tau(\theta^\star) = \frac{1}{1 + \tau^2 \Sigma \Lambda_y} \left( -\Lambda_y \right).$$
Outline
1 Context: intractable derivatives
2 Shift estimators
3 Monte Carlo shift estimators
Monte Carlo shift estimators
Draw θᵢ ∼ N(θ⋆, τ²Σ), for each i ∈ {1, . . . , N}.
Draw ŵᵢ = L̂(θᵢ), for each i ∈ {1, . . . , N}.
Normalize the weights: for each i ∈ {1, . . . , N},
$$\hat{W}_i = \hat{w}_i \Big/ \sum_{j=1}^{N} \hat{w}_j.$$
Then we can estimate τ⁻²Σ⁻¹(E_post[Θ] − θ⋆) by
$$\tau^{-2} \Sigma^{-1} \left( \sum_{i=1}^{N} \hat{W}_i \theta_i - \theta^\star \right),$$
or, even better,
$$\tau^{-2} \Sigma^{-1} \left( \sum_{i=1}^{N} \hat{W}_i \theta_i - \frac{1}{N} \sum_{i=1}^{N} \theta_i \right).$$
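A minimal one-dimensional sketch of this score estimator (not part of the original slides), with Σ = 1; the lik_hat argument stands in for whatever unbiased likelihood estimator is available, here an exactly evaluated Gaussian likelihood so the output can be checked:

```python
import numpy as np

def mc_shift_score(theta_star, tau, lik_hat, N, rng):
    """Monte Carlo shift estimator of the score at theta_star (1-d, Sigma = 1):
    self-normalized importance sampling from the N(theta_star, tau^2) 'prior',
    with the empirical prior mean as a control variate."""
    theta = rng.normal(theta_star, tau, size=N)  # theta_i ~ N(theta*, tau^2)
    w = lik_hat(theta)                           # w_i = L_hat(theta_i); lik_hat assumed vectorized
    W = w / w.sum()                              # normalized weights
    return (np.sum(W * theta) - theta.mean()) / tau**2

rng = np.random.default_rng(0)
lik = lambda t: np.exp(-0.5 * (1.0 - t)**2)      # L(theta) = N(y = 1; theta, 1), up to a constant
print(mc_shift_score(0.0, tau=0.1, lik_hat=lik, N=10**5, rng=rng))  # close to the true score, 1.0
```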
Monte Carlo shift estimators
For the second order derivative, recall that
$$\tau^{-4} \Sigma^{-1} \left( \mathbb{V}_{\mathrm{post}}[\Theta] - \tau^2 \Sigma \right) \Sigma^{-1} = \nabla^2 \ell(\theta^\star) + O(\tau^2).$$
An importance sampling strategy (with control variates) leads,
in the univariate case, to
$$\tau^{-4} \Sigma^{-1} \left[ \sum_{i=1}^{N} \hat{W}_i \left( \theta_i - \sum_{j=1}^{N} \hat{W}_j \theta_j \right)^2 - \frac{1}{N} \sum_{i=1}^{N} \left( \theta_i - \frac{1}{N} \sum_{j=1}^{N} \theta_j \right)^2 \right] \Sigma^{-1},$$
with the obvious extension to the multivariate case.
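Extending the previous sketch to the second derivative (again one-dimensional with Σ = 1, and not from the slides); the sample sizes are illustrative:

```python
import numpy as np

def mc_shift_information(theta_star, tau, lik_hat, N, rng):
    """Monte Carlo shift estimator of the second derivative at theta_star
    (1-d, Sigma = 1): weighted posterior variance minus the empirical prior
    variance (control variate), rescaled by tau^{-4}."""
    theta = rng.normal(theta_star, tau, size=N)
    w = lik_hat(theta)                           # lik_hat assumed vectorized
    W = w / w.sum()
    post_var = np.sum(W * (theta - np.sum(W * theta))**2)
    prior_var = np.mean((theta - theta.mean())**2)
    return (post_var - prior_var) / tau**4

rng = np.random.default_rng(1)
lik = lambda t: np.exp(-0.5 * (1.0 - t)**2)      # L(theta) = N(y = 1; theta, 1), up to a constant
print(mc_shift_information(0.0, tau=0.1, lik_hat=lik, N=10**6, rng=rng))  # close to -1
```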
Monte Carlo shift estimators
Score estimator:
$$S_{N,\tau}(\theta^\star) = \tau^{-2} \Sigma^{-1} \left( \sum_{i=1}^{N} \hat{W}_i \theta_i - \frac{1}{N} \sum_{i=1}^{N} \theta_i \right).$$
Bias:
$$\mathbb{E}\left[S_{N,\tau}(\theta^\star)\right] - \nabla \ell(\theta^\star) = O(\tau^2).$$
Variance:
$$\mathbb{V}\left[S_{N,\tau}(\theta^\star)\right] = O\!\left(\frac{1}{N \tau^2}\right).$$
⇒ same rate of convergence as the finite difference method!
Similar results hold for the estimator of ∇²ℓ(θ⋆).
Monte Carlo shift estimators
The shift estimators are competitive with the gold
standard!
The shift estimators are not better than the gold standard!
So why bother?
Why is Iterated Filtering used in practice?
Let’s have a look at the non-asymptotic behaviour.
Robustness
Recall:
$$S_{N,\tau}(\theta^\star) = \tau^{-2} \Sigma^{-1} \left( \sum_{i=1}^{N} \hat{W}_i \theta_i - \frac{1}{N} \sum_{i=1}^{N} \theta_i \right), \quad \text{where} \quad \hat{W}_i = \frac{\hat{L}(\theta_i)}{\sum_{j=1}^{N} \hat{L}(\theta_j)},$$
and V(L̂(θᵢ)/L(θᵢ)) = v/M.
Scenario: v is very large. Then the normalized weights degenerate
onto a single draw, so that Σᵢ₌₁ᴺ Ŵᵢθᵢ ≈ θⱼ for some j.
We obtain, for all v,
$$\mathbb{V}\left(S_{N,\tau}(\theta^\star)\right) \le C \tau^{-2}.$$
On the other hand, the variance of finite difference estimators
increases to ∞ with v.
Example: normal likelihood
Latent variable model:
$$X \sim N\!\left(\theta,\; \Lambda_y^{-1} - \lambda \Lambda_{y|x}^{-1}\right), \qquad Y \mid X = x \sim N\!\left(x,\; \lambda \Lambda_{y|x}^{-1}\right),$$
for fixed matrices Λ_y and Λ_{y|x}, and λ ∈ (0, 1).
This still corresponds to Y ∼ N(θ, Λ_y⁻¹), but a naive Monte
Carlo approach has a relative variance that diverges as λ goes
to zero.
The following figure represents the behaviour of the MSE for
the Monte Carlo shift estimator and the finite difference
estimator as λ goes to zero, i.e. v → ∞.
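A sketch of this blow-up in one dimension, taking Λ_y = Λ_{y|x} = 1 (an assumed simplification; the slides give no numerical values):

```python
import numpy as np
from scipy.stats import norm

def naive_lik_hat(theta, y, lam, M, rng):
    """Naive unbiased estimator of L(theta) = N(y; theta, 1) for the latent
    variable model X ~ N(theta, 1 - lam), Y | X = x ~ N(x, lam), averaging
    g(y | X_i) over M prior draws. Its relative variance blows up as lam -> 0."""
    x = rng.normal(theta, np.sqrt(1.0 - lam), size=M)
    return norm.pdf(y, loc=x, scale=np.sqrt(lam)).mean()

rng = np.random.default_rng(2)
y, theta, M = 1.0, 0.0, 1000
for lam in [0.5, 0.1, 0.01]:
    est = np.array([naive_lik_hat(theta, y, lam, M, rng) for _ in range(500)])
    print(lam, est.var() / norm.pdf(y, loc=theta, scale=1.0)**2)  # grows as lam decreases
```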
Robustness
Figure: MSE of the Monte Carlo shift estimator and of the finite difference estimator as λ decreases from 10⁻¹ to 10⁻⁶, illustrating the robustness of the Monte Carlo shift estimator when the variance of the likelihood estimator increases.
Discussion
Monte Carlo shift estimators and finite difference
estimators: equivalent in mean squared error (N → ∞).
Monte Carlo shift estimators are robust to the noise of the
likelihood estimators (v → ∞). This might explain the
observed gain within optimization procedures.
There are specific forms of shift estimators for hidden
Markov models (e.g. Iterated Filtering).
Behaviour of the posterior distribution when the prior is a
concentrated normal distribution: interesting in its own right?
Bibliography
Main references:
Inference for nonlinear dynamical systems, Ionides, Breto,
King, PNAS, 2006.
Iterated filtering, Ionides, Bhadra, Atchadé, King, Annals
of Statistics, 2011.
Derivative-Free Estimation of the Score Vector
and Observed Information Matrix,
Doucet, Jacob, Rubenthaler, (on arXiv, to be updated).