Let’s Practice What We Preach:
Likelihood Methods for Monte Carlo Data

Xiao-Li Meng
Department of Statistics, Harvard University

September 24, 2011

Based on
Kong, McCullagh, Meng, Nicolae, and Tan (2003, JRSS-B, with discussions);
Kong, McCullagh, Meng, and Nicolae (2006, Doksum Festschrift);
Tan (2004, JASA); ..., Meng and Tan (201X)
Importance sampling (IS)

- Estimand:
  $$c_1 = \int_\Gamma q_1(x)\,\mu(dx) = \int_\Gamma \frac{q_1(x)}{p_2(x)}\, p_2(x)\,\mu(dx).$$
- Data: $\{X_{i2},\ i = 1, \ldots, n_2\} \sim p_2 = q_2/c_2$
- Estimating Equation (EE):
  $$r \equiv \frac{c_1}{c_2} = E_2\!\left[\frac{q_1(X)}{q_2(X)}\right].$$
- The EE estimator:
  $$\hat r = \frac{1}{n_2}\sum_{i=1}^{n_2} \frac{q_1(X_{i2})}{q_2(X_{i2})}.$$
- Standard IS estimator for $c_1$ when $c_2 = 1$.
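To make the EE/IS estimator concrete, here is a minimal Python sketch (my own illustration, not from the slides); q1 and q2 are hypothetical one-dimensional kernels, with q2 normalized so that $c_2 = 1$ and $\hat r$ estimates $c_1$ directly.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical choices: q1 is an unnormalized N(1, 0.5^2) kernel (true c1 = 0.5*sqrt(2*pi)),
    # q2 is the standard normal density, so c2 = 1 and p2 = q2.
    q1 = lambda x: np.exp(-0.5 * ((x - 1.0) / 0.5) ** 2)
    q2 = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

    n2 = 10_000
    x2 = rng.standard_normal(n2)                 # draws X_{i2} from p2

    r_hat = np.mean(q1(x2) / q2(x2))             # the EE / importance sampling estimator
    print(r_hat, 0.5 * np.sqrt(2.0 * np.pi))     # estimate vs. the true value of c1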
What about MLE?

- The “likelihood” is:
  $$f(X_{12}, \ldots, X_{n_2 2}) = \prod_{i=1}^{n_2} p_2(X_{i2}) \quad\text{— free of the estimand } c_1!$$
- So why are $\{X_{i2},\ i = 1, \ldots, n_2\}$ even relevant?
- Violation of the likelihood principle?
- What are we “inferring”?
- What is the “unknown” model parameter?
Bridge sampling (BS)

- Data: $\{X_{ij},\ i = 1, \ldots, n_j\} \sim p_j = q_j/c_j,\ j = 1, 2$
- Estimating Equation (Meng and Wong, 1996):
  $$r \equiv \frac{c_1}{c_2} = \frac{E_2[\alpha(X)\, q_1(X)]}{E_1[\alpha(X)\, q_2(X)]}, \qquad \forall\,\alpha:\ 0 < \Big|\int \alpha\, q_1 q_2\, d\mu\Big| < \infty$$
- Optimal choice: $\alpha_O(x) \propto [n_1 q_1(x) + n_2 r q_2(x)]^{-1}$
- Optimal estimator $\hat r_O$, the limit of
  $$\hat r_O^{(t+1)} = \frac{\dfrac{1}{n_2}\displaystyle\sum_{i=1}^{n_2} \dfrac{q_1(X_{i2})}{s_1 q_1(X_{i2}) + s_2\, \hat r_O^{(t)} q_2(X_{i2})}}{\dfrac{1}{n_1}\displaystyle\sum_{i=1}^{n_1} \dfrac{q_2(X_{i1})}{s_1 q_1(X_{i1}) + s_2\, \hat r_O^{(t)} q_2(X_{i1})}},$$
  where $s_j = n_j/(n_1 + n_2)$.
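A small Python sketch of this iteration (my illustration of the displayed recursion; q1, q2, x1, x2 are assumed inputs):

    import numpy as np

    def bridge_ratio(q1, q2, x1, x2, n_iter=50, r_init=1.0):
        """Iterative optimal bridge sampling estimate of r = c1/c2 (Meng and Wong, 1996).
        q1, q2: unnormalized density functions; x1 ~ p1, x2 ~ p2 (arrays of draws)."""
        n1, n2 = len(x1), len(x2)
        s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)
        q1_x1, q2_x1 = q1(x1), q2(x1)
        q1_x2, q2_x2 = q1(x2), q2(x2)
        r = r_init
        for _ in range(n_iter):
            num = np.mean(q1_x2 / (s1 * q1_x2 + s2 * r * q2_x2))
            den = np.mean(q2_x1 / (s1 * q1_x1 + s2 * r * q2_x1))
            r = num / den
        return r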
What about MLE?

- The “likelihood” is:
  $$\prod_{j=1}^{2}\prod_{i=1}^{n_j} \frac{q_j(X_{ij})}{c_j} \;\propto\; c_1^{-n_1} c_2^{-n_2} \quad\text{— free of data!}$$
- What went wrong: $c_j$ is not a “free parameter” because $c_j = \int_\Gamma q_j(x)\,\mu(dx)$ and $q_j$ is known.
- So what is the “unknown” model parameter?
- Turns out $\hat r_O$ is the same as Bennett’s (1976) optimal acceptance ratio estimator, as well as Geyer’s (1994) reversed logistic regression estimator.
- So why is that? Can it be improved upon without any “sleight of hand”?
Pretending the measure is unknown!

- Because
  $$c = \int_\Gamma q(x)\,\mu(dx),$$
  and $q$ is known in the sense that we can evaluate it at any sample value, the only way to make $c$ “unknown” is to assume the underlying measure $\mu$ is “unknown”.
- This is natural because Monte Carlo simulation means we use samples to represent, and thus estimate/infer, the underlying population $q(x)\mu(dx)$, and hence estimate/infer $\mu$ since $q$ is known.
- Monte Carlo integration is about finding a tractable discrete $\hat\mu$ to approximate the intractable $\mu$.
Importance Sampling Likelihood

- Estimand: $c_1 = \int_\Gamma q_1(x)\,\mu(dx)$
- Data: $\{X_{i2},\ i = 1, \ldots, n_2\}$ i.i.d. $\sim c_2^{-1} q_2(x)\,\mu(dx)$
- Likelihood for $\mu$:
  $$L(\mu) = \prod_{i=1}^{n_2} c_2^{-1}\, q_2(X_{i2})\,\mu(X_{i2}).$$
  Note that $c_2$ is a functional of $\mu$.
- The nonparametric MLE of $\mu$ is
  $$\hat\mu(dx) = \frac{\hat P(dx)}{q_2(x)}, \qquad \hat P \text{ — empirical measure.}$$
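A quick derivation of this form (my reconstruction; not spelled out on the slide): restrict $\mu$ to point masses $w_i$ at the observed draws, so that $c_2(\mu) = \sum_k q_2(X_{k2})\,w_k$ and

$$L(w) = \prod_{i=1}^{n_2} \frac{q_2(X_{i2})\,w_i}{\sum_{k=1}^{n_2} q_2(X_{k2})\,w_k}.$$

$L$ is unchanged if all the $w_i$ are rescaled by a common factor, so fix the scale by $\sum_k q_2(X_{k2})\,w_k = 1$; maximizing $\prod_i w_i$ under this constraint (Lagrange multipliers) gives $w_i = [\,n_2\, q_2(X_{i2})\,]^{-1}$, i.e. $\hat\mu(dx) = \hat P(dx)/q_2(x)$, and the MLE of any $c_1$ follows by plugging $\hat\mu$ into $\int_\Gamma q_1\,d\mu$.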
Importance Sampling Likelihood

- Thus the MLE for $r \equiv c_1/c_2$ is
  $$\hat r = \int q_1(x)\,\hat\mu(dx) = \frac{1}{n_2}\sum_{i=1}^{n_2} \frac{q_1(X_{i2})}{q_2(X_{i2})}.$$
- When $c_2 = 1$, $q_2 = p_2$, and the standard IS estimator for $c_1$ is obtained.
- $\{X_{i2},\ i = 1, \ldots, n_2\}$ is (minimally) sufficient for $\mu$ on $S_2 = \{x : q_2(x) > 0\}$, and hence $\hat c_1$ is guaranteed to be consistent only when $S_1 \subset S_2$.
Bridge Sampling Likelihood

- Estimand: $\propto c_j = \int_\Gamma q_j(x)\,\mu(dx),\ j = 1, \ldots, J$.
- Data: $\{X_{ij},\ 1 \le i \le n_j\} \sim c_j^{-1} q_j(x)\,\mu(dx),\ 1 \le j \le J$
- Likelihood for $\mu$: $L(\mu) = \prod_{j=1}^{J}\prod_{i=1}^{n_j} c_j^{-1} q_j(X_{ij})\,\mu(X_{ij})$
- Writing $\theta(x) = \log \mu(x)$, then
  $$\log L(\mu) = n\int_\Gamma \theta(x)\, d\hat P - \sum_{j=1}^{J} n_j \log c_j(\theta),$$
  where $n = \sum_j n_j$ and $\hat P$ is the empirical measure on $\{X_{ij},\ 1 \le i \le n_j,\ 1 \le j \le J\}$.
Bridge Sampling Likelihood

- The MLE for $\mu$ is given by equating the canonical sufficient statistic $\hat P$ to its expectation:
  $$n\hat P(dx) = \sum_{j=1}^{J} n_j\, \hat c_j^{-1} q_j(x)\,\hat\mu(dx),
  \qquad\text{i.e.}\qquad
  \hat\mu(dx) = \frac{n\hat P(dx)}{\sum_{j=1}^{J} n_j\, \hat c_j^{-1} q_j(x)}. \qquad (A)$$
- Consequently, the MLE for $\{c_1, \ldots, c_J\}$ must satisfy
  $$\hat c_r = \int_\Gamma q_r(x)\, d\hat\mu = \sum_{j=1}^{J}\sum_{i=1}^{n_j} \frac{q_r(x_{ij})}{\sum_{s=1}^{J} n_s\, \hat c_s^{-1} q_s(x_{ij})}. \qquad (B)$$
- (B) is the “dual” equation of (A), and is also the same as the equation for the optimal multiple bridge sampling estimator (Tan 2004).
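As an illustration (my sketch, not the authors' code), equation (B) can be solved by a simple fixed-point iteration over the pooled draws; since the $c_j$'s enter (B) only through ratios, each iterate is rescaled so that $\hat c_1 = 1$.

    import numpy as np

    def mle_constants(q_funcs, samples, n_iter=200):
        """Self-consistent iteration for eq. (B): ratios of normalizing constants
        from multiple samplers.  q_funcs: list of J unnormalized densities;
        samples[j]: draws from q_funcs[j]/c_j.  Returns c_hat with c_hat[0] = 1,
        since the c's are only identified up to a common scale here."""
        J = len(q_funcs)
        n = np.array([len(x) for x in samples], dtype=float)
        x_all = np.concatenate(samples)                        # pool all draws
        q = np.vstack([qs(x_all) for qs in q_funcs])           # q[s, k] = q_s(x_k)
        c = np.ones(J)
        for _ in range(n_iter):
            denom = (n[:, None] / c[:, None] * q).sum(axis=0)  # sum_s n_s c_s^{-1} q_s(x_k)
            c = (q / denom).sum(axis=1)                        # eq. (B) for each r
            c = c / c[0]                                       # fix the scale: c_1 = 1
        return c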
But We Can Ignore Less ...

- Restrict the parameter space for $\mu$ by using some knowledge of the known $\mu$; that is, set up a sub-model.
- The new MLE has a smaller asymptotic variance under the sub-model than under the full model.
- Examples:
    - Group-invariance sub-model
    - Linear sub-model
    - Log-linear sub-model
A Universally Improved IS

- Estimand: $r = c_1/c_2$; $c_j = \int_{\mathbb{R}^d} q_j(x)\,\mu(dx)$
- Data: $\{X_{i2},\ i = 1, \ldots, n_2\}$ i.i.d. $\sim c_2^{-1} q_2\,\mu(dx)$
- Taking $G = \{I_d, -I_d\}$ leads to
  $$\hat r_G = \frac{1}{n_2}\sum_{i=1}^{n_2} \frac{q_1(X_{i2}) + q_1(-X_{i2})}{q_2(X_{i2}) + q_2(-X_{i2})}.$$
- Because of the Rao-Blackwellization, $V(\hat r_G) \le V(\hat r)$.
- Need twice as many evaluations, but typically this is a small insurance premium.
- Consider $S_1 = \mathbb{R}$ and $S_2 = \mathbb{R}^+$. Then $\hat r_G$ is consistent for $r$:
  $$\hat r_G = \frac{1}{n_2}\sum_{i=1}^{n_2} \frac{q_1(X_{i2})}{q_2(X_{i2})} + \frac{1}{n_2}\sum_{i=1}^{n_2} \frac{q_1(-X_{i2})}{q_2(X_{i2})},$$
  but standard IS $\hat r$ only estimates $\int_0^\infty q_1(x)\,\mu(dx)/c_2$.
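A quick numerical illustration of the last two bullets (my own example; the kernels are hypothetical): take $q_1$ to be a standard normal kernel on all of $\mathbb{R}$ and $q_2$ the Exp(1) density on $\mathbb{R}^+$, so standard IS misses the negative half-line while $\hat r_G$ does not.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical example with S1 = R, S2 = R^+ (my choice, not from the slides):
    q1 = lambda x: np.exp(-0.5 * x ** 2)                    # c1 = sqrt(2*pi), support R
    q2 = lambda x: np.where(x > 0, np.exp(-x), 0.0)         # c2 = 1, support R^+ (Exp(1))

    n2 = 100_000
    x = rng.exponential(size=n2)                            # draws from p2

    r_is = np.mean(q1(x) / q2(x))                           # standard IS: targets only int_0^inf q1 dmu / c2
    r_g  = np.mean((q1(x) + q1(-x)) / (q2(x) + q2(-x)))     # group-averaged, G = {I, -I}

    print(r_is, r_g, np.sqrt(2 * np.pi))                    # ~1.25 vs ~2.51 (= true r)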
There are many more improvements ...

- Define a sub-model by requiring $\mu$ to be $G$-invariant, where $G$ is a finite group on $\Gamma$.
- The new MLE of $\mu$ is
  $$\hat\mu^G(dx) = \frac{n\hat P^G(dx)}{\sum_{j=1}^{J} n_j\, \hat c_j^{-1}\, \bar q_j^{\,G}(x)},$$
  where $\hat P^G(A) = \mathrm{ave}_{g\in G}\,\hat P(gA)$ and $\bar q_j^{\,G}(x) = \mathrm{ave}_{g\in G}\, q_j(gx)$.
- When the draws are i.i.d. within each $p_s\,d\mu$,
  $$\hat\mu^G = E[\hat\mu \mid GX],$$
  i.e., the Rao-Blackwellization of $\hat\mu$ given the orbit.
- Consequently,
  $$\hat c_j^{\,G} = \int_\Gamma q_j(x)\,\hat\mu^G(dx) = E[\hat c_j \mid GX].$$
Using Groups to model trade-off

- If $G_1 \supseteq G_2$, then
  $$\mathrm{Var}\big(\hat c^{\,G_1}\big) \le \mathrm{Var}\big(\hat c^{\,G_2}\big).$$
- The statistical efficiency increases with the size of $G_i$, but so does the computational cost needed for function evaluation (but not for sampling, because there are no additional samples involved).
Linear submodel: stratified sampling (Tan 2004)

- Data: $\{X_{ij},\ 1 \le i \le n_j\} \overset{\text{i.i.d.}}{\sim} p_j(x)\,\mu(dx),\ 1 \le j \le J$.
- The sub-model has parameter space
  $$\Big\{\mu:\ \int_\Gamma p_j(x)\,\mu(dx),\ 1 \le j \le J,\ \text{are equal (to 1)}\Big\}.$$
- Likelihood for $\mu$: $L(\mu) = \prod_{j=1}^{J}\prod_{i=1}^{n_j} p_j(X_{ij})\,\mu(X_{ij})$
- The MLE is
  $$\hat\mu_{\mathrm{lin}}(dx) = \frac{\hat P(dx)}{\sum_{j=1}^{J} \hat\pi_j\, p_j(x)},$$
  where the $\hat\pi_j$'s are MLEs from a mixture model:
  $$\text{the data} \overset{\text{i.i.d.}}{\sim} \sum_{j=1}^{J} \pi_j\, p_j(\cdot) \quad\text{with the } \pi_j\text{'s unknown.}$$
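A sketch of the computational piece (my reconstruction; the function name mixture_weights_em is mine): the $\hat\pi_j$'s are the MLE of the mixture proportions, which a plain EM iteration delivers since the component densities $p_j$ are known.

    import numpy as np

    def mixture_weights_em(p_funcs, x_all, n_iter=500):
        """MLE of the mixture proportions pi_j in sum_j pi_j p_j(.), treating the
        pooled draws x_all as i.i.d. from the mixture (stratum labels ignored).
        p_funcs are the normalized densities p_j."""
        J = len(p_funcs)
        p = np.vstack([pj(x_all) for pj in p_funcs])       # p[j, k] = p_j(x_k)
        pi = np.full(J, 1.0 / J)
        for _ in range(n_iter):
            w = pi[:, None] * p                            # E-step: unnormalized responsibilities
            w /= w.sum(axis=0, keepdims=True)
            pi = w.mean(axis=1)                            # M-step: update proportions
        return pi

    # mu_hat_lin then puts mass (1/n) / sum_j pi_hat_j p_j(x_k) at each pooled draw x_k, so
    # int q d(mu_hat_lin) = (1/n) sum_k q(x_k) / sum_j pi_hat_j p_j(x_k)  (the "Lik" estimator below).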
So why MLE?

- Goal: to estimate $c = \int_\Gamma q(x)\,\mu(dx)$.
- For an arbitrary vector $b$, consider the control-variate estimator (Owen and Zhou 2000)
  $$\hat c_b \equiv \sum_{j=1}^{J}\sum_{i=1}^{n_j} \frac{q(x_{ji}) - b^{\top} g(x_{ji})}{\sum_{s=1}^{J} n_s\, p_s(x_{ji})},$$
  where $g = (p_2 - p_1, \ldots, p_J - p_1)^{\top}$.
- A more general class: for $\sum_{j=1}^{J}\lambda_j(x) \equiv 1$ and $\sum_{j=1}^{J}\lambda_j(x)\, b_j(x) \equiv b$, consider (Veach and Guibas 1995 for $b_j \equiv 0$; Tan, 2004)
  $$\hat c_{\lambda,B} = \sum_{j=1}^{J}\frac{1}{n_j}\sum_{i=1}^{n_j} \lambda_j(x_{ji})\, \frac{q(x_{ji}) - b_j^{\top}(x_{ji})\, g(x_{ji})}{p_j(x_{ji})}.$$
- Should $\hat c_{\lambda,B}$ be more efficient than $\hat c_b$? Could there be something even more efficient?
Three estimators for $c = \int_\Gamma q(x)\,\mu(dx)$:

- IS:
  $$\frac{1}{n}\sum_{i=1}^{n} \frac{q(x_i)}{\sum_{j=1}^{J} \pi_j\, p_j(x_i)},$$
  where $\pi_j = n_j/n$ are the true proportions.
- Reg:
  $$\frac{1}{n}\sum_{i=1}^{n} \frac{q(x_i) - \hat\beta^{\top} g(x_i)}{\sum_{j=1}^{J} \pi_j\, p_j(x_i)},$$
  where $\hat\beta$ is the estimated regression coefficient, ignoring stratification.
- Lik:
  $$\frac{1}{n}\sum_{i=1}^{n} \frac{q(x_i)}{\sum_{j=1}^{J} \hat\pi_j\, p_j(x_i)},$$
  where the $\hat\pi_j$'s are the estimated proportions, ignoring stratification.
- Which one is most efficient? Least efficient?
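A minimal Python sketch of the three estimators (my implementation choices: $\hat\beta$ by ordinary least squares of the IS integrand on the control variates, and $\hat\pi$ by the same label-free EM as in the earlier sketch; the name three_estimators is mine).

    import numpy as np

    def three_estimators(q, p_funcs, samples, em_iters=500):
        """'IS', 'Reg', and 'Lik' estimates of c = int q dmu from stratified draws
        samples[j] ~ p_j; q and the p_j's map an (N, d) array of draws to (N,) values."""
        n_j = np.array([len(s) for s in samples], dtype=float)
        x = np.concatenate(samples)
        n = x.shape[0]
        p = np.vstack([pj(x) for pj in p_funcs])            # p[j, i] = p_j(x_i)
        g = p[1:] - p[0]                                     # control variates g = (p_2 - p_1, ...)
        m_true = (n_j / n) @ p                               # mixture density, true proportions

        y = q(x) / m_true
        c_is = y.mean()                                      # IS (stratified) estimator

        Z = np.column_stack([np.ones(n), (g / m_true).T])    # regress y on the control variates
        beta = np.linalg.lstsq(Z, y, rcond=None)[0][1:]
        c_reg = np.mean((q(x) - beta @ g) / m_true)          # Reg (control variates)

        pi = np.full(len(p_funcs), 1.0 / len(p_funcs))       # EM for pi-hat, labels ignored
        for _ in range(em_iters):
            w = pi[:, None] * p
            w /= w.sum(axis=0, keepdims=True)
            pi = w.mean(axis=1)
        c_lik = np.mean(q(x) / (pi @ p))                     # Lik (constrained MLE)
        return c_is, c_reg, c_lik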
Let’s find out ...

- $\Gamma = \mathbb{R}^{10}$ and $\mu$ is Lebesgue measure.
- The integrand is
  $$q(x) = 0.8\prod_{j=1}^{10}\phi(x^j) + 0.2\prod_{j=1}^{10}\psi(x^j; 4),$$
  where $\phi(\cdot)$ is the standard normal density and $\psi(\cdot; 4)$ is the $t_4$ density.
- Two sampling designs:
    (i) $q_2(x)$ with $n$ draws, or
    (ii) $q_1(x)$ and $q_2(x)$ each with $n/2$ draws,
  where
  $$q_1(x) = \prod_{j=1}^{10}\phi(x^j), \qquad q_2(x) = \prod_{j=1}^{10}\psi(x^j; 1).$$
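A sketch of this design in Python (my reconstruction; scipy's norm, t, and cauchy densities stand in for $\phi$, $\psi(\cdot;4)$, and $\psi(\cdot;1)$). Design (ii) is shown; the pieces can be fed to the three_estimators sketch above.

    import numpy as np
    from scipy.stats import norm, t, cauchy

    rng = np.random.default_rng(2)
    d, n = 10, 500

    # Target integrand: q(x) = 0.8 * prod phi(x^j) + 0.2 * prod psi(x^j; 4), so the true c = 1.
    q  = lambda x: 0.8 * norm.pdf(x).prod(axis=1) + 0.2 * t.pdf(x, df=4).prod(axis=1)
    q1 = lambda x: norm.pdf(x).prod(axis=1)          # 10-dim standard normal density
    q2 = lambda x: cauchy.pdf(x).prod(axis=1)        # 10-dim product-Cauchy (t_1) density

    # Design (ii): n/2 draws from each of q1 and q2.
    x1 = rng.standard_normal((n // 2, d))
    x2 = rng.standard_t(df=1, size=(n // 2, d))      # t_1 = Cauchy
    samples = [x1, x2]

    # These pieces can be plugged into the three_estimators sketch above, e.g.:
    # c_is, c_reg, c_lik = three_estimators(q, [q1, q2], samples)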
A little surprise?



                       Table: Comparison of design and estimator

                              one sampler                    two samplers
                     IS        Reg        Lik         IS        Reg        Lik
    Sqrt MSE       .162     .00942     .00931      .0175     .00881     .00881
    Std Err        .162     .00919     .00920      .0174     .00885     .00884

    Note: Sqrt MSE is √(mean squared error of the point estimates) and
    Std Err is √(mean of the variance estimates), from 10000 repeated
    simulations of size n = 500. (A short sketch of these two summaries
    follows this slide.)




   Xiao-Li Meng (Harvard)              MCMC+likelihood                September 24, 2011   19 / 23
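
For completeness, a small sketch (assumed, simply mirroring the note under the
table) of the two summaries computed over the 10000 replications; the true value
is c = 1 here because both product densities integrate to one.

    import numpy as np

    def summarize(point_estimates, variance_estimates, c_true=1.0):
        point_estimates = np.asarray(point_estimates, dtype=float)
        sqrt_mse = np.sqrt(np.mean((point_estimates - c_true) ** 2))  # Sqrt MSE column
        std_err = np.sqrt(np.mean(np.asarray(variance_estimates)))    # Std Err column
        return sqrt_mse, std_err
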
Comparison of efficiency:




   Statistical efficiency: IS < Reg ≈ Lik
   IS is a stratified estimator, which uses only the labels.
    Reg is the conventional method of control variates (a sketch follows this slide).
    Lik is the constrained MLE, which uses the pj ’s but ignores the labels;
    it is exact if q = pj for any particular j.







  Xiao-Li Meng (Harvard)       MCMC+likelihood          September 24, 2011   20 / 23
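
A minimal sketch (my own illustration of the conventional control-variates recipe,
not code from the talk) of the Reg estimator for J = 2: under the pooled mixture
m = π1 p1 + π2 p2 the ratio (p2 − p1)/m integrates to zero, so it is a control
variate with known mean, and β̂ is the ordinary least-squares slope computed from
the pooled draws, ignoring stratification.

    import numpy as np

    def reg_estimator(q_vals, p1_vals, p2_vals, pi=(0.5, 0.5)):
        m = pi[0] * p1_vals + pi[1] * p2_vals          # pooled mixture density at the draws
        y = q_vals / m                                 # plain IS terms
        h = (p2_vals - p1_vals) / m                    # control variate with E[h] = 0
        beta = (np.sum((h - h.mean()) * (y - y.mean()))
                / np.sum((h - h.mean()) ** 2))         # OLS slope, ignoring stratification
        return np.mean(y - beta * h)                   # = (1/n) sum_i [q - beta*(p2 - p1)] / m
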
Building intuition ...




    Suppose we make n = 2 draws, one from N(0, 1) and one from
    Cauchy (0, 1), hence π1 = π2 = 50%.
    Suppose the draws are {1, 1}, what would be the MLE (π̂1 , π̂2 )?
    Suppose the draws are {1, 3}, what would be the MLE (π̂1 , π̂2 )?
    Suppose the draws are {3, 3}, what would be the MLE (π̂1 , π̂2 )?
    (A small numerical sketch follows this slide.)







   Xiao-Li Meng (Harvard)     MCMC+likelihood         September 24, 2011   21 / 23
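
One way to answer the three questions is to profile the two-draw likelihood
L(π1) = ∏i [π1 φ(xi) + (1 − π1) ψ(xi; 1)] over π1 ∈ [0, 1], with ψ(·; 1) the
Cauchy(0, 1) density; the grid search below is a minimal sketch of mine, not the talk's.

    import numpy as np
    from scipy import stats

    def pi1_mle(draws, grid=np.linspace(0.0, 1.0, 10001)):
        draws = np.asarray(draws, dtype=float)
        f_norm = stats.norm.pdf(draws)                 # N(0,1) density at each draw
        f_cauchy = stats.cauchy.pdf(draws)             # Cauchy(0,1) density at each draw
        loglik = [np.sum(np.log(p * f_norm + (1 - p) * f_cauchy)) for p in grid]
        return grid[int(np.argmax(loglik))]            # MLE of pi1; pi2-hat = 1 - pi1-hat

    for draws in ([1, 1], [1, 3], [3, 3]):
        print(draws, pi1_mle(draws))
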
What Did I Learn?




   Model what we ignore, not what we know!
   Model comparison/selection is not about which model is true (as all
   of them are “true”), but which model represents a better compromise
   among human, computational, and statistical efficiency.
    There is a cure for our “schizophrenia”: we can now analyze Monte
    Carlo data using the same sound statistical principles and methods we use
    for analyzing real data.







  Xiao-Li Meng (Harvard)      MCMC+likelihood         September 24, 2011   22 / 23
If you are looking for theoretical research topics ...




     RE-EXAMINE OLD ONES AND DERIVE NEW ONES!
             Prove it is the MLE, or a good approximation to the MLE.
             Or derive the MLE, or a cost-effective approximation to it.
    Markov chain Monte Carlo (Tan 2006, 2008)
    More ......







   Xiao-Li Meng (Harvard)           MCMC+likelihood           September 24, 2011   23 / 23
