Testing for mixture components
Christian P. Robert
U. Paris Dauphine & Warwick U.
Joint ongoing work with A. Hairault and J. Rousseau
BNP 13, Puerto Varas
EaRly Call
Incoming 2023-2030 ERC funding for postdoctoral collaborations with
• Michael Jordan (more Paris than Berkeley)
• Eric Moulines (Paris)
• Gareth Roberts (Warwick)
• myself (more Paris than Warwick)
Outline
1 Mixtures of distributions
2 Approximations to evidence
3 Dirichlet process mixtures
Mixtures of distributions
Convex combination of densities
x ∼ fj with probability pj, for j = 1, 2, . . . , k, with overall density
p1 f1(x) + · · · + pk fk(x)
Usual case: parameterised components
∑_{i=1}^k pi f(x|ϑi)   with   ∑_{i=1}^k pi = 1
where the weights pi are distinguished from the other parameters
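A minimal sketch of this definition, assuming Gaussian components and using NumPy/SciPy (function names and parameter values below are illustrative only):

```python
import numpy as np
from scipy.stats import norm

def mixture_density(x, weights, means, sds):
    """Evaluate the overall density  sum_i p_i f(x | theta_i)  for Gaussian components."""
    x = np.atleast_1d(x)[:, None]
    return np.sum(weights * norm.pdf(x, loc=means, scale=sds), axis=1)

def sample_mixture(n, weights, means, sds, rng=np.random.default_rng(0)):
    """Draw x ~ f_j with probability p_j, returning observations and latent allocations."""
    z = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[z], sds[z]), z

# toy three-component example
w, mu, sd = np.array([0.3, 0.5, 0.2]), np.array([-2.0, 0.0, 3.0]), np.array([1.0, 0.5, 1.0])
x, z = sample_mixture(500, w, mu, sd)
print(mixture_density(x[:5], w, mu, sd))
```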
Jeffreys priors for mixtures
True Jeffreys prior for mixtures of distributions defined through the Fisher information matrix
Eϑ[∇ log f(X|ϑ) ∇ log f(X|ϑ)ᵀ]
• matrix of order O(k)
• unavailable in closed form except in special cases
• unidimensional integrals approximated by Monte Carlo tools
[Grazian & X, 2015]
Difficulties
• complexity grows in O(k²)
• significant computing requirement (reduced by delayed acceptance)
[Banterle et al., 2014]
• differs from component-wise Jeffreys
[Diebolt & X, 1990; Stoneking, 2014]
• when is the posterior proper?
• how to check properness via MCMC outputs?
Further reference priors
Reparameterisation of a location-scale mixture in terms of its global mean µ and global variance σ² as
µi = µ + σαi and σi = στi, 1 ≤ i ≤ k
where τi > 0 and αi ∈ R
Induces a compact space on the other parameters:
∑_{i=1}^k pi αi = 0   and   ∑_{i=1}^k pi τi² + ∑_{i=1}^k pi αi² = 1
Posterior associated with the prior π(µ, σ) = 1/σ proper with Gaussian components if there are at least two observations in the sample
[Kamary, Lee & X, 2018]
Label switching paradox
• We should observe the exchangeability of the components [label switching] to conclude about convergence of the Gibbs sampler
• If observed, how should we estimate parameters?
• If unobserved, uncertainty about convergence
[Celeux, Hurn & X, 2000; Frühwirth-Schnatter, 2001, 2004]
[Unless adopting a point process perspective]
[Green, 2019]
Loss functions for mixture estimation
Global loss function based on the distance between predictives
L(ξ, ξ̂) = ∫_X fξ(x) log{ fξ(x) / fξ̂(x) } dx
eliminates the labelling effect
Similar solution for estimating clusters through the allocation variables
L(z, ẑ) = ∑_{i<j} [ I[zi=zj](1 − I[ẑi=ẑj]) + I[ẑi=ẑj](1 − I[zi=zj]) ]
[Celeux, Hurn & X, 2000]
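A small NumPy sketch of the allocation loss above, which counts the pairs (i, j) on which the two clusterings disagree and is therefore invariant to relabelling of the components:

```python
import numpy as np

def allocation_loss(z, z_hat):
    """Pairwise disagreement loss L(z, z_hat) between two allocation vectors."""
    z, z_hat = np.asarray(z), np.asarray(z_hat)
    same = z[:, None] == z[None, :]            # I[z_i = z_j]
    same_hat = z_hat[:, None] == z_hat[None, :]
    disagree = same ^ same_hat                 # exactly one of the two indicators holds
    iu = np.triu_indices(len(z), k=1)          # sum over pairs i < j
    return int(disagree[iu].sum())

z = [0, 0, 1, 1, 2]
print(allocation_loss(z, [1, 1, 0, 0, 2]))     # 0: same partition up to relabelling
print(allocation_loss(z, [0, 1, 1, 1, 2]))     # 3: the partitions genuinely differ
```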
Bayesian model comparison
Bayes factor consistent for selecting the number of components
[Ishwaran et al., 2001; Casella & Moreno, 2009; Chib & Kuffner, 2016]
Bayes factor consistent for testing parametric versus nonparametric alternatives
[Verdinelli & Wasserman, 1997; Dass & Lee, 2004; McVinish et al., 2009]
Consistent evidence for location DPM
Consistency of the Bayes factor comparing finite mixtures against a (location) Dirichlet process mixture
H0 : f0 ∈ MK vs. H1 : f0 ∉ MK
[Figure: log(BF) against sample size n; (left) data from a finite mixture; (right) data not from a finite mixture]
Consistent evidence for location DPM
Under generic assumptions, when x1, · · · , xn iid ∼ fP0 with
P0 = ∑_{j=1}^{k0} p0j δϑ0j
and a Dirichlet process DP(M, G0) prior on P, there exists t > 0 such that for all ε > 0
Pf0( mDP(x) > ε n^{−(k0−1+dk0+t)/2} ) = o(1)
Moreover there exists q ≥ 0 such that
ΠDP( ‖f0 − fP‖1 ≤ (log n)^q / √n | x ) = 1 + oPf0(1)
[Hairault, X & Rousseau, 2022]
Consistent evidence for location DPM
Assumption A1 [Regularity]
Assumption A2 [Strong identifiability]
Assumption A3 [Compactness]
Assumption A4 [Existence of DP random mean]
Assumption A5 [Truncation on support of M, e.g. truncated Gamma]
If fP0 ∈ Mk0 satisfies Assumptions A1–A3, then
mk0(y)/mDP(y) → ∞ under fP0
Moreover, for all k ≥ k0, if the Dirichlet parameter is α = η/k with η < kd/2, then
mk(y)/mDP(y) → ∞ under fP0
Outline
1 Mixtures of distributions
2 Approximations to evidence
3 Dirichlet process mixtures
Chib’s or candidate’s representation
Direct application of Bayes' theorem: given x ∼ fk(x|ϑk) and ϑk ∼ πk(ϑk),
Zk = mk(x) = fk(x|ϑk) πk(ϑk) / πk(ϑk|x)
Replace with an approximation to the posterior:
Ẑk = m̂k(x) = fk(x|ϑk*) πk(ϑk*) / π̂k(ϑk*|x)
[Besag, 1989; Chib, 1995]
Natural Rao-Blackwellisation
For a missing variable z as in mixture models, natural Rao-Blackwell (unbiased) estimate
π̂k(ϑk*|x) = (1/T) ∑_{t=1}^T πk(ϑk* | x, zk^{(t)})
where the zk^{(t)}'s are Gibbs-sampled latent variables
[Diebolt & X, 1990; Chib, 1995]
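A schematic sketch of the resulting Chib/candidate's estimate on the log scale, where log_lik, log_prior and cond_post_density are placeholder functions to be supplied for the model at hand (not names from the slides):

```python
import numpy as np

def chib_log_evidence(theta_star, x, z_draws, log_lik, log_prior, cond_post_density):
    """log m_k(x) ~ log f_k(x|theta*) + log pi_k(theta*) - log pi_hat_k(theta*|x),
    with pi_hat_k(theta*|x) = (1/T) sum_t pi_k(theta* | x, z^(t))."""
    ordinates = np.array([cond_post_density(theta_star, x, z) for z in z_draws])
    return log_lik(x, theta_star) + log_prior(theta_star) - np.log(ordinates.mean())
```

In practice ϑ* is taken as a high posterior density point (e.g. a MAP estimate, as in the galaxy example below) and the average is best computed with a log-sum-exp for numerical stability.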
Compensation for label switching
For mixture models, zk^{(t)} usually fails to visit all configurations in a balanced way, despite the symmetry predicted by the theory
Consequences on the numerical approximation, which may be biased by a factor of order k!
Compensation for label switching
For mixture models, zk^{(t)} usually fails to visit all configurations in a balanced way, despite the symmetry predicted by the theory
Recover the theoretical symmetry by using
π̃k(ϑk*|x) = 1/(T k!) ∑_{σ∈Sk} ∑_{t=1}^T πk(σ(ϑk*) | x, zk^{(t)})
for all σ's in Sk, the set of all permutations of {1, . . . , k}
[Neal, 1999; Berkhof, Mechelen & Gelman, 2003; Lee & X, 2018]
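A sketch of the permutation-averaged ordinate, enumerating the k! permutations explicitly (only sensible for moderate k); cond_post_density is again a placeholder for the complete-data posterior density and theta_star a list of component-wise parameters:

```python
import numpy as np
from itertools import permutations

def symmetrised_post_ordinate(theta_star, x, z_draws, cond_post_density, k):
    """pi_tilde_k(theta*|x) = 1/(T k!) sum_{sigma in S_k} sum_t pi_k(sigma(theta*) | x, z^(t))."""
    total, count = 0.0, 0
    for sigma in permutations(range(k)):
        theta_perm = [theta_star[j] for j in sigma]    # sigma(theta*): permuted components
        for z in z_draws:
            total += cond_post_density(theta_perm, x, z)
            count += 1
    return total / count                               # count = T * k!
```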
Astronomical illustration
Benchmark dataset of radial velocities of 82 galaxies
[Postman et al., 1986; Roeder, 1992; Raftery, 1996]
Conjugate priors
σk² ∼ Γ⁻¹(a0, b0),   µk|σk² ∼ N(µ0, σk²/λ0)
Galaxy dataset (k)
Using Chib's estimate with ϑk* as the MAP estimator,
log(Ẑk(x)) = −105.1396
for k = 3, while introducing permutations leads to
log(Ẑk(x)) = −103.3479
Note that −105.1396 + log(3!) = −103.3479

k        2        3        4        5        6        7        8
Zk(x)  −115.68  −103.35  −102.66  −101.93  −102.88  −105.48  −108.44

Estimates of the marginal likelihoods by the symmetrised Chib's approximation (based on 10⁵ Gibbs iterations and, for k > 5, 100 permutations selected at random in Sk)
[Lee et al., 2008]
Rethinking Chib’s solution
Alternate Rao–Blackwellisation by marginalising into partitions
Apply the candidate's/Chib's formula to a chosen partition C0:
mk(x) = fk(x|C0) πk(C0) / πk(C0|x)
with
πk(C(z)) = k!/(k − k+)! × Γ(∑_{j=1}^k αj) / Γ(∑_{j=1}^k αj + n) × ∏_{j=1}^k Γ(nj + αj)/Γ(αj)
where
C(z): partition of {1, . . . , n} induced by the cluster membership z
nj = ∑_{i=1}^n I{zi=j}: number of observations assigned to cluster j
k+ = ∑_{j=1}^k I{nj>0}: number of non-empty clusters
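A sketch of the (log) partition prior above for a Dirichlet(α1, . . . , αk) prior on the weights, using SciPy's gammaln:

```python
import numpy as np
from scipy.special import gammaln

def log_partition_prior(z, k, alpha):
    """log pi_k(C(z)) for an allocation vector z and Dirichlet parameters alpha (length k)."""
    z, alpha = np.asarray(z), np.asarray(alpha, dtype=float)
    n = len(z)
    nj = np.array([np.sum(z == j) for j in range(k)])   # cluster sizes n_j
    k_plus = int(np.sum(nj > 0))                        # number of non-empty clusters
    return (gammaln(k + 1) - gammaln(k - k_plus + 1)    # k!/(k - k_+)!
            + gammaln(alpha.sum()) - gammaln(alpha.sum() + n)
            + np.sum(gammaln(nj + alpha) - gammaln(alpha)))

print(log_partition_prior([0, 0, 1, 1, 1], k=3, alpha=[1.0, 1.0, 1.0]))
```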
Rethinking Chib’s solution
Under conjugate priors G0 on ϑ,
fk(x|C(z)) = ∏_{j=1}^k ∫_Θ ∏_{i: zi=j} f(xi|ϑ) G0(dϑ)   [each factor being m(Cj(z))]
and
π̂k(C0|x) = (1/T) ∑_{t=1}^T I{C0 ≡ C(z^{(t)})}
• considerably lower computational demand
• no label switching issue
More efficient sampling
Difficulty with the explosive number of terms in
π̃k(ϑk*|x) = 1/(T k!) ∑_{σ∈Sk} ∑_{t=1}^T πk(σ(ϑk*) | x, zk^{(t)})
when most terms are equal to zero...
Iterative bridge sampling:
Ê^{(t)}(k) = Ê^{(t−1)}(k) × [ M1⁻¹ ∑_{l=1}^{M1} π̂(ϑ̃l|x) / {M1 q(ϑ̃l) + M2 π̂(ϑ̃l|x)} ] / [ M2⁻¹ ∑_{m=1}^{M2} q(ϑ̂m) / {M1 q(ϑ̂m) + M2 π̂(ϑ̂m|x)} ]
[Frühwirth-Schnatter, 2004]
More efficient sampling
Iterative bridge sampling:
Ê^{(t)}(k) = Ê^{(t−1)}(k) × [ M1⁻¹ ∑_{l=1}^{M1} π̂(ϑ̃l|x) / {M1 q(ϑ̃l) + M2 π̂(ϑ̃l|x)} ] / [ M2⁻¹ ∑_{m=1}^{M2} q(ϑ̂m) / {M1 q(ϑ̂m) + M2 π̂(ϑ̂m|x)} ]
[Frühwirth-Schnatter, 2004]
where
q(ϑ) = (1/J1) ∑_{j=1}^{J1} p(ϑ|z^{(j)}) ∏_{i=1}^k p(ξi | ξ^{(j)}_{i'<i}, ξ^{(j−1)}_{i'>i}, z^{(j)}, x)
More efficient sampling
Iterative bridge sampling:
Ê^{(t)}(k) = Ê^{(t−1)}(k) × [ M1⁻¹ ∑_{l=1}^{M1} π̂(ϑ̃l|x) / {M1 q(ϑ̃l) + M2 π̂(ϑ̃l|x)} ] / [ M2⁻¹ ∑_{m=1}^{M2} q(ϑ̂m) / {M1 q(ϑ̂m) + M2 π̂(ϑ̂m|x)} ]
[Frühwirth-Schnatter, 2004]
or where
q(ϑ) = (1/k!) ∑_{σ∈S(k)} p(ϑ|σ(z°)) ∏_{i=1}^k p(ξi | σ(ξ°_{i'<i}), σ(ξ°_{i'>i}), σ(z°), x)
Sparsity for the sum
Contribution of each term relative to q(ϑ)
ησ(ϑ) = hσ(ϑ) / {k! q(ϑ)} = hσ(ϑ) / ∑_{σ'∈Sk} hσ'(ϑ)
and importance of permutation σi evaluated by
Ê_{hσc}[ησi(ϑ)] = (1/M) ∑_{l=1}^M ησi(ϑ^{(l)}),   ϑ^{(l)} ∼ hσc(ϑ)
Approximate set A(k) ⊆ S(k) consists of [σ1, · · · , σn] for the smallest n that satisfies the condition
φ̂n = (1/M) ∑_{l=1}^M |q̃n(ϑ^{(l)}) − q(ϑ^{(l)})| < τ
Dual importance sampling with approximation (DIS2A)
1. Randomly select {z^{(j)}, ϑ^{(j)}}_{j=1}^J from the Gibbs sample and un-switch; construct q(ϑ)
2. Choose hσc(ϑ) and generate particles {ϑ^{(t)}}_{t=1}^T ∼ hσc(ϑ)
3. Construct the approximation q̃(ϑ) using the first M-sample
   3.1 Compute Ê_{hσc}[ησ1(ϑ)], · · · , Ê_{hσc}[ησk!(ϑ)]
   3.2 Reorder the σ's such that Ê_{hσc}[ησ1(ϑ)] ≥ · · · ≥ Ê_{hσc}[ησk!(ϑ)]
   3.3 Initially set n = 1 and compute the q̃n(ϑ^{(t)})'s and φ̂n; if φ̂n < τ, go to Step 4, otherwise increase n to n + 1
4. Replace q(ϑ^{(1)}), . . . , q(ϑ^{(T)}) with q̃(ϑ^{(1)}), . . . , q̃(ϑ^{(T)}) to estimate Ê
[Lee & X, 2014]
Illustrations

Fishery data
k    k!    |A(k)|    ∆(A)
3     6    1.0000    0.1675
4    24    2.7333    0.1148

Galaxy data
k    k!    |A(k)|      ∆(A)
3     6    1.000       0.1675
4    24    15.7000     0.6545
6   720    298.1200    0.4146

Table: Mean estimates of the approximate set sizes |A(k)| and of the reduction rate ∆(A) in the number of evaluated h-terms for (a) the fishery and (b) the galaxy datasets
Sequential Monte Carlo
Tempered sequence of targets (t = 1, . . . , T)
πkt(ϑk) ∝ pkt(ϑk) = πk(ϑk) fk(x|ϑk)^{λt},   λ1 = 0 < · · · < λT = 1
particles (simulations) (i = 1, . . . , Nt)
ϑ_t^i i.i.d. ∼ πkt(ϑk)
usually obtained by an MCMC step
ϑ_t^i ∼ Kt(ϑ_{t−1}^i, ϑ)
with importance weights (i = 1, . . . , Nt)
ω_i^t = fk(x|ϑk)^{λt − λt−1}
[Del Moral et al., 2006; Buchholz et al., 2021]
Sequential Monte Carlo
Tempered sequence of targets (t = 1, . . . , T)
πkt(ϑk) ∝ pkt(ϑk) = πk(ϑk) fk(x|ϑk)^{λt},   λ1 = 0 < · · · < λT = 1
Produces an approximation of the evidence
Ẑk = ∏_t (1/Nt) ∑_{i=1}^{Nt} ω_i^t
[Del Moral et al., 2006; Buchholz et al., 2021]
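A compact sketch of this tempered SMC evidence estimator for a toy one-parameter Gaussian model, with multinomial resampling and a plain random-walk Metropolis move as Kt (a simplified stand-in for the adaptive SMC used in the experiments; all tuning values are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(1.0, 1.0, size=50)                        # toy data from a N(theta, 1) model
log_lik = lambda th: norm.logpdf(x[:, None], loc=th, scale=1.0).sum(axis=0)
log_prior = lambda th: norm.logpdf(th, 0.0, 10.0)

N, lambdas = 1000, np.linspace(0.0, 1.0, 21)             # particles and tempering ladder
theta = rng.normal(0.0, 10.0, size=N)                    # lambda_1 = 0: draws from the prior
log_Z = 0.0
for lam_prev, lam in zip(lambdas[:-1], lambdas[1:]):
    logw = (lam - lam_prev) * log_lik(theta)             # incremental weights omega_i^t
    log_Z += logw.max() + np.log(np.mean(np.exp(logw - logw.max())))
    w = np.exp(logw - logw.max()); w /= w.sum()
    theta = theta[rng.choice(N, size=N, p=w)]            # multinomial resampling
    for _ in range(3):                                   # MCMC moves targeting the tempered posterior
        prop = theta + 0.3 * rng.normal(size=N)
        log_acc = (log_prior(prop) + lam * log_lik(prop)
                   - log_prior(theta) - lam * log_lik(theta))
        theta = np.where(np.log(rng.uniform(size=N)) < log_acc, prop, theta)
print("log evidence estimate:", log_Z)
```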
Sequential importance sampling
For conjugate priors, (marginal) particle filter representation of a proposal:
π*(z|x) = π(z1|x1) ∏_{i=2}^n π(zi|x1:i, z1:i−1)
with importance weight
π(z|x) / π*(z|x) = π(x, z)/m(x) × m(x1)/π(z1, x1) × m(z1, x1, x2)/π(z1, x1, z2, x2) × · · · × π(z1:n−1, x)/π(z, x) = w(z, x)/m(x)
leading to the unbiased estimator of the evidence
Ẑk(x) = (1/T) ∑_{t=1}^T w(z^{(t)}, x)
[Kong, Liu & Wong, 1994; Carvalho et al., 2010]
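A sketch of such a sequential importance sampler for a deliberately simplified conjugate model (k Gaussian components with unit variances, N(0, τ0²) priors on the means and a symmetric Dirichlet(α) prior on the weights); allocations are proposed from their exact conditionals, so each particle's weight is the product of one-step-ahead predictives:

```python
import numpy as np
from scipy.stats import norm

def sis_log_evidence(x, k, alpha=1.0, tau0=10.0, T=200, rng=np.random.default_rng(2)):
    """SIS estimate of log m_k(x) for a toy conjugate Gaussian mixture."""
    n, logw = len(x), np.zeros(T)
    for t in range(T):
        counts, sums = np.zeros(k), np.zeros(k)
        for i in range(n):
            prec = 1.0 / tau0**2 + counts                 # per-cluster posterior precisions
            pred_mean, pred_sd = sums / prec, np.sqrt(1.0 + 1.0 / prec)
            prior_z = (counts + alpha) / (i + k * alpha)  # Dirichlet predictive of z_i
            probs = prior_z * norm.pdf(x[i], pred_mean, pred_sd)
            logw[t] += np.log(probs.sum())                # weight increment: predictive of x_i
            j = rng.choice(k, p=probs / probs.sum())      # propose z_i from pi(z_i | x_1:i, z_1:i-1)
            counts[j] += 1
            sums[j] += x[i]
    m = logw.max()
    return m + np.log(np.mean(np.exp(logw - m)))          # log (1/T) sum_t w(z^(t), x)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 40), rng.normal(2, 1, 40)])
print(sis_log_evidence(x, k=2))
```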
Galactic illustration
[Figure: boxplots of evidence estimates m(y) for the galaxy data with K = 3, 5, 6 and 8 components, comparing Chib, Chib with (random) permutations, bridge sampling, adaptive SMC, SIS, Chib on partitions, and the arithmetic mean]
Common illustration
[Figure: log(MSE) against log(time) for Chib on partitions, bridge sampling, Chib with permutations, SIS, and SMC]
Empirical conclusions
• Bridge sampling, the arithmetic mean and the original Chib's method fail to scale with the sample size n
• Chib's method on partitions increasingly variable
• Adaptive SMC ultimately fails
• SIS remains the most reliable method
1 Mixtures of distributions
2 Approximations to evidence
3 Dirichlet process mixtures
Dirichlet process mixture (DPM)
Extension to the k = ∞ (non-parametric) case
xi|zi, ϑ i.i.d. ∼ f(xi|ϑzi),   i = 1, . . . , n
P(Zi = k) = πk,   k = 1, 2, . . .
π1, π2, . . . ∼ GEM(M),   M ∼ π(M)
ϑ1, ϑ2, . . . i.i.d. ∼ G0
with GEM (Griffiths-Engen-McCloskey) defined by the stick-breaking representation
πk = vk ∏_{i=1}^{k−1} (1 − vi),   vi ∼ Beta(1, M)
[Sethuraman, 1994]
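A sketch of the stick-breaking construction, truncated at a large K for simulation purposes:

```python
import numpy as np

def stick_breaking(M, K=200, rng=np.random.default_rng(3)):
    """Truncated GEM(M) weights: pi_k = v_k * prod_{i<k} (1 - v_i), v_i ~ Beta(1, M)."""
    v = rng.beta(1.0, M, size=K)
    return v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))

pi = stick_breaking(M=1.5)
print(pi[:5], pi.sum())   # weights decay quickly; the truncated total mass is close to 1
```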
Dirichlet process mixture (DPM)
Resulting in an infinite mixture
x ∼ ∏_{i=1}^n ∑_{k=1}^∞ πk f(xi|ϑk)
with (prior) cluster allocation
π(z|M) = Γ(M)/Γ(M + n) × M^{K+} ∏_{j=1}^{K+} Γ(nj)
and conditional likelihood
p(x|z, M) = ∏_{j=1}^{K+} ∫ ∏_{i:zi=j} f(xi|ϑj) dG0(ϑj)
available in closed form when G0 is conjugate
Approximating the evidence
Extension of Chib's formula by marginalising over z and ϑ:
mDP(x) = p(x|M*, G0) π(M*) / π(M*|x)
using the estimate
π̂(M*|x) = (1/T) ∑_{t=1}^T π(M* | x, η^{(t)}, K+^{(t)})
provided the prior on M is a Γ(a, b) distribution, since
M|x, η, K+ ∼ ω Γ(a + K+, b − log(η)) + (1 − ω) Γ(a + K+ − 1, b − log(η))
with ω = (a + K+ − 1)/{n(b − log(η)) + a + K+ − 1} and η|x, M ∼ Beta(M + 1, n)
[Basu & Chib, 2003]
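A sketch of the corresponding Gibbs step for M, i.e. the Escobar–West auxiliary-variable update with a Γ(a, b) prior (rate parameterisation); the a, b, K+ and n values below are illustrative only:

```python
import numpy as np

def update_M(K_plus, n, a, b, M_current, rng=np.random.default_rng(4)):
    """One Gibbs update of the DP mass parameter M given K_plus non-empty clusters."""
    eta = rng.beta(M_current + 1.0, n)                        # eta | x, M ~ Beta(M + 1, n)
    w = (a + K_plus - 1.0) / (n * (b - np.log(eta)) + a + K_plus - 1.0)
    shape = a + K_plus if rng.uniform() < w else a + K_plus - 1.0
    return rng.gamma(shape, 1.0 / (b - np.log(eta))), eta     # mixture of two Gamma draws

M, draws = 1.0, []
for _ in range(1000):
    M, eta = update_M(K_plus=6, n=82, a=2.0, b=4.0, M_current=M)
    draws.append(M)
print(np.mean(draws))
```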
Approximating the likelihood
Intractable likelihood p(x|M*, G0) approximated by sequential importance sampling
Generating z from the proposal
π*(z|x, M) = ∏_{i=1}^n π(zi|x1:i, z1:i−1, M)
and using the approximation
L̂(x|M*, G0) = (1/T) ∑_{t=1}^T p(x1|z1^{(t)}, G0) ∏_{i=2}^n p(xi | x1:i−1, z1:i−1^{(t)}, G0)
[Kong, Liu & Wong, 1994; Basu & Chib, 2003]
Approximating the evidence (bis)
Reverse logistic regression applies to the DPM:
importance functions
π1(z, M) := π*(z|x, M) π(M)   and   π2(z, M) = π(z, M|x), known up to the normalising constant m(y)
{z^{(1,j)}, M^{(1,j)}}_{j=1}^T and {z^{(2,j)}, M^{(2,j)}}_{j=1}^T samples from π1 and π2
marginal likelihood m(y) estimated as the intercept of a logistic regression with covariate
log{π1(z, M)/π̃2(z, M)}
[Geyer, 1994; Chen & Shao, 1997]
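One simple way to implement this estimator is to fix the coefficient of the covariate at −1 and fit only the intercept; with equal sample sizes from π1 and π2, log m(y) is then minus the fitted intercept. A sketch of this variant (an assumption on my part, not the exact scheme of the slides), checked on a toy example where the true normalising constant is known:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def rlr_log_evidence(log_pi1_s1, log_q2_s1, log_pi1_s2, log_q2_s2):
    """Reverse-logistic-regression estimate of log m(y).

    Arguments are log pi1 and log of the *unnormalised* pi2, evaluated at the
    samples drawn from pi1 (s1) and from pi2 (s2); equal sample sizes assumed."""
    u = np.concatenate([log_pi1_s1 - log_q2_s1, log_pi1_s2 - log_q2_s2])
    y = np.concatenate([np.zeros(len(log_pi1_s1)), np.ones(len(log_pi1_s2))])

    def neg_loglik(beta0):                     # logistic model: logit P(y = 1) = beta0 - u
        eta = beta0 - u
        return np.sum(np.logaddexp(0.0, eta)) - np.sum(eta[y == 1])

    beta0 = minimize_scalar(neg_loglik).x
    return -beta0                              # with n1 = n2, log m(y) = -intercept

# toy check: pi1 = N(0, 1), unnormalised pi2 = exp(-v^2/2), so m = sqrt(2*pi)
rng = np.random.default_rng(5)
s1, s2 = rng.normal(size=5000), rng.normal(size=5000)
lp1 = lambda v: -0.5 * v**2 - 0.5 * np.log(2 * np.pi)
lq2 = lambda v: -0.5 * v**2
print(rlr_log_evidence(lp1(s1), lq2(s1), lp1(s2), lq2(s2)), 0.5 * np.log(2 * np.pi))
```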
Galactic illustration
[Figure: boxplots of evidence estimates m(x) for the DPM on subsets of the galaxy data with n = 6, 36 and 82, comparing the arithmetic mean, harmonic mean, Chib, RLR-SIS and RLR-Prior estimators]
Galactic illustration
[Figure: log(MSE) against log(time) for Chib, RLR-SIS and RLR-Prior]
Many thanks, and see you soon!
[arXiv:2205.05416]