How many mixture components?
Christian P. Robert
U. Paris Dauphine & Warwick U.
Joint work with A. Hairault and J. Rousseau
Festschrift for Sylvia, Cambridge, May 12, 2023
An early meet[ing]
ESF HSSS, CIRM, 1995
Early meet[ing]s
Outline
1 Mixtures of distributions
2 Approximations to evidence
3 Distributed evidence evaluation
4 Dirichlet process mixtures
The reference
Mixtures of distributions
Convex combination of densities
$x \sim f_j$ with probability $p_j$, for $j = 1, 2, \ldots, k$, with overall density
$$p_1 f_1(x) + \cdots + p_k f_k(x)\,.$$
Usual case: parameterised components
$$\sum_{i=1}^{k} p_i\, f(x\mid\vartheta_i) \qquad\text{with}\qquad \sum_{i=1}^{k} p_i = 1$$
where the weights $p_i$ are distinguished from the other parameters
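As a minimal numerical sketch of this definition (my own illustration, with assumed Gaussian components and arbitrary weights), the overall density is just the weighted sum of the component densities:

```python
# Minimal sketch (assumed example): evaluating a k-component Gaussian mixture
# density p_1 f(x|theta_1) + ... + p_k f(x|theta_k) at a few points.
import numpy as np
from scipy.stats import norm

weights = np.array([0.3, 0.5, 0.2])                 # p_1, ..., p_k, summing to one
means, sds = np.array([-1.0, 0.0, 3.0]), np.array([0.5, 1.0, 0.7])

def mixture_density(x):
    # sum_i p_i f(x | vartheta_i), evaluated pointwise by broadcasting
    comp = norm.pdf(np.asarray(x, dtype=float)[..., None], loc=means, scale=sds)
    return comp @ weights

print(mixture_density([-1.0, 0.0, 2.5]))
```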
Jeffreys priors for mixtures
True Jeffreys prior for mixtures of distributions defined as
$$\Big|\,\mathbb{E}_\vartheta\big[\nabla^{\mathsf T}\nabla\log f(X\mid\vartheta)\big]\Big|^{1/2}$$
• $O(k)$ matrix
• unavailable in closed form except in special cases
• unidimensional integrals approximated by Monte Carlo tools
[Grazian & X, 2015]
Difficulties
• complexity grows in $O(k^2)$
• significant computing requirement (reduced by delayed acceptance)
[Banterle et al., 2014]
• differs from component-wise Jeffreys
[Diebolt & X, 1990; Stoneking, 2014]
• when is the posterior proper?
• how to check properness via MCMC outputs?
Further reference priors
Reparameterisation of a location-scale mixture in terms of its global mean $\mu$ and global variance $\sigma^2$ as
$$\mu_i = \mu + \sigma\alpha_i \quad\text{and}\quad \sigma_i = \sigma\tau_i \qquad 1 \le i \le k$$
where $\tau_i > 0$ and $\alpha_i \in \mathbb{R}$
Induces a compact space on the other parameters:
$$\sum_{i=1}^{k} p_i\alpha_i = 0 \qquad\text{and}\qquad \sum_{i=1}^{k} p_i\tau_i^2 + \sum_{i=1}^{k} p_i\alpha_i^2 = 1$$
Posterior associated with the prior $\pi(\mu, \sigma) = 1/\sigma$ is proper with Gaussian components if there are at least two observations in the sample
[Kamary, Lee & X, 2018]
Label switching paradox
• Under exchangeability, should observe exchangeability of the components [label switching] to conclude about convergence of the Gibbs sampler
• If observed, how should we estimate parameters?
• If unobserved, uncertainty about convergence
[Celeux, Hurn & X, 2000; Frühwirth-Schnatter, 2001, 2004; Jasra et al., 2005]
[Unless adopting a point process perspective]
[Green, 2019]
The [s]Witcher
Loss functions for mixture estimation
Global loss function that considers the distance between predictives
$$\mathcal{L}(\xi, \hat\xi) = \int_{\mathcal{X}} f_\xi(x)\,\log\big\{f_\xi(x)/f_{\hat\xi}(x)\big\}\,\mathrm{d}x$$
eliminates the labelling effect
Similar solution for estimating clusters through allocation variables
$$\mathcal{L}(z, \hat z) = \sum_{i<j}\Big[\mathbb{I}_{[z_i=z_j]}\big(1 - \mathbb{I}_{[\hat z_i=\hat z_j]}\big) + \mathbb{I}_{[\hat z_i=\hat z_j]}\big(1 - \mathbb{I}_{[z_i=z_j]}\big)\Big]\,.$$
[Celeux, Hurn & X, 2000]
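As a small sketch (my own illustration, with made-up allocation vectors), the allocation-based loss simply counts the pairs of observations on which the two clusterings disagree, which makes it invariant to relabelling:

```python
# Sketch of the pairwise allocation loss L(z, zhat): number of pairs (i, j),
# i < j, clustered together under one allocation but not under the other.
import numpy as np

def allocation_loss(z, zhat):
    z, zhat = np.asarray(z), np.asarray(zhat)
    same = z[:, None] == z[None, :]                 # I[z_i = z_j]
    same_hat = zhat[:, None] == zhat[None, :]       # I[zhat_i = zhat_j]
    iu = np.triu_indices(len(z), k=1)               # pairs with i < j only
    return int((same[iu] != same_hat[iu]).sum())

print(allocation_loss([1, 1, 2, 2], [2, 2, 1, 1]))  # 0: relabelling costs nothing
print(allocation_loss([1, 1, 2, 2], [1, 2, 1, 2]))  # 4: genuinely different clustering
```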
Bayesian model comparison
Bayes Factor consistent for selecting the number of components
[Ishwaran et al., 2001; Casella & Moreno, 2009; Chib & Kuffner, 2016]
Bayes Factor consistent for testing parametric versus nonparametric alternatives
[Verdinelli & Wasserman, 1997; Dass & Lee, 2004; McVinish et al., 2009]
Consistent evidence for location DPM
Consistency of Bayes factor comparing finite mixtures against
(location) Dirichlet Process Mixture
$$H_0 : f_0 \in \mathcal{M}_K \quad\text{vs.}\quad H_1 : f_0 \notin \mathcal{M}_K$$
[Figure: log(BF) against sample size n; (left) finite mixture; (right) not a finite mixture]
Consistent evidence for location DPM
Under generic assumptions, when $x_1, \ldots, x_n$ iid $f_{P_0}$ with
$$P_0 = \sum_{j=1}^{k_0} p_j^0\,\delta_{\vartheta_j^0}$$
and a Dirichlet process DP$(M, G_0)$ prior on $P$, there exists $t > 0$ such that for all $\varepsilon > 0$
$$P_{f_0}\Big(m_{DP}(x) > n^{-(k_0-1+dk_0+t)/2}\Big) = o(1)$$
Moreover there exists $q \ge 0$ such that
$$\Pi_{DP}\Big(\|f_0 - f_P\|_1 \le \frac{(\log n)^q}{\sqrt{n}}\,\Big|\,x\Big) = 1 + o_{P_{f_0}}(1)$$
[Hairault, X & Rousseau, 2022]
Consistent evidence for location DPM
Assumption A1 [Regularity]
Assumption A2 [Strong identifiability]
Assumption A3 [Compactness]
Assumption A4 [Existence of DP random mean]
Assumption A5 [Truncated support of M, e.g. truncated Gamma]
If $f_{P_0} \in \mathcal{M}_{k_0}$ satisfies Assumptions A1–A3, then
$$m_{k_0}(y)/m_{DP}(y) \to \infty \quad\text{under } f_{P_0}$$
Moreover, for all $k \ge k_0$, if the Dirichlet parameter $\alpha = \eta/k$ and $\eta < kd/2$, then
$$m_k(y)/m_{DP}(y) \to \infty \quad\text{under } f_{P_0}$$
Outline
1 Mixtures of distributions
2 Approximations to evidence
3 Distributed evidence evaluation
4 Dirichlet process mixtures
Chib’s or candidate’s representation
Direct application of Bayes' theorem: given $x \sim f_k(x\mid\vartheta_k)$ and $\vartheta_k \sim \pi_k(\vartheta_k)$,
$$Z_k = m_k(x) = \frac{f_k(x\mid\vartheta_k)\,\pi_k(\vartheta_k)}{\pi_k(\vartheta_k\mid x)}$$
Replace with an approximation to the posterior
$$\widehat{Z}_k = \widehat{m}_k(x) = \frac{f_k(x\mid\vartheta_k^*)\,\pi_k(\vartheta_k^*)}{\hat\pi_k(\vartheta_k^*\mid x)}\,.$$
[Besag, 1989; Chib, 1995]
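In sketch form (with `loglik`, `logprior` and `logpost_hat` as assumed user-supplied callables, the last one being whatever posterior-density approximation is plugged in), the candidate's identity is a one-liner on the log scale:

```python
# Sketch of the candidate's / Chib's identity on the log scale:
# log Zk = log fk(x | theta*) + log pik(theta*) - log hat(pik)(theta* | x),
# evaluated at any convenient theta*, e.g. the MAP estimate.
def log_evidence_chib(theta_star, loglik, logprior, logpost_hat):
    return loglik(theta_star) + logprior(theta_star) - logpost_hat(theta_star)
```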
Natural Rao-Blackwellisation
For missing variable $z$ as in mixture models, natural Rao–Blackwell (unbiased) estimate
$$\hat\pi_k(\vartheta_k^*\mid x) = \frac{1}{T}\sum_{t=1}^{T}\pi_k\big(\vartheta_k^*\mid x, z_k^{(t)}\big)\,,$$
where the $z_k^{(t)}$'s are Gibbs-sampled latent variables
[Diebolt & X, 1990; Chib, 1995]
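A sketch of this Rao–Blackwell estimate (assuming a closed-form conditional density `cond_logpdf(theta_star, z)`, as is available under conjugate priors), averaging over the Gibbs output on the log scale:

```python
# Sketch of the Rao-Blackwell estimate of pik(theta* | x): the average of the
# conditional posterior density pik(theta* | x, z) over Gibbs draws z^(t).
import numpy as np
from scipy.special import logsumexp

def rb_log_posterior_density(theta_star, z_draws, cond_logpdf):
    logs = np.array([cond_logpdf(theta_star, z) for z in z_draws])
    return logsumexp(logs) - np.log(len(logs))      # log of the T-term average
```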
Compensation for label switching
For mixture models, $z_k^{(t)}$ usually fails to visit all configurations, despite the symmetry predicted by theory
Consequences on the numerical approximation, biased by an order $k!$
Compensation for label switching
Force the predicted theoretical symmetry by using
$$\tilde\pi_k(\vartheta_k^*\mid x) = \frac{1}{T\,k!}\sum_{\sigma\in\mathfrak{S}_k}\sum_{t=1}^{T}\pi_k\big(\sigma(\vartheta_k^*)\mid x, z_k^{(t)}\big)\,.$$
for all $\sigma$'s in $\mathfrak{S}_k$, the set of all permutations of $\{1, \ldots, k\}$
[Neal, 1999; Berkhof, Mechelen & Gelman, 2003; Lee & X, 2018]
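A sketch of the symmetrised estimate, enumerating $\mathfrak{S}_k$ with itertools (so only practical for small k); `cond_logpdf` and the helper `permute`, which applies σ to the component labels of ϑ*, are assumptions of mine:

```python
# Sketch of the symmetrised Rao-Blackwell estimate: average over all k!
# relabellings of theta* as well as over the Gibbs draws z^(t).
import numpy as np
from itertools import permutations
from scipy.special import logsumexp

def symmetrised_log_posterior_density(theta_star, z_draws, cond_logpdf, permute, k):
    logs = []
    for sigma in permutations(range(k)):            # all k! permutations of labels
        sig_theta = permute(theta_star, sigma)      # assumed relabelling helper
        logs.extend(cond_logpdf(sig_theta, z) for z in z_draws)
    return logsumexp(np.array(logs)) - np.log(len(logs))
```

When the Gibbs sampler never leaves a single labelling, this correction shifts the log evidence by about log k!, which is exactly the −105.14 + log 3! = −103.35 pattern seen in the galaxy example below.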
Astronomical illustration
Benchmark dataset of radial velocities of 82 galaxies
[Postman et al., 1986; Roeder, 1992; Raftery, 1996]
Conjugate priors
$$\sigma_k^2 \sim \Gamma^{-1}(a_0, b_0) \qquad \mu_k\mid\sigma_k^2 \sim \mathcal{N}(\mu_0, \sigma_k^2/\lambda_0)$$
Galaxy dataset (k)
Using Chib's estimate, with $\vartheta_k^*$ the MAP estimator,
$$\log(\widehat{Z}_k(x)) = -105.1396$$
for $k = 3$, while introducing permutations leads to
$$\log(\widehat{Z}_k(x)) = -103.3479$$
Note that
$$-105.1396 + \log(3!) = -103.3479$$

k        2        3        4        5        6        7        8
Zk(x)    -115.68  -103.35  -102.66  -101.93  -102.88  -105.48  -108.44

Estimates of the marginal likelihoods by the symmetrised Chib's approximation (based on $10^5$ Gibbs iterations and, for $k > 5$, 100 permutations selected at random in $\mathfrak{S}_k$).
[Lee et al., 2008]
RJMCMC outcome
Rethinking Chib’s solution
Alternative Rao–Blackwellisation by marginalising into partitions
Apply the candidate's/Chib's formula to a chosen partition:
$$m_k(x) = \frac{f_k(x\mid C_0)\,\pi_k(C_0)}{\pi_k(C_0\mid x)}$$
with
$$\pi_k(C(z)) = \frac{k!}{(k-k_+)!}\,\frac{\Gamma\big(\sum_{j=1}^{k}\alpha_j\big)}{\Gamma\big(\sum_{j=1}^{k}\alpha_j + n\big)}\,\prod_{j=1}^{k}\frac{\Gamma(n_j+\alpha_j)}{\Gamma(\alpha_j)}$$
where $C(z)$ is the partition of $\{1, \ldots, n\}$ induced by the cluster membership $z$,
$n_j = \sum_{i=1}^{n}\mathbb{I}_{\{z_i=j\}}$ the number of observations assigned to cluster $j$, and
$k_+ = \sum_{j=1}^{k}\mathbb{I}_{\{n_j>0\}}$ the number of non-empty clusters
Rethinking Chib’s solution
Under conjugate prior $G_0$ on $\vartheta$,
$$f_k(x\mid C(z)) = \prod_{j=1}^{k}\underbrace{\int_{\Theta}\prod_{i:z_i=j} f(x_i\mid\vartheta)\,G_0(\mathrm{d}\vartheta)}_{m(C_j(z))}$$
and
$$\hat\pi_k(C_0\mid x) = \frac{1}{T}\sum_{t=1}^{T}\mathbb{I}_{C_0\equiv C(z^{(t)})}$$
• considerably lower computational demand
• no label switching issue
• further Rao-Blackwellisation?
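A sketch of the two ingredients (assuming a Dirichlet weight vector α and, for the estimate itself, stored Gibbs allocations; the within-cluster marginals m(C_j(z)) would be computed separately under the conjugate G₀):

```python
# Sketch of the partition-based Chib approach: closed-form log prior of the
# partition induced by z, and empirical frequency of a chosen partition C0
# among the Gibbs allocations, giving hat(pik)(C0 | x).
import numpy as np
from scipy.special import gammaln

def log_partition_prior(counts, alpha, k):
    # counts: occupancy numbers n_1, ..., n_k (zeros allowed); alpha: Dirichlet weights
    counts, alpha = np.asarray(counts, float), np.asarray(alpha, float)
    k_plus, n = int((counts > 0).sum()), counts.sum()
    return (gammaln(k + 1) - gammaln(k - k_plus + 1)            # k! / (k - k_+)!
            + gammaln(alpha.sum()) - gammaln(alpha.sum() + n)
            + np.sum(gammaln(counts + alpha) - gammaln(alpha)))

def partition_of(z):
    # label-free partition of {1, ..., n} induced by the allocation vector z
    blocks = {}
    for i, zi in enumerate(z):
        blocks.setdefault(zi, []).append(i)
    return frozenset(frozenset(b) for b in blocks.values())

def log_partition_posterior_estimate(C0, z_draws):
    hits = sum(partition_of(z) == C0 for z in z_draws)
    return np.log(hits) - np.log(len(z_draws))                  # log hat(pik)(C0 | x)
```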
More efficient sampling
Iterative bridge sampling:
$$\hat{e}^{(t)}(k) = \hat{e}^{(t-1)}(k)\;\frac{M_1^{-1}\sum_{l=1}^{M_1}\dfrac{\hat\pi(\tilde\vartheta_l\mid x)}{M_1 q(\tilde\vartheta_l) + M_2\,\hat\pi(\tilde\vartheta_l\mid x)}}{M_2^{-1}\sum_{m=1}^{M_2}\dfrac{q(\hat\vartheta_m)}{M_1 q(\hat\vartheta_m) + M_2\,\hat\pi(\hat\vartheta_m\mid x)}}$$
[Frühwirth-Schnatter, 2004]
where
$$\tilde\vartheta_l \sim q(\vartheta)\qquad\text{and}\qquad \hat\vartheta_m \sim \pi(\vartheta\mid x)$$
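For concreteness, here is a sketch of the standard iterative bridge-sampling recursion that the update above instantiates (inputs are precomputed log evaluations of the unnormalised posterior and of the proposal q at the two samples, as numpy arrays; this is my own generic version, not Frühwirth-Schnatter's code):

```python
# Sketch of iterative bridge sampling for Z = int p_u(theta) dtheta, with p_u the
# unnormalised posterior and q a normalised proposal.  The four arrays hold log
# evaluations at M_q draws from q and at M_p draws from the posterior.
import numpy as np

def bridge_log_evidence(log_pu_at_q, log_q_at_q, log_pu_at_p, log_q_at_p, n_iter=100):
    M_q, M_p = len(log_q_at_q), len(log_q_at_p)
    a_p, a_q = M_p / (M_p + M_q), M_q / (M_p + M_q)
    shift = max(log_pu_at_q.max(), log_pu_at_p.max())        # recentre to avoid overflow
    pu_q, pu_p = np.exp(log_pu_at_q - shift), np.exp(log_pu_at_p - shift)
    q_q, q_p = np.exp(log_q_at_q), np.exp(log_q_at_p)
    z = 1.0                                                  # current estimate of Z / e^shift
    for _ in range(n_iter):
        num = np.mean(pu_q / (a_p * pu_q + a_q * z * q_q))   # average over proposal draws
        den = np.mean(q_p / (a_p * pu_p + a_q * z * q_p))    # average over posterior draws
        z = num / den
    return np.log(z) + shift
```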
More efficient sampling
Iterative bridge sampling as above, now with
$$q(\vartheta) = \frac{1}{J_1}\sum_{j=1}^{J_1} p\big(\lambda\mid z^{(j)}\big)\prod_{i=1}^{k} p\big(\xi_i\mid \xi^{(j)}_{i'<i}, \xi^{(j-1)}_{i'>i}, z^{(j)}, x\big)$$
[Frühwirth-Schnatter, 2004]
More efficient sampling
Iterative bridge sampling as above, now with the symmetrised
$$q(\vartheta) = \frac{1}{k!}\sum_{\sigma\in\mathfrak{S}_k} p\big(\lambda\mid\sigma(z^{o})\big)\prod_{i=1}^{k} p\big(\xi_i\mid \sigma(\xi^{o}_{i'<i}), \sigma(\xi^{o}_{i'>i}), \sigma(z^{o}), x\big)$$
[Frühwirth-Schnatter, 2004]
Sparsity for permutations
Contribution of each term relative to $q(\vartheta)$
$$\eta_\sigma(\vartheta) = \frac{h_\sigma(\vartheta)}{k!\,q(\vartheta)} = \frac{h_\sigma(\vartheta)}{\sum_{\sigma'\in\mathfrak{S}_k} h_{\sigma'}(\vartheta)}$$
and importance of permutation $\sigma_i$ evaluated by
$$\widehat{\mathbb{E}}_{h_{\sigma_c}}[\eta_{\sigma_i}(\vartheta)] = \frac{1}{M}\sum_{l=1}^{M}\eta_{\sigma_i}\big(\vartheta^{(l)}\big)\,,\qquad \vartheta^{(l)} \sim h_{\sigma_c}(\vartheta)$$
Approximate set $A(k) \subseteq S(k)$ consists of $[\sigma_1, \cdots, \sigma_n]$ for the smallest $n$ that satisfies the condition
$$\hat\varphi_n = \frac{1}{M}\sum_{l=1}^{M}\big|\tilde{q}_n(\vartheta^{(l)}) - q(\vartheta^{(l)})\big| < \tau$$
dual importance sampling with approximation
DIS2A
1 Randomly select $\{z^{(j)}, \vartheta^{(j)}\}_{j=1}^{J}$ from the Gibbs sample and un-switch; construct $q(\vartheta)$
2 Choose $h_{\sigma_c}(\vartheta)$ and generate particles $\{\vartheta^{(t)}\}_{t=1}^{T} \sim h_{\sigma_c}(\vartheta)$
3 Construct the approximation $\tilde{q}(\vartheta)$ using the first $M$-sample
3.1 Compute $\widehat{\mathbb{E}}_{h_{\sigma_c}}[\eta_{\sigma_1}(\vartheta)], \cdots, \widehat{\mathbb{E}}_{h_{\sigma_c}}[\eta_{\sigma_{k!}}(\vartheta)]$
3.2 Reorder the $\sigma$'s such that $\widehat{\mathbb{E}}_{h_{\sigma_c}}[\eta_{\sigma_1}(\vartheta)] \ge \cdots \ge \widehat{\mathbb{E}}_{h_{\sigma_c}}[\eta_{\sigma_{k!}}(\vartheta)]$
3.3 Initially set $n = 1$ and compute the $\tilde{q}_n(\vartheta^{(t)})$'s and $\hat\varphi_n$. If $\hat\varphi_n < \tau$, go to Step 4; otherwise increase $n$ to $n+1$
4 Replace $q(\vartheta^{(1)}), \ldots, q(\vartheta^{(T)})$ with $\tilde{q}(\vartheta^{(1)}), \ldots, \tilde{q}(\vartheta^{(T)})$ to estimate $\widehat{\mathbb{E}}$
[Lee & X, 2014]
Illustrations

Fishery data
k   k!    |A(k)|     ∆(A)
3   6     1.0000     0.1675
4   24    2.7333     0.1148

Galaxy data
k   k!    |A(k)|     ∆(A)
3   6     1.000      0.1675
4   24    15.7000    0.6545
6   720   298.1200   0.4146

Table: Mean estimates of the approximate set sizes |A(k)| and of the reduction rate ∆(A) in the number of evaluated h-terms for (a) the fishery and (b) the galaxy datasets
Sequential Monte Carlo
Tempered sequence of targets ($t = 1, \ldots, T$)
$$\pi_{kt}(\vartheta_k) \propto p_{kt}(\vartheta_k) = \pi_k(\vartheta_k)\,f_k(x\mid\vartheta_k)^{\lambda_t}\qquad \lambda_1 = 0 < \cdots < \lambda_T = 1$$
particles (simulations) ($i = 1, \ldots, N_t$)
$$\vartheta_t^i \overset{\text{i.i.d.}}{\sim} \pi_{kt}(\vartheta_k)$$
usually obtained by an MCMC step
$$\vartheta_t^i \sim K_t(\vartheta_{t-1}^i, \vartheta)$$
with importance weights ($i = 1, \ldots, N_t$)
$$\omega_t^i = f_k(x\mid\vartheta_k)^{\lambda_t - \lambda_{t-1}}$$
[Del Moral et al., 2006; Buchholz et al., 2021]
Sequential Monte Carlo
The tempered sequence of targets above produces an approximation of the evidence
$$\widehat{Z}_k = \prod_t \frac{1}{N_t}\sum_{i=1}^{N_t}\omega_t^i$$
[Del Moral et al., 2006; Buchholz et al., 2021]
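A sketch of the corresponding evidence computation (the log-likelihood, the MCMC move and the resampling scheme are assumed user-supplied; this only illustrates how the incremental weights accumulate into $\widehat{Z}_k$):

```python
# Sketch of evidence estimation by tempered SMC: each temperature increment
# contributes the average incremental weight f_k(x|theta)^(lambda_t - lambda_{t-1}).
import numpy as np
from scipy.special import logsumexp

def smc_log_evidence(particles, loglik, mcmc_move, resample, lambdas):
    log_z = 0.0
    for lam_prev, lam in zip(lambdas[:-1], lambdas[1:]):
        log_w = (lam - lam_prev) * np.array([loglik(th) for th in particles])
        log_z += logsumexp(log_w) - np.log(len(particles))   # log of the mean weight
        particles = resample(particles, log_w)               # e.g. systematic resampling
        particles = [mcmc_move(th, lam) for th in particles] # rejuvenate at temperature lam
    return log_z
```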
Sequential² imputation
For conjugate priors, (marginal) particle filter representation of a proposal:
$$\pi^*(z\mid x) = \pi(z_1\mid x_1)\prod_{i=2}^{n}\pi(z_i\mid x_{1:i}, z_{1:i-1})$$
with importance weight
$$\frac{\pi(z\mid x)}{\pi^*(z\mid x)} = \frac{\pi(x, z)}{m(x)}\,\frac{m(x_1)}{\pi(z_1, x_1)}\,\frac{m(z_1, x_1, x_2)}{\pi(z_1, x_1, z_2, x_2)}\cdots\frac{\pi(z_{1:n-1}, x)}{\pi(z, x)} = \frac{w(z, x)}{m(x)}$$
leading to an unbiased estimator of the evidence
$$\widehat{Z}_k(x) = \frac{1}{T}\sum_{t=1}^{T} w\big(z^{(t)}, x\big)$$
[Kong, Liu & Wong, 1994; Carvalho et al., 2010]
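A sketch of one sequential-imputation pass for a k-component mixture with symmetric Dirichlet(α) weights and conjugate G₀; the posterior-predictive density `log_post_pred(x_i, cluster_data)` is an assumed user-supplied helper (closed-form under conjugacy):

```python
# Sketch of sequential imputation: allocate the observations one at a time from
# pi(z_i | x_1:i, z_1:i-1) and accumulate the predictive terms, whose product is
# one draw of w(z, x); averaging exp(log_w) over T passes estimates Z_k(x).
import numpy as np
from scipy.special import logsumexp

def sequential_imputation_log_weight(x, k, alpha, log_post_pred, rng):
    clusters = [[] for _ in range(k)]                # data allocated to each component
    counts = np.zeros(k)
    log_w = 0.0
    for i, xi in enumerate(x):
        # log p(x_i, z_i = j | x_1:i-1, z_1:i-1), using (n_j + alpha)/(i + k alpha)
        log_probs = np.array([np.log(counts[j] + alpha) - np.log(i + k * alpha)
                              + log_post_pred(xi, clusters[j]) for j in range(k)])
        log_w += logsumexp(log_probs)                # predictive p(x_i | past)
        probs = np.exp(log_probs - logsumexp(log_probs))
        j = rng.choice(k, p=probs)                   # z_i ~ pi(z_i | x_1:i, z_1:i-1)
        clusters[j].append(xi)
        counts[j] += 1
    return log_w                                     # log w(z, x) for this pass
```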
Galactic illustration
[Figure: box plots of marginal likelihood estimates m(y) for K = 3, 5, 6, 8, comparing Chib, ChibPerm/ChibRandPerm, bridge sampling, adaptive SMC, SIS, ChibPartitions, and the arithmetic mean]
Common illustration
[Figure: log(MSE) against log(time) for ChibPartitions, bridge sampling, ChibPerm, SIS, and SMC]
Empirical conclusions
• bridge sampling, the arithmetic mean and the original Chib's method fail to scale with the sample size n
• partition-based Chib's increasingly variable with the number of components k
• adaptive SMC ultimately fails
• SIS remains the most reliable method
1 Mixtures of distributions
2 Approximations to evidence
3 Distributed evidence evaluation
4 Dirichlet process mixtures
Distributed computation
[Buchholz et al., 2022]
Divide & Conquer
1. data y divided into S batches $y_1, \ldots, y_S$ with
$$\pi(\vartheta\mid y) \propto p(y\mid\vartheta)\,\pi(\vartheta) = \prod_{s=1}^{S} p(y_s\mid\vartheta)\,\pi(\vartheta)^{1/S} = \prod_{s=1}^{S} p(y_s\mid\vartheta)\,\tilde\pi(\vartheta) \propto \prod_{s=1}^{S}\tilde\pi(\vartheta\mid y_s)$$
2. infer with the sub-posterior distributions $\tilde\pi(\vartheta\mid y_s)$, in parallel, by MCMC
3. recombine all sub-posterior samples
[Buchholz et al., 2022]
Connecting bits
While
$$m(y) = \int\prod_{s=1}^{S} p(y_s\mid\vartheta)\,\tilde\pi(\vartheta)\,\mathrm{d}\vartheta \;\ne\; \prod_{s=1}^{S}\int p(y_s\mid\vartheta)\,\tilde\pi(\vartheta)\,\mathrm{d}\vartheta = \prod_{s=1}^{S}\tilde{m}(y_s)$$
they can be connected as
$$m(y) = Z^S\,\prod_{s=1}^{S}\tilde{m}(y_s)\,\int\prod_{s=1}^{S}\tilde\pi(\vartheta\mid y_s)\,\mathrm{d}\vartheta$$
[Buchholz et al., 2022]
Connecting bits
$$m(y) = Z^S\,\prod_{s=1}^{S}\tilde{m}(y_s)\,\int\prod_{s=1}^{S}\tilde\pi(\vartheta\mid y_s)\,\mathrm{d}\vartheta$$
where
$$\tilde\pi(\vartheta\mid y_s) \propto p(y_s\mid\vartheta)\,\tilde\pi(\vartheta),\qquad \tilde{m}(y_s) = \int p(y_s\mid\vartheta)\,\tilde\pi(\vartheta)\,\mathrm{d}\vartheta,\qquad Z = \int\pi(\vartheta)^{1/S}\,\mathrm{d}\vartheta$$
[Buchholz et al., 2022]
Label unswitching worries
While $Z$ is usually closed-form,
$$I = \int\prod_{s=1}^{S}\tilde\pi(\vartheta\mid y_s)\,\mathrm{d}\vartheta$$
is not and needs to be evaluated as
$$\widehat{I} = \frac{1}{T}\sum_{t=1}^{T}\int\prod_{s=1}^{S}\tilde\pi\big(\vartheta\mid z_s^{(t)}, y_s\big)\,\mathrm{d}\vartheta$$
when
$$\tilde\pi(\vartheta\mid y_s) = \int\tilde\pi(\vartheta\mid z_s, y_s)\,\tilde\pi(z_s\mid y_s)\,\mathrm{d}z_s$$
Label unswitching worries
$$\tilde\pi(\vartheta\mid y_s) = \int\tilde\pi(\vartheta\mid z_s, y_s)\,\tilde\pi(z_s\mid y_s)\,\mathrm{d}z_s$$
Issue: with distributed computing, the shards $z_s$ are unrelated and the corresponding clusters disconnected.
Label unswitching worries
[Figure: box plots of log m(y) based on $\widehat{I}$ for K = 1, . . . , 5]
Label switching imposition
Returning to averaging across permutations, identity
$$\widehat{I}_{\text{perm}} = \frac{1}{T\,K!^{S-1}}\sum_{t=1}^{T}\;\sum_{\sigma_2,\ldots,\sigma_S\in\mathfrak{S}_K}\int\tilde\pi\big(\vartheta\mid z_1^{(t)}, y_1\big)\prod_{s=2}^{S}\tilde\pi\big(\vartheta\mid\sigma_s(z_s^{(t)}), y_s\big)\,\mathrm{d}\vartheta$$
Label switching imposition
[Figure: box plots of log m(y) under $\widehat{I}$ and $\widehat{I}_{\text{perm}}$ for K = 1, 2, 3]
Label switching imposition
The permutation-averaged identity $\widehat{I}_{\text{perm}}$ above is obtained at a heavy computational cost: $O(T)$ for $\widehat{I}$ versus $O(T\,K!^{S-1})$ for $\widehat{I}_{\text{perm}}$
Importance sampling version
Given this heavy computational cost, avoid the enumeration of permutations by using simulated values of the parameter for the reference sub-posterior as anchors towards a coherent labeling of the clusters
[Celeux, 1998; Stephens, 2000]
Importance sampling version
For each batch $s = 2, \ldots, S$, define the matching matrix
$$P_s = \begin{pmatrix} p_{s11} & \cdots & p_{s1K}\\ \vdots & \ddots & \vdots\\ p_{sK1} & \cdots & p_{sKK}\end{pmatrix}\qquad\text{where}\qquad p_{slk} = \prod_{i:z_{si}=l} p(y_{si}\mid\vartheta_k)$$
used in creating proposals
$$q_s(\sigma) \propto \prod_{k=1}^{K} p_{sk\sigma(k)}$$
that reflect the probabilities that each cluster $k$ of batch $s$ is well-matched with cluster $\sigma(k)$ of batch 1
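A sketch of how the matching matrix and the permutation proposal could be built (the per-observation log-density `logf` and the reference-batch component parameters `theta_ref` are my assumptions, with `y_s`, `z_s` as numpy arrays; the normalisation enumerates $\mathfrak{S}_K$, so it is only meant for small K):

```python
# Sketch: matching matrix P_s on the log scale and proposal q_s(sigma) over the
# permutations of {1, ..., K}, with q_s(sigma) proportional to prod_k p_{s,k,sigma(k)}.
import numpy as np
from itertools import permutations
from scipy.special import logsumexp

def log_matching_matrix(y_s, z_s, theta_ref, K, logf):
    logP = np.zeros((K, K))
    for l in range(K):
        y_l = y_s[z_s == l]                          # observations in cluster l of batch s
        for k in range(K):
            logP[l, k] = sum(logf(y, theta_ref[k]) for y in y_l)
    return logP

def permutation_proposal(logP, K):
    sigmas = list(permutations(range(K)))
    log_q = np.array([sum(logP[k, sigma[k]] for k in range(K)) for sigma in sigmas])
    return sigmas, np.exp(log_q - logsumexp(log_q))  # normalised q_s over the sigmas
```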
Importance sampling version
Considerably reduced computational cost compared to $\widehat{m}_{\widehat{I}_{\text{perm}}}(y)$
• at each iteration $t$, total cost of $O(Kn/S)$ for evaluating $P_s$
• computing the $K!$ weights of the discrete importance distribution $q_{\sigma_s}$ requires $K!$ operations
• sampling from the global discrete importance distribution requires $M^{(t)}$ basic operations
Global cost of
$$O\big(T(Kn/S + K! + \bar{M})\big)$$
for $\bar{M}$ the maximum number of importance simulations
Importance sampling version
Resulting estimator
$$\widehat{I}_{IS} = \frac{1}{T\,K!^{S-1}}\sum_{t=1}^{T}\frac{1}{M^{(t)}}\sum_{m=1}^{M^{(t)}}\frac{\chi\big(z^{(t)};\sigma_2^{(t,m)},\ldots,\sigma_S^{(t,m)}\big)}{\pi_\sigma\big(\sigma_2^{(t,m)},\ldots,\sigma_S^{(t,m)}\big)}$$
where
$$\chi\big(z^{(t)};\sigma_2,\ldots,\sigma_S\big) := \int\tilde\pi\big(\vartheta\mid z_1^{(t)}, y_1\big)\prod_{s=2}^{S}\tilde\pi\big(\vartheta\mid\sigma_s(z_s^{(t)}), y_s\big)\,\mathrm{d}\vartheta$$
Importance sampling version
[Figure: box plots of log m(y) under $\widehat{I}$, $\widehat{I}_{\text{perm}}$ and $\widehat{I}_{IS}$ for K = 1, . . . , 4]
Sequential importance sampling
Define
$$\tilde\pi_s(\vartheta) = \frac{\prod_{l=1}^{s}\tilde\pi(\vartheta\mid y_l)}{Z_s}\qquad\text{where}\qquad Z_s = \int\prod_{l=1}^{s}\tilde\pi(\vartheta\mid y_l)\,\mathrm{d}\vartheta$$
then
$$m(y) = Z^S \times m(y_1) \times \prod_{s=2}^{S}\int\tilde\pi_{s-1}(\vartheta)\,p(y_s\mid\vartheta)\,\tilde\pi(\vartheta)\,\mathrm{d}\vartheta$$
Sequential importance sampling
Calls for standard sequential importance sampling strategy
making use of the successive distributions πs(ϑ) as importance
distributions
Sequential importance sampling
[Figure: box plots of log m(y) under $\widehat{I}$, $\widehat{I}_{\text{perm}}$, $\widehat{I}_{IS}$ and SMC for K = 1, . . . , 5]
1 Mixtures of distributions
2 Approximations to evidence
3 Distributed evidence evaluation
4 Dirichlet process mixtures
Dirichlet process mixture (DPM)
Extension to the $k = \infty$ (non-parametric) case
$$x_i\mid z_i, \vartheta \overset{\text{i.i.d.}}{\sim} f(x_i\mid\vartheta_{z_i}), \quad i = 1, \ldots, n \qquad (1)$$
$$P(Z_i = k) = \pi_k, \quad k = 1, 2, \ldots$$
$$\pi_1, \pi_2, \ldots \sim \text{GEM}(M)\qquad M \sim \pi(M)$$
$$\vartheta_1, \vartheta_2, \ldots \overset{\text{i.i.d.}}{\sim} G_0$$
with GEM (Griffiths–Engen–McCloskey) defined by the stick-breaking representation
$$\pi_k = v_k\prod_{i=1}^{k-1}(1 - v_i)\qquad v_i \sim \text{Beta}(1, M)$$
[Sethuraman, 1994]
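A sketch of simulating the GEM(M) weights by (truncated) stick-breaking; the truncation level is my own device for illustration:

```python
# Sketch: truncated stick-breaking draw of the GEM(M) weights,
# pi_k = v_k * prod_{i<k} (1 - v_i) with v_i ~ Beta(1, M).
import numpy as np

def gem_weights(M, truncation, rng):
    v = rng.beta(1.0, M, size=truncation)                   # stick-breaking fractions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining                                     # pi_1, ..., pi_truncation

rng = np.random.default_rng(0)
print(gem_weights(M=1.0, truncation=50, rng=rng).sum())      # close to, but below, 1
```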
Dirichlet process mixture (DPM)
Resulting in an infinite mixture
$$x \sim \prod_{i=1}^{n}\sum_{j=1}^{\infty}\pi_j\, f(x_i\mid\vartheta_j)$$
with (prior) cluster allocation
$$\pi(z\mid M) = \frac{\Gamma(M)}{\Gamma(M+n)}\,M^{K_+}\prod_{j=1}^{K_+}\Gamma(n_j)$$
and conditional likelihood
$$p(x\mid z, M) = \prod_{j=1}^{K_+}\int\prod_{i:z_i=j} f(x_i\mid\vartheta_j)\,\mathrm{d}G_0(\vartheta_j)$$
available in closed form when $G_0$ is conjugate
Approximating the evidence
Extension of Chib's formula by marginalising over $z$ and $\vartheta$
$$m_{DP}(x) = \frac{p(x\mid M^*, G_0)\,\pi(M^*)}{\pi(M^*\mid x)}$$
and using the estimate
$$\hat\pi(M^*\mid x) = \frac{1}{T}\sum_{t=1}^{T}\pi\big(M^*\mid x, \eta^{(t)}, K_+^{(t)}\big)$$
provided the prior on $M$ is a $\Gamma(a, b)$ distribution, since
$$M\mid x, \eta, K_+ \sim \omega\,\Gamma(a + K_+, b - \log(\eta)) + (1 - \omega)\,\Gamma(a + K_+ - 1, b - \log(\eta))$$
with $\omega = (a + K_+ - 1)/\{n(b - \log(\eta)) + a + K_+ - 1\}$ and
$$\eta\mid x, M \sim \text{Beta}(M + 1, n)$$
[Basu & Chib, 2003]
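A sketch of the corresponding conditional simulations and of the conditional density of M at M*, following the mixture-of-gammas update above (gamma distributions parameterised by shape and rate b − log η; averaging the density over the Gibbs output gives π̂(M*|x)):

```python
# Sketch of the auxiliary-variable updates for the DP precision M and of the
# conditional density used in the Rao-Blackwell average hat(pi)(M* | x).
import numpy as np
from scipy.stats import gamma

def sample_eta(M, n, rng):
    return rng.beta(M + 1.0, n)                              # eta | x, M ~ Beta(M+1, n)

def sample_M(eta, K_plus, n, a, b, rng):
    rate = b - np.log(eta)
    omega = (a + K_plus - 1.0) / (n * rate + a + K_plus - 1.0)
    shape = a + K_plus if rng.uniform() < omega else a + K_plus - 1.0
    return rng.gamma(shape, 1.0 / rate)                      # numpy gamma uses scale = 1/rate

def conditional_density_M(M_star, eta, K_plus, n, a, b):
    rate = b - np.log(eta)
    omega = (a + K_plus - 1.0) / (n * rate + a + K_plus - 1.0)
    return (omega * gamma.pdf(M_star, a + K_plus, scale=1.0 / rate)
            + (1.0 - omega) * gamma.pdf(M_star, a + K_plus - 1.0, scale=1.0 / rate))
```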
Approximating the likelihood
Intractable likelihood $p(x\mid M^*, G_0)$ approximated by sequential imputation importance sampling
Generating $z$ from the proposal
$$\pi^*(z\mid x, M) = \prod_{i=1}^{n}\pi(z_i\mid x_{1:i}, z_{1:i-1}, M)$$
and using the approximation
$$\widehat{L}(x\mid M^*, G_0) = \frac{1}{T}\sum_{t=1}^{T}\hat{p}\big(x_1\mid z_1^{(t)}, G_0\big)\prod_{i=2}^{n} p\big(x_i\mid x_{1:i-1}, z_{1:i-1}^{(t)}, G_0\big)$$
[Kong, Liu & Wong, 1994; Basu & Chib, 2003]
Approximating the evidence (bis)
Reverse logistic regression applies to the DPM:
Importance functions
$$\pi_1(z, M) := \pi^*(z\mid x, M)\,\pi(M)\qquad\text{and}\qquad \pi_2(z, M) = \frac{\pi(z, M\mid x)}{m(y)}$$
$\{z^{(1,j)}, M^{(1,j)}\}_{j=1}^{T}$ and $\{z^{(2,j)}, M^{(2,j)}\}_{j=1}^{T}$ samples from $\pi_1$ and $\pi_2$
Marginal likelihood $m(y)$ estimated as the intercept of a logistic regression with covariate
$$\log\{\pi_1(z, M)/\tilde\pi_2(z, M)\}$$
on the merged sample
[Geyer, 1994; Chen & Shao, 1997]
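A sketch of the reverse-logistic-regression step, reduced to maximising the classification likelihood in the single unknown c = log m(y) (equivalent to fitting the intercept described above; the log-density evaluations at the merged sample are assumed precomputed, with π̃₂ the unnormalised version of π₂):

```python
# Sketch of reverse logistic regression (Geyer, 1994): the probability that a
# merged-sample point came from pi_2 is sigmoid(a - c) with c = log m(y), so the
# maximiser of the classification likelihood in c estimates log m(y).
import numpy as np
from scipy.optimize import minimize_scalar

def rlr_log_evidence(log_pi1_at1, log_pi2u_at1, log_pi1_at2, log_pi2u_at2):
    # *_at1: log densities at the pi_1 sample; *_at2: at the pi_2 (posterior) sample
    n1, n2 = len(log_pi1_at1), len(log_pi1_at2)
    a1 = log_pi2u_at1 - log_pi1_at1 + np.log(n2 / n1)        # logit offsets, pi_1 points
    a2 = log_pi2u_at2 - log_pi1_at2 + np.log(n2 / n1)        # logit offsets, pi_2 points
    def neg_loglik(c):
        return (np.logaddexp(0.0, a1 - c).sum()              # pi_1 points: -log(1 - sigmoid)
                + np.logaddexp(0.0, -(a2 - c)).sum())        # pi_2 points: -log(sigmoid)
    return minimize_scalar(neg_loglik).x                     # hat(c) = log hat(m)(y)
```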
Galactic illustration
[Figure: box plots of m(x) estimates for n = 6, 36, 82, comparing the arithmetic mean, the harmonic mean, Chib, RLR-SIS, and RLR-Prior]
Galactic illustration
[Figure: log(MSE) against log(time) for Chib, RLR-SIS, and RLR-Prior]
Towards new adventures
Exumas, Bahamas, 10 August 2011
