Considerate Approaches to ABC Model Selection

Michael P.H. Stumpf, Christopher Barnes, Sarah Filippi, Thomas Thorne

Theoretical Systems Biology Group

26/06/2012

Considerate Approaches to ABC Model Selection Stumpf et al. 1 of 15

Evolving Networks

[Figure: four models of network evolution: (a) duplication attachment; (b) duplication attachment with complementarity; (c) linear preferential attachment; (d) general scale-free. The diagram carries the labels wi and wj.]

Inference and Model Selection
We have observed data, D, that was generated by some system that
we seek to describe by a mathematical model. In principle we can
have a model-set, M = {M1 , . . . , Mν }, where each model Mi has an
associated parameter θi .
We may know the different constituent parts of the system, Xi , and
have measurements for some or all of them under some experimental
designs, T .


Model Posterior

\[
\Pr(M_i \mid T, D) = \frac{\Pr(D \mid M_i, T)\,\pi(M_i)}{\sum_{j=1}^{\nu} \Pr(D \mid M_j, T)\,\pi(M_j)}
\]

Here $\Pr(D \mid M_i, T)$ is the likelihood, $\pi(M_i)$ is the prior, and the denominator is the evidence.

For complicated models and/or detailed data the likelihood evaluation can become prohibitively expensive.
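
As a small numerical illustration of the model posterior, the sketch below normalises likelihood times prior over a model set. It assumes the log-likelihoods log Pr(D | Mj, T) are already available, which is exactly the part that is usually expensive. The function name `model_posterior` and the example numbers are hypothetical and not taken from the slides.

```python
import math

def model_posterior(log_likelihoods, priors):
    """Return Pr(M_i | T, D) for every model in the set.

    log_likelihoods[i] is log Pr(D | M_i, T) and priors[i] is pi(M_i).
    Working in log space with a max-shift avoids numerical underflow.
    """
    log_joint = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    shift = max(log_joint)
    weights = [math.exp(v - shift) for v in log_joint]
    evidence = sum(weights)  # the denominator, up to the common factor exp(shift)
    return [w / evidence for w in weights]

# Hypothetical numbers: two models with equal prior weight.
posterior = model_posterior([-10.0, -12.0], [0.5, 0.5])
```

With equal priors, a difference of two log units in the likelihood already puts most of the posterior mass on the first model.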

Approximate Inference
We can approximate the likelihood and/or the models. The “true”
model is unlikely to be in M anyway.


Approximate Bayesian Computation
We can define the posterior as
\[
p(\theta_i \mid x) = \frac{f(x \mid \theta_i)\,\pi(\theta_i)}{p(x)}.
\]
Here $f(x \mid \theta_i)$ is the likelihood, which is often hard to evaluate; consider for example
\[
\tilde{y} = \max[0,\; y + g_1 + y \times g_2] \quad \text{with } g_1, g_2 \sim N(0, \sigma_{1/2}) \quad \text{and} \quad \frac{dy}{dt} = g(y; \theta).
\]
But we can still simulate from the data-generating model, whence
\[
p(\theta_i \mid x) = \int_X \frac{\mathbb{1}(y = x)\, f(y \mid \theta_i)\,\pi(\theta_i)}{p(x)}\, dy
\approx \int_X \frac{\mathbb{1}\big(\Delta(y, x) < \epsilon\big)\, f(y \mid \theta_i)\,\pi(\theta_i)}{p(x)}\, dy.
\]

Solutions for Complex Problems (?)
Approximate (i) data, (ii) model or (iii) distance.
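
The indicator approximation 1(Δ(y, x) < ε) corresponds to a simple rejection sampler: draw θ from the prior, simulate y, and keep θ when y lands within ε of the data. The following is a minimal sketch of that idea. The toy setting (one observation from N(θ, 1) with a uniform prior) and all function names are assumptions made for illustration.

```python
import random

def abc_rejection(observed, prior_sample, simulate, distance, epsilon, n_accept, rng):
    """Collect n_accept prior draws whose simulated data lie within epsilon of the data."""
    accepted = []
    while len(accepted) < n_accept:
        theta = prior_sample(rng)             # theta ~ pi(theta)
        y = simulate(theta, rng)              # y ~ f(y | theta)
        if distance(y, observed) < epsilon:   # indicator 1(Delta(y, x) < epsilon)
            accepted.append(theta)
    return accepted

# Toy setting (assumed): a single observation x = 1.0 from N(theta, 1),
# with a uniform prior on [-5, 5] and absolute difference as the distance.
rng = random.Random(0)
samples = abc_rejection(
    observed=1.0,
    prior_sample=lambda r: r.uniform(-5.0, 5.0),
    simulate=lambda theta, r: r.gauss(theta, 1.0),
    distance=lambda y, x: abs(y - x),
    epsilon=0.2,
    n_accept=500,
    rng=rng,
)
posterior_mean = sum(samples) / len(samples)
```

Smaller values of `epsilon` bring the accepted draws closer to the true posterior at the price of more rejected simulations.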


ABC with Summary Statistics
If the data, D, are very complex and detailed, direct comparison
between real and simulated data becomes prohibitive. In such
situations, which originally motivated ABC approaches, summary
statistics of the data are compared. We then have

\[
p_{S,\epsilon}(\theta_i \mid D) \propto \int_X \mathbb{1}\big(\Delta(S(x), S(y_\theta)) < \epsilon\big)\, f(y \mid \theta_i)\,\pi(\theta_i)\, dy
\]

Sufficient Statistics
This only works if the statistic S(·) is sufficient, i.e. if for s = S(x) we have
\[
p(x \mid s, \theta) = p(x \mid s).
\]

Sufficiency for Model Selection
If S(·) is sufficient for parameter estimation (in all models i considered) it is not necessarily sufficient for model selection (Robert et al., PNAS (2011)).

Generate data X ∼ N(1, 1) and use ABC to infer µ (assuming that σ² = 1 is known).

[Figure: four histograms of the ABC output for θ (x-axis from −4 to 4), one per summary statistic: mean, var, min, max.]

Role of Summary Statistics
• Mean (sufficient) correctly infers µ.
• Max/Min capture some information on µ.
• Var fails to capture any information on µ.

We need a way of constructing sets of statistics that together are (approximately) sufficient.
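
The contrast between the mean and the variance as summary statistics can be reproduced with a few lines of rejection ABC. This is a sketch under assumptions that are not in the slides: a uniform prior on [−4, 4] (chosen to match the plotted range), 20 observations, and a tolerance of 0.2. All helper names are hypothetical.

```python
import random

def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_var(xs):
    m = sample_mean(xs)
    return sum((v - m) ** 2 for v in xs) / (len(xs) - 1)

def abc_posterior(observed, statistic, epsilon, n_accept, rng):
    """Rejection ABC for mu in N(mu, 1), comparing one summary statistic."""
    target = statistic(observed)
    accepted = []
    while len(accepted) < n_accept:
        mu = rng.uniform(-4.0, 4.0)                        # assumed uniform prior
        y = [rng.gauss(mu, 1.0) for _ in range(len(observed))]
        if abs(statistic(y) - target) < epsilon:
            accepted.append(mu)
    return accepted

def spread(samples):
    """Standard deviation of the accepted values."""
    return sample_var(samples) ** 0.5

rng = random.Random(1)
data = [rng.gauss(1.0, 1.0) for _ in range(20)]            # X ~ N(1, 1)
post_mean_stat = abc_posterior(data, sample_mean, 0.2, 300, rng)
post_var_stat = abc_posterior(data, sample_var, 0.2, 300, rng)
```

Using the mean, the accepted values concentrate around the sample mean. Using the variance, whose distribution does not depend on µ, the accepted values stay as spread out as the prior.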


A Closer Look at Summary Statistics
We interpret a summary statistic as a function,
\[
S : \mathbb{R}^d \to \mathbb{R}^w, \qquad S(x) = s.
\]
If S is sufficient then (we include the model indicator variable in θ)
\[
p(\theta \mid x) = p(\theta \mid s).
\]

Information Theoretical Perspective
A summary statistic is an information compression device. Now let S be a set of statistics which together are sufficient. Then the mutual information
\[
I(\Theta; X) = \int_\Omega \int_X p(\theta, x) \log \frac{p(\theta, x)}{p(\theta)\,p(x)}\, d\theta\, dx = I(\Theta; S).
\]

Constructing Minimally Sufficient Summary Statistics
We seek the set U ⊆ S with minimal cardinality such that
\[
I(\Theta; S) = I(\Theta; U).
\]
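
The equality of mutual information under a sufficient statistic can be checked exactly on a small discrete model. The sketch below is an illustrative assumption, not an example from the slides: θ takes the values 0.2 or 0.8 with equal prior probability, and X consists of three Bernoulli(θ) trials. It compares the full data, the sum (sufficient) and the first trial alone (not sufficient).

```python
import itertools
import math
from collections import defaultdict

def mutual_information(joint):
    """I(Theta; V) in nats for a joint table {(theta, value): probability}."""
    p_theta, p_value = defaultdict(float), defaultdict(float)
    for (theta, value), p in joint.items():
        p_theta[theta] += p
        p_value[value] += p
    return sum(
        p * math.log(p / (p_theta[theta] * p_value[value]))
        for (theta, value), p in joint.items()
        if p > 0
    )

def joint_of_statistic(statistic, thetas=(0.2, 0.8), n_trials=3):
    """Joint distribution of theta and statistic(x) for Bernoulli(theta) trials."""
    joint = defaultdict(float)
    for theta in thetas:
        for x in itertools.product((0, 1), repeat=n_trials):
            p_x = math.prod(theta if xi else 1.0 - theta for xi in x)
            joint[(theta, statistic(x))] += p_x / len(thetas)  # uniform prior on thetas
    return dict(joint)

i_full = mutual_information(joint_of_statistic(lambda x: x))      # I(Theta; X)
i_sum = mutual_information(joint_of_statistic(sum))               # sufficient statistic
i_first = mutual_information(joint_of_statistic(lambda x: x[0]))  # first trial only
```

Here `i_sum` matches `i_full` up to floating-point error, while `i_first` is strictly smaller, so the first trial on its own discards information about θ.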


Constructing Sufﬁcient Statistics

Proposition
Let X be a random variable generated according to f(·|θ). Let S be a summary statistic and U and T two subsets of S such that U = U(X), T = T(X) and S = S(X) satisfy U ⊂ T ⊂ S. We have
\[
I(\Theta; S \mid T) = I(\Theta; S \mid U) - I(\Theta; T \mid U).
\]

In order to construct a subset T of S such that I(Θ; S | T) = 0, it is thus sufficient to add statistics from S one by one until the condition holds. If we denote by $S_{(k)}$ the kth statistic to be added (with k ≤ w) we have $S_{(k)} = S_{(k)}(X)$, and then
\[
I(\Theta; S \mid S_{(1)}, \ldots, S_{(k+1)}) \le I(\Theta; S \mid S_{(1)}, \ldots, S_{(k)}).
\]

\[
\begin{aligned}
I(\Theta; S \mid U) &= \int_\Omega \int_X p(\theta, S(x), U(x)) \log \frac{p(\theta, S(x) \mid U(x))}{p(\theta \mid U(x))\, p(S(x) \mid U(x))}\, dx\, d\theta \\
&= \int_X p(S(x)) \left[ \mathrm{KL}\big(p(\Theta \mid S(x)) \,\|\, p(\Theta \mid U(x))\big) \right] dx \\
&= E_{p(X)} \left[ \mathrm{KL}\big(p(\Theta \mid S(X)) \,\|\, p(\Theta \mid U(X))\big) \right]
\end{aligned}
\]

An Impossible Algorithm
• for all subsets u* ⊆ s*, perform ABC to obtain estimates p(Θ | u*)
• determine the set
  A = {u* ⊂ s* such that KL(p(Θ | s*) || p(Θ | u*)) = 0}
• the desired subset is $\arg\min_{u^* \in A} |u^*|$


input: a sufficient set of statistics whose values on the dataset are s* = {s1, ..., sw}, and a threshold δ
output: a subset v* of s*
choose u* at random from s*
v* ← {u*}
q* ← s* \ v*
repeat
    repeat
        if q* = Ø then return v*
        end if
        choose u* at random from q*
        q* ← q* \ {u*}
        perform ABC to obtain p(Θ | v*, u*)
    until KL(p(Θ | v*, u*) || p(Θ | v*)) ≥ δ
    optionally: v* ← OrderDependency(v*, u*)
    v* ← v* ∪ {u*}
    q* ← s* \ v*
until q* = Ø
return v*
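
The greedy procedure can be sketched in Python as below. This is an illustration under assumptions rather than the authors' implementation: `posterior` stands for any routine that runs ABC with a given list of statistics and returns a posterior estimate, `divergence` stands for any KL estimator, the optional OrderDependency step is omitted, and `histogram_kl` is one crude, hypothetical way to estimate the KL divergence from two sets of samples.

```python
import math
import random

def histogram_kl(p_samples, q_samples, lo, hi, bins=20):
    """Crude estimate of KL(p || q) from two sample sets via smoothed histograms."""
    def hist(samples):
        counts = [1.0] * bins                      # add-one smoothing avoids log(0)
        for v in samples:
            if lo <= v < hi:
                k = min(bins - 1, int((v - lo) / (hi - lo) * bins))
                counts[k] += 1.0
        total = sum(counts)
        return [c / total for c in counts]
    p, q = hist(p_samples), hist(q_samples)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def select_statistics(stats, posterior, divergence, delta, rng):
    """Greedily add statistics until no remaining one changes the posterior by delta."""
    selected = [rng.choice(stats)]                 # random starting statistic
    while True:
        pool = [s for s in stats if s not in selected]
        rng.shuffle(pool)                          # candidates are tried in random order
        base = posterior(selected)
        for candidate in pool:
            extended = posterior(selected + [candidate])
            if divergence(extended, base) >= delta:
                selected.append(candidate)         # informative: keep it, reset the pool
                break
        else:
            # Pool exhausted (or empty): nothing left adds information.
            return selected
```

A call such as `select_statistics(names, run_abc, kl, 0.1, random.Random(0))` would return the chosen subset, where `run_abc` and `kl` are user-supplied callables. Each acceptance resets the candidate pool, as in the pseudocode.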

Examples: Normal Distributions

$y_1, \ldots, y_d \sim N(\mu, \sigma_1^2)$ and $y_1, \ldots, y_d \sim N(\mu, \sigma_2^2)$

[Figure: two panels; x-axis: statistics (mean, S2, range, max, random); y-axis: Run (up to 100).]

[Figure: two scatter plots of log(BF) ABC against log(BF) predicted, both axes running from −2 to 8.]


Examples: Population Genetics

[Figure: three panels (Constant Population Size; Exponential Population Growth; Two-Island Model with Migration); x-axis: statistics S1 to S11; y-axis: Run (up to 100).]

[S1] Number of Segregating Sites; [S2] Number of Distinct Haplotypes; [S3] Haplotype Homozygosity; [S4] Average SNP Homozygosity; [S5] Number of occurrences of most common haplotype; [S6] Mean number of pair-wise differences between haplotypes; [S7] Number of Singleton Haplotypes; [S8] Number of Singleton SNPs; [S9] Linkage Disequilibrium.

Summary Statistic Choice
The choice of summary statistics appears to depend subtly on the true data-generating model. In light of coalescent processes this is, however, to be expected.


Examples: Random Walks

[Figure: three panels (Classical Random Walk; Persistent Random Walk; Biased Random Walk); x-axis: statistics S1 to S5; y-axis: Run (up to 100).]

[S1] Mean square displacement; [S2] Mean x and y displacement; [S3] Mean square x and y displacement; [S4] Straightness index; [S5] Eigenvalues of gyration tensor.

Parameter Sufficiency for Complex Problems
Here all statistics that have been chosen for parameter estimation are also chosen for model selection.


Conditioning on Information

[Diagram: Θ above the statistics s1, s2, s3; in a second version, Θ above s1, s2 and the full data x.]

Statistics
Sufficient: Implicates same area as full data.
Ancillary: Implicates all values of θ equally.

What is the meaning of p(θ | s0, s1, ..., sn)?
Let s = (s0, s1, ..., sn), and assume I(θ, s) < I(θ, x) but ε → 0. This can happen for sufficient and ancillary s. In the latter case we obtain
\[
p(\theta \mid s) = \pi(\theta).
\]

How about p(t | s) if s is not (quite) sufficient?

Model Selection vs. Model Checking

Model Selection: Several models M ∈ M are compared and one or
more are chosen in light of the data: Find models which
are better than others.
Model Checking: The quality of a model Mi is assessed against the
available data: Determine if a model is actually ‘good’.
Alternative Approach: ABCµ [Ratmann et al., PNAS].


Posterior Predictive Checks
We are interested in the posterior predictive distribution,
\[
p(t(X) \mid s(X)) = \int_\Theta p(t(X) \mid \theta)\, p(\theta \mid s(X))\, d\theta.
\]
In particular we have
\[
p(s(X) \mid s(X)) \neq p(s(X) \mid X)
\]
unless t(X) is sufficient.
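
A posterior predictive check can be sketched by pushing posterior draws back through the simulator and comparing a test statistic t with its observed value. In this illustrative sketch the "posterior draws" are a stand-in generated directly rather than by ABC, the model is N(θ, 1) with 30 observations, and t is the sample maximum. All of these choices and names are assumptions.

```python
import random

def posterior_predictive(theta_samples, simulate, statistic, rng):
    """Simulate one dataset per posterior draw and return the values of t."""
    return [statistic(simulate(theta, rng)) for theta in theta_samples]

def predictive_p_value(predicted, observed_value):
    """Fraction of predicted statistics at least as large as the observed one."""
    return sum(1 for t in predicted if t >= observed_value) / len(predicted)

rng = random.Random(2)
theta_draws = [rng.gauss(1.0, 0.2) for _ in range(200)]    # stand-in for ABC output
simulate = lambda theta, r: [r.gauss(theta, 1.0) for _ in range(30)]
predicted_max = posterior_predictive(theta_draws, simulate, max, rng)
```

A value of `predictive_p_value(predicted_max, observed_max)` close to 0 or 1 would indicate that the model reproduces the chosen statistic poorly.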


ABC on Network Data

[Figure: the four network evolution models: (e) duplication attachment; (f) duplication attachment with complementarity; (g) linear preferential attachment; (h) general scale-free.]

Summarizing Networks
• Data are noisy and incomplete.
• We can simulate models of network evolution, but likelihoods can be calculated only for very trivial models.
• There is also no sufficient statistic that would allow us to summarize networks, so ABC approaches require some thought.
• Many possible summary statistics of networks are expensive to calculate.

Full likelihood: Wiuf et al., PNAS (2006).
ABC: Ratmann et al., PLoS Comp. Biol. (2008).
ABC (better): Thorne & Stumpf, J. Roy. Soc. Interface (2012).
Stumpf & Wiuf, J. Roy. Soc. Interface (2010).


Spectral Distances
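[Figure: example graph on the five nodes a, b, c, d, e.]

With rows and columns ordered a, b, c, d, e, its adjacency matrix is
\[
A = \begin{pmatrix}
0 & 1 & 1 & 1 & 0 \\
1 & 0 & 1 & 1 & 0 \\
1 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0
\end{pmatrix}
\]

Graph Spectra
Given a graph G with nodes N and edges (i, j) ∈ E with i, j ∈ N, the adjacency matrix, A, of the graph is defined by
\[
a_{i,j} = \begin{cases} 1 & \text{if } (i, j) \in E, \\ 0 & \text{otherwise.} \end{cases}
\]
The eigenvalues, λ, of this matrix provide one way of defining the graph spectrum.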
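
To make the definition concrete, the sketch below builds the adjacency matrix of the five-node example graph and computes its spectrum. The edge list is read off the matrix shown on the slide. To stay within the standard library, the eigenvalues are obtained with a plain cyclic Jacobi iteration for symmetric matrices; this is an illustrative choice, and in practice a linear algebra library would normally be used. The function names are hypothetical.

```python
import math

def adjacency_matrix(nodes, edges):
    """Symmetric 0/1 adjacency matrix of an undirected graph."""
    index = {name: k for k, name in enumerate(nodes)}
    a = [[0] * len(nodes) for _ in nodes]
    for i, j in edges:
        a[index[i]][index[j]] = 1
        a[index[j]][index[i]] = 1
    return a

def symmetric_eigenvalues(matrix, max_sweeps=100, tol=1e-20):
    """Eigenvalues of a real symmetric matrix via cyclic Jacobi rotations."""
    a = [[float(v) for v in row] for row in matrix]
    n = len(a)
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:                     # off-diagonal mass has vanished
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                if a[p][p] == a[q][q]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):        # update columns p and q: A <- A J
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):        # update rows p and q: A <- J^T A
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))

nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("d", "e")]
A = adjacency_matrix(nodes, edges)
spectrum = symmetric_eigenvalues(A)
```

Because A has a zero diagonal, the eigenvalues sum to zero, and their squares sum to twice the number of edges. Both facts give a quick sanity check on the computed spectrum.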

Spectral Distances
A simple distance measure between graphs having adjacency matrices A and B, known as the edit distance, is to count the number of edges that are not shared by both graphs,
\[
D(A, B) = \sum_{i,j} (a_{i,j} - b_{i,j})^2.
\]

However for unlabelled graphs we require some mapping h from i ∈ N_A to i′ ∈ N_B that minimizes the distance
\[
D(A, B) \le D_h(A, B) = \sum_{i,j} (a_{i,j} - b_{h(i),h(j)})^2.
\]

Given a spectrum (which is relatively cheap to compute) we have
\[
D(A, B) = \sum_l \left( \lambda_l^{(\alpha)} - \lambda_l^{(\beta)} \right)^2.
\]
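
The three distances can be compared on very small graphs. The sketch below is illustrative only: the minimisation over mappings h is done by brute force over all permutations, which is feasible only for a handful of nodes, and the spectra of the three-node path and the triangle are entered by hand as known values rather than computed. Function names are hypothetical.

```python
import itertools

def edit_distance(a, b):
    """Sum over all (i, j) of squared differences; each undirected edge counts twice."""
    n = len(a)
    return sum((a[i][j] - b[i][j]) ** 2 for i in range(n) for j in range(n))

def unlabelled_distance(a, b):
    """Minimum of D_h over every node mapping h (brute force over permutations)."""
    n = len(a)
    return min(
        sum((a[i][j] - b[h[i]][h[j]]) ** 2 for i in range(n) for j in range(n))
        for h in itertools.permutations(range(n))
    )

def spectral_distance(spectrum_a, spectrum_b):
    """Sum of squared differences between the sorted eigenvalues."""
    pairs = zip(sorted(spectrum_a), sorted(spectrum_b))
    return sum((la - lb) ** 2 for la, lb in pairs)

# Two labellings of the same three-node path, and a triangle.
path_1 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]    # edges 0-1 and 1-2
path_2 = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]    # edges 0-1 and 0-2
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]

root2 = 2 ** 0.5
path_spectrum = [-root2, 0.0, root2]          # known spectrum of the three-node path
triangle_spectrum = [-1.0, -1.0, 2.0]         # known spectrum of the triangle
```

The labelled edit distance between `path_1` and `path_2` is non-zero even though they are the same graph, whereas the minimised distance is zero. The spectral distance does not depend on the labelling and avoids the search over mappings.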


Protein Interaction Network Data
Species          Proteins  Interactions  Genome size  Sampling fraction
S. cerevisiae    5035      22118         6532         0.77
D. melanogaster  7506      22871         14076        0.53
H. pylori        715       1423          1589         0.45
E. coli          1888      7008          5416         0.35

[Figure: model probability (0.0 to 0.5) for the models DA, DAC, LPA, SF, DACL and DACR, shown per organism: S. cerevisiae, D. melanogaster, H. pylori, E. coli.]

Model Selection
• Inference here was based on all the data, not summary statistics.
• Duplication models receive the strongest support from the data.
• Several models receive support and no model is chosen unambiguously.

[Figure: parameter panels per model and organism (S. cerevisiae, D. melanogaster, H. pylori, E. coli): δ and α for DA and DAC; δ, α, p and m for DACL and DACR.]


Considerate Use of ABC

• ABC is a tool for situations where conventional statistical
approaches fail or are too cumbersome.
• If all the data are used then this is (relatively) unproblematic; if the
data are compressed/corrupted then caution is required.
• Some of the issues arising in ABC mirror those also encountered in “conventional” statistics:
    “Any Bayesian inference uses the data only via the minimal sufficient statistic. This is because the calculation of the posterior distribution involves multiplying the likelihood by the prior and normalizing. Any factor of the likelihood that is a function of y alone will disappear after normalization.”
    D. Cox (2006).
• In other cases it seems prudent to accept the additional (and considerable) computational cost of constructing suitable summary statistics (such as in Barnes et al., Stat. Comput. (2012)).

Acknowledgements