Bayesian inference for mixed-effects
models driven by SDEs and other
stochastic models: a scalable approach.
Umberto Picchini
Dept. Mathematical Sciences, Chalmers and Gothenburg University
@uPicchini
Statistics seminar at Maths dept., Bristol University, 1 April, 2022
A classical problem of interest in biomedicine is the analysis of
repeated measurements data.
For example modelling repeated measurements of drug
concentrations (pharmacokinetics/pharmacodynamics)
Here we have concentrations of theophylline across 12 subjects.
Tumor growth in mice¹

Figure: log tumor volume (mm³) vs time (days), group 3.

Modelling tumor growth in 8 mice (we compared different treatments).

¹ P and Forman (2019). Journal of the Royal Statistical Society: Series C.
Neuronal data:
Figure 1: Depolarization [mV] vs time [sec].

We may focus on what happens between spikes, the so-called inter-spike-interval data (ISIs).
Inter-spike-interval data (ISIs):

Figure 2: Observations from 100 ISIs (depolarization [mV] vs time [msec]).

P, Ditlevsen, De Gaetano and Lansky (2008). Parameters of the diffusion leaky integrate-and-fire neuronal model for a slowly fluctuating signal. Neural Computation, 20(11), 2696-2714.
With mixed-effects models (aka random-effects) we fit
simultaneously discretely observed data from M “subjects”
(= units).
The reason to do this is to perform inference at the population
level and better account for all information at hand.
Assume for example that for some covariate X
    y^i_j = X^i_j(φ^i) + ε^i_j,    i = 1, ..., M;  j = 1, ..., n_i    (y^i_j = observation j in unit i)
    φ^i ∼ N(η, σ²_η),    individual random effects

The random effects have “population mean” η and “population variance” σ²_η.

It is typically of interest to estimate the population parameters (η, σ²_η), not the subject-specific φ^i.
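For concreteness, here is a toy illustration of this setup (my own sketch, not from the talk): each of M subjects has its own parameter φ_i drawn from the population distribution, and y_ij = X_ij(φ_i) + ε_ij, where I arbitrarily take X_ij(φ_i) to be an exponential decay evaluated at time t_j.

```python
# Toy repeated-measurements simulation: subject-specific decay rates drawn from
# a population distribution, observed with additive noise. Values are illustrative.
import numpy as np

rng = np.random.default_rng(42)
M, n = 12, 10                          # subjects and observations per subject
t = np.linspace(0.5, 10.0, n)          # common observation times
eta, sigma_eta, sigma_eps = 0.4, 0.1, 0.05
phi = rng.normal(eta, sigma_eta, size=M)                  # phi_i ~ N(eta, sigma_eta^2)
y = np.exp(-phi[:, None] * t) + rng.normal(0, sigma_eps, size=(M, n))   # y_ij
# Inference targets the population parameters (eta, sigma_eta^2), not the phi_i.
```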
So in this case each trajectory is guided by its own φ^i. However, all trajectories have something in common, the shared parameters (η, σ²_η), since each φ^i ∼ N(η, σ²_η).
Mixed-effect methodology is now standard.
About 40 years of literature available.
It can become tricky, though, to use this methodology when the data are observations from stochastic processes.

For example, when the mixed-effects models are driven by stochastic differential equations (SDEs).

There exist about 50 papers on fitting SDEs with mixed effects, but these always impose some constraint that makes the models not very general.

https://umbertopicchini.github.io/sdemem/
(this slide: courtesy of Susanne Ditlevsen)
 
The concentration of a drug in blood: observed concentrations vs time in minutes.

Exponential decay:
    dC(t)/dt = −µC(t),    C(t) = C(0) e^{−µt}

Exponential decay with noise:
    dC(t) = −µC(t) dt + σC(t) dW(t),    C(t) = C(0) exp{−(µ + σ²/2) t + σW(t)}

Different realizations:
    the same SDE, dC(t) = −µC(t) dt + σC(t) dW(t), produces a different trajectory for each realization of W(t).
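A minimal sketch (my own, with illustrative parameter values) of the “exponential decay with noise” model, simulated through its exact solution and compared with the deterministic decay:

```python
# Simulate a few realizations of dC = -mu*C dt + sigma*C dW via the exact
# solution C(t) = C(0) exp(-(mu + sigma^2/2) t + sigma W(t)); the deterministic
# decay C(0) exp(-mu t) is kept for comparison. Parameter values are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, C0 = 0.03, 0.05, 100.0
t = np.linspace(0.0, 120.0, 121)                  # minutes
dt = np.diff(t, prepend=0.0)                      # grid increments (first is 0)
ode = C0 * np.exp(-mu * t)                        # exponential decay, no noise
paths = []
for _ in range(5):                                # five SDE realizations
    W = np.cumsum(rng.normal(0.0, np.sqrt(dt)))   # Brownian motion on the grid, W(0)=0
    paths.append(C0 * np.exp(-(mu + 0.5 * sigma**2) * t + sigma * W))
```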
Stochastic differential equation
mixed-effects models
(SDEMEMs)
SDEMEMs: model structure

The state-space SDEMEM follows

    Y^i_{t_j} = h(X^i_{t_j}, ε^i_{t_j}),    ε^i_{t_j} | ξ ∼ p(ξ) independently,    j = 1, ..., n_i
    dX^i_t = α(X^i_t, φ^i) dt + √β(X^i_t, φ^i) dW^i_t,    i = 1, ..., M
    φ^i ∼ π(φ^i | η)
    dW^i_t ∼ iid N(0, dt)

φ^i and η are vectors of random and fixed (population) parameters.

• Example: Y^i_{t_j} = X^i_{t_j} + ε^i_{t_j}, but we are allowed to take h(·) nonlinear with non-additive errors.
• Latent diffusions X^i_t share a common functional form, but have individual parameters φ^i, and are driven by individual Brownian motions W^i_t.





    Y^i_t = h(X^i_t, ε^i_t),    ε^i_t | ξ ∼ p(ξ) independently,    j = 1, ..., n_i
    dX^i_t = α(X^i_t, φ^i) dt + √β(X^i_t, φ^i) dW^i_t,    i = 1, ..., M
    φ^i ∼ π(φ^i | η)

SDEMEMs are flexible: they allow three levels of variation to be modelled.
• Intra-subject random variability, modelled by the diffusion process X^i_t.
• Variation between different units, taken into account via the (assumed) distribution of the φ^i's.
• Residual variation, modelled via measurement error with parameter ξ.

Goal: exact Bayesian inference for θ = [η, ξ].
What we want to do is:
produce (virtually) exact Bayesian inference for general,
nonlinear SDEMEMs.
“General” means:
• the SDEs can be nonlinear in the states X_t;
• the error model for Y_t does not have to be linear in X_t;
• the error model does not have to be additive, i.e. it does not have to be of the type Y_t = F · X_t + ε_t;
• ε_t does not have to be Gaussian;
• the random effects φ^i can have any distribution.

What we come up with is essentially an instance of the pseudomarginal method (Andrieu and Roberts, 2009), embedded into a Gibbs sampler with careful use of blocking strategies (and more...).
As sometimes happens, independent work similar to ours was carried out simultaneously in

Botha, I., Kohn, R. and Drovandi, C. (2021). Particle methods for stochastic differential equation mixed effects models. Bayesian Analysis, 16(2), 575-609.
Bayesian inference for SDEMEMs
The joint posterior

    Y^i_t = h(X^i_t, ε^i_t),    ε^i_t | ξ ∼ p(ξ) independently,    j = 1, ..., n_i
    dX^i_t = α(X^i_t, φ^i) dt + √β(X^i_t, φ^i) dW^i_t,    i = 1, ..., M
    φ^i ∼ π(φ^i | η),    i = 1, ..., M

• observed data y = (Y^i_{1:n_i})_{i=1}^M across M individuals;
• latent x = (X^i_{1:n_i})_{i=1}^M at discrete time points.

We have the joint posterior

    π(η, ξ, φ, x | y) ∝ π(η) π(ξ) π(φ | η) π(x | φ) π(y | x, ξ),

where (from now on assume n_i ≡ n for all units)

    π(φ | η) = ∏_{i=1}^M π(φ^i | η),
    π(x | φ) = ∏_{i=1}^M π(x^i_1) ∏_{j=2}^n π(x^i_j | x^i_{j−1}, φ^i)    (Markovianity),
    π(y | x, ξ) = ∏_{i=1}^M ∏_{j=1}^n π(y^i_j | x^i_j, ξ)    (conditional independence).
    π(η, ξ, φ, x | y) ∝ π(η) π(ξ) π(φ | η) π(x | φ) π(y | x, ξ).

While several components of the joint π(η, ξ, φ, x | y) may have tractable conditionals, sampling from such a joint posterior can still be a horrendous task → slow exploration of the parameter surface.

The reason is that the unknown parameters and x are highly correlated; hence a Gibbs sampler would mix very badly.

The best is, in fact, to sample from either of the following marginals

    π(η, ξ, φ | y) = ∫ π(η, ξ, φ, x | y) dx

or

    π(η, ξ | y) = ∫∫ π(η, ξ, φ, x | y) dx dφ.
Marginal posterior over parameters and random effects

• By integrating x out, the resulting marginal is

    π(η, ξ, φ | y) ∝ π(η) π(ξ) ∏_{i=1}^M π(φ^i | η) π(y^i | ξ, φ^i).

• The data likelihood π(y^i | ξ, φ^i) for the generic i-th unit is

    π(y^i | ξ, φ^i) ∝ ∫ ∏_{j=1}^n π(y^i_j | x^i_j, ξ) × π(x^i_1) ∏_{j=2}^n π(x^i_j | x^i_{j−1}, φ^i) dx^i_{1:n}.

• Typical problem: the transition density π(x^i_j | x^i_{j−1}, ·) is unknown;
• the integral is generally intractable, but we can estimate it via Monte Carlo;
• for (very) simple cases, such as linear SDEs, we can apply the Kalman filter and obtain an exact solution (see the sketch below).
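As a concrete illustration of the last bullet, here is a minimal sketch (my own, not code from the paper) of the exact Kalman-filter likelihood for one linear-Gaussian special case: an OU latent process dX = θ1(θ2 − X)dt + θ3 dW observed with additive Gaussian noise. All function and variable names are illustrative.

```python
# Exact likelihood pi(y^i | xi, phi^i) via the Kalman filter for the scalar
# OU state-space model: exact Gaussian OU transition plus y_j ~ N(x_j, sigma_eps^2).
import numpy as np

def kalman_ou_loglik(y, theta, sigma_eps, h, m0, P0):
    th1, th2, th3 = theta
    A = np.exp(-th1 * h)                        # exact OU transition x' | x ~ N(A x + b, Q)
    b = th2 * (1.0 - A)
    Q = th3**2 * (1.0 - A**2) / (2.0 * th1)
    m, P = m0, P0                               # Gaussian prior on the first latent state
    loglik = 0.0
    for yj in y:
        S = P + sigma_eps**2                    # predictive variance of y_j
        loglik += -0.5 * (np.log(2 * np.pi * S) + (yj - m)**2 / S)
        K = P / S                               # Kalman gain
        m, P = m + K * (yj - m), (1.0 - K) * P  # filtering update
        m, P = A * m + b, A**2 * P + Q          # predict the next latent state
    return loglik
```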
We assume a generic nonlinear SDE.

• The transition density π(x^i_j | x^i_{j−1}, φ^i) is unknown;
• luckily, we can still approximate the likelihood integral unbiasedly;
• sequential Monte Carlo (SMC) can be used for the task;
• when π(x^i_j | x^i_{j−1}, φ^i) is unknown, we are still able to run a numerical discretization method with step size h > 0 and simulate from the approximation π_h(x^i_j | x^i_{j−1}, φ^i), e.g.

    x^i_{t+h} = x^i_t + α(x^i_t, φ^i) h + √β(x^i_t, φ^i) · u^i_t,    u^i_t ∼ iid N(0, h);

  this is the Euler-Maruyama discretization scheme (it is possible to use more advanced schemes; see the sketch below).

Hence x^i_{t+h} | x^i_t ∼ π_h(x^i_{t+h} | x^i_t, φ^i) (which is clearly Gaussian).
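A minimal sketch of the Euler-Maruyama scheme above (my own illustration; the drift/diffusion functions and parameter values are arbitrary examples):

```python
# One Euler-Maruyama path for a scalar SDE dX = alpha(X, phi) dt + sqrt(beta(X, phi)) dW.
# The u_t ~ N(0, h) draws are exactly the auxiliary variates tracked on later slides.
import numpy as np

def euler_maruyama(x0, alpha, beta, phi, h, n_steps, rng):
    x = np.empty(n_steps + 1)
    x[0] = x0
    for t in range(n_steps):
        u = rng.normal(0.0, np.sqrt(h))                              # u_t ~ N(0, h)
        x[t + 1] = x[t] + alpha(x[t], phi) * h + np.sqrt(beta(x[t], phi)) * u
    return x

# Example with Ornstein-Uhlenbeck drift/diffusion: dX = th1*(th2 - X) dt + th3 dW.
alpha = lambda x, phi: phi[0] * (phi[1] - x)
beta  = lambda x, phi: phi[2]**2
path = euler_maruyama(2.0, alpha, beta, phi=(0.5, 2.0, 0.3), h=0.01,
                      n_steps=1000, rng=np.random.default_rng(1))
```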
We approximate the observed data likelihood as

    π_h(y^i | ξ, φ^i) ∝ ∫ ∏_{j=1}^n π(y^i_j | x^i_j(h), ξ) × π(x^i_1) ∏_{j=2}^n π_h(x^i_j | x^i_{j−1}, φ^i) dx^i_{1:n},

but from now on, for simplicity, we stop emphasizing the reference to h.

So we have the Monte Carlo approximation

    π(y^i | ξ, φ^i) = E[ ∏_{j=1}^n π(y^i_j | x^i_j, ξ) ] ≈ (1/N) ∑_{k=1}^N ∏_{j=1}^n π(y^i_j | x^i_{j,k}, ξ),
    x^i_{j,k} ∼ iid π_h(x^i_j | x^i_{j−1}, φ^i),    k = 1, ..., N,

and the last sampling can of course be performed numerically (say by Euler-Maruyama).
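For concreteness, a sketch of this naive estimator (my own illustration, assuming additive Gaussian measurement error and Euler-Maruyama propagation): simulate N independent latent paths forward, with no resampling, and average the products of measurement densities. Its variance degrades quickly as the number of observations grows, which is what motivates the particle filter on the next slides.

```python
# Naive forward-simulation Monte Carlo estimate of log pi(y^i | xi, phi^i).
import numpy as np

def naive_mc_loglik(y, x0, alpha, beta, phi, sigma_eps, h, steps_per_obs, N, rng):
    x = np.full(N, x0)                   # N independent paths, all started at x0
    log_prod = np.zeros(N)               # running log of prod_j pi(y_j | x_{j,k})
    for yj in y:
        for _ in range(steps_per_obs):   # Euler-Maruyama forward simulation
            u = rng.normal(0.0, np.sqrt(h), size=N)
            x = x + alpha(x, phi) * h + np.sqrt(beta(x, phi)) * u
        log_prod += -0.5 * np.log(2 * np.pi * sigma_eps**2) \
                    - 0.5 * (yj - x)**2 / sigma_eps**2
    m = log_prod.max()
    return m + np.log(np.mean(np.exp(log_prod - m)))   # log of (1/N) sum_k prod_j
```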
The efficient way to produce Monte Carlo approximations for
nonlinear time-series observed with error is Sequential Monte
Carlo (aka particle filters).
With SMC the N Monte Carlo draws are called “particles”.
The secret to SMC is to
• “propagate particles x^i_t forward”: x^i_t → x^i_{t+h},
• “weight” the particles proportionally to π(y | x),
• “resample particles according to their weights”. The last operation is essential to let the particles track the observations.

I won't get into details, but the simplest particle filter is the bootstrap filter (Gordon et al. 1993)².

² Useful intro from colleagues in Linköping and Uppsala: Naesseth, Lindsten, Schön. Elements of Sequential Monte Carlo. Foundations and Trends in Machine Learning, 12(3):307–392, 2019.
Estimating the observed data likelihood with the bootstrap filter

An unbiased, non-negative estimate of the data likelihood can be computed with the bootstrap filter SMC method using N particles:

    π̂_{u^i}(y^i | ξ, φ^i) = (1/N^n) ∏_{t=1}^n ∑_{k=1}^N π(y^i_t | x^i_{t,k}, ξ),    i = 1, ..., M.

Recall that for particle k we have

    x^i_{t+h,k} = x^i_{t,k} + α(x^i_{t,k}, φ^i) h + √β(x^i_{t,k}, φ^i) · u^i_{t,k},    u^i_{t,k} ∼ iid N(0, h).

We will soon see that it is important to keep track of the apparently uninteresting u^i_{t,k} variates.
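A minimal sketch of this estimator for one unit (my own illustration, assuming Gaussian measurement error y_t ∼ N(x_t, σ_ε²), Euler-Maruyama propagation and multinomial resampling; in the correlated scheme discussed later, the normal draws would be read from a stored vector u^i rather than generated afresh):

```python
# Bootstrap particle filter estimate of log pi_hat_{u^i}(y^i | xi, phi^i).
import numpy as np

def bootstrap_loglik(y, x0, alpha, beta, phi, sigma_eps, h, steps_per_obs, N, rng):
    x = np.full(N, x0)                     # N particles (all initialised at x0 for simplicity)
    loglik = 0.0
    for yj in y:
        for _ in range(steps_per_obs):     # propagate forward with Euler-Maruyama
            u = rng.normal(0.0, np.sqrt(h), size=N)        # u_{t,k} ~ N(0, h)
            x = x + alpha(x, phi) * h + np.sqrt(beta(x, phi)) * u
        # weight particles by the measurement density pi(y_j | x_j, xi)
        logw = -0.5 * np.log(2 * np.pi * sigma_eps**2) - 0.5 * (yj - x)**2 / sigma_eps**2
        m = logw.max()
        w = np.exp(logw - m)
        loglik += m + np.log(w.mean())     # adds log of (1/N) sum_k of the weights
        # resample particles according to their normalised weights
        idx = rng.choice(N, size=N, p=w / w.sum())
        x = x[idx]
    return loglik
```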
The “Blocked” Gibbs algorithm

Recall:
• random effects φ^i ∼ π(φ^i | η), i = 1, ..., M
• population parameters η ∼ π(η)
• measurement error ξ ∼ π(ξ)
• SMC variates: u^i ∼ g(u^i), i = 1, ..., M.

We found it very important to “block” the generation of the u variates:

1. π(φ^i, u^i | η, ξ, y^i) ∝ π(φ^i | η) π̂_{u^i}(y^i | ξ, φ^i) g(u^i),    i = 1, . . . , M,
2. π(ξ | η, φ, y, u) = π(ξ | φ, y, u) ∝ π(ξ) ∏_{i=1}^M π̂_{u^i}(y^i | ξ, φ^i),
3. π(η | ξ, φ, y, u) = π(η | φ) ∝ π(η) ∏_{i=1}^M π(φ^i | η).

Once the u variables are sampled and accepted in step 1, we reuse them when computing π̂_{u^i}(y^i | ξ, φ^i) in step 2. This gives better performance compared to generating new u in step 2.
In practice it is a Metropolis-Hastings within Gibbs

• Using the approximated likelihood π̂_u, we construct a Metropolis-Hastings within Gibbs algorithm:
  1. π(φ^i, u^i | η, ξ, y^i) ∝ π(φ^i | η) π̂_{u^i}(y^i | ξ, φ^i) g(u^i),    i = 1, . . . , M,
  2. π(ξ | η, φ, y, u) = π(ξ | φ, y, u) ∝ π(ξ) ∏_{i=1}^M π̂_{u^i}(y^i | ξ, φ^i),
  3. π(η | ξ, φ, y, u) = π(η | φ) ∝ π(η) ∏_{i=1}^M π(φ^i | η).

• With this scheme the acceptance probability in the first step is

    min{ 1 , [π(φ^{i∗} | ·) / π(φ^i | ·)] × [π̂_{u^{i∗}}(y^i | φ^{i∗}, ·) / π̂_{u^i}(y^i | φ^i, ·)] × [q(φ^i | φ^{i∗}) / q(φ^{i∗} | φ^i)] }.

• This is often computationally expensive, due to the many particles needed to keep the variance of π̂_{u^{i∗}}(y^i | φ^{i∗}, ·) small. (A sketch of the step-1 update follows.)
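A minimal sketch of the step-1 update for one unit in plain PMMH (my own illustration; `run_filter` stands for the bootstrap-filter estimator sketched earlier, called with freshly generated auxiliary variates, and the symmetric random-walk proposal is an arbitrary choice under which the q-ratio cancels):

```python
# Accept/reject update of (phi^i, u^i) for one unit, using the estimated likelihood.
import numpy as np

def update_phi_i(phi_i, loglik_i, y_i, log_prior, run_filter, step, rng):
    phi_star = phi_i + step * rng.standard_normal(phi_i.shape)  # symmetric RW proposal
    loglik_star = run_filter(y_i, phi_star, rng)                # fresh u* drawn inside
    log_ratio = (log_prior(phi_star) - log_prior(phi_i)) + (loglik_star - loglik_i)
    if np.log(rng.uniform()) < log_ratio:
        return phi_star, loglik_star     # accept proposal and its likelihood estimate
    return phi_i, loglik_i               # reject: keep the current values
```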
CPMMH: correlating the likelihood approximations

Smart idea proposed by Deligiannidis et al. (2018): control instead the variance of the likelihood ratio.

• Let us consider the acceptance probability in step 1

    min{ 1 , [π(φ^{i∗} | ·) / π(φ^i | ·)] × [π̂_{u^{i∗}}(y^i | φ^{i∗}, ·) / π̂_{u^i}(y^i | φ^i, ·)] × [q(φ^i | φ^{i∗}) / q(φ^{i∗} | φ^i)] }.

• The main idea in CPMMH is to induce a positive correlation between π̂_{u^{i∗}}(y^i | φ^{i∗}, ·) and π̂_{u^i}(y^i | φ^i, ·), which reduces the variance of the ratio while using fewer particles in the particle filter.

• The correlation is induced via a Crank–Nicolson proposal (see the sketch below):

    u^{i∗} = ρ · u^{i,(j−1)} + √(1 − ρ²) · ω,    ω ∼ N(0, I_d),    ρ ∈ (0.9, 0.999).
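A minimal sketch of the Crank–Nicolson move above (my own illustration): here u collects all the standard-normal draws consumed by one particle-filter run (the filter scales them by √h internally), so with ρ close to 1 consecutive likelihood estimates share almost all of their randomness and much of their Monte Carlo noise cancels in the acceptance ratio.

```python
# Crank-Nicolson update of the auxiliary variates; preserves the N(0, I) marginals.
import numpy as np

def crank_nicolson_update(u_prev, rho, rng):
    omega = rng.standard_normal(u_prev.shape)              # omega ~ N(0, I)
    return rho * u_prev + np.sqrt(1.0 - rho**2) * omega    # proposed u*
```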
CPMMH: selecting number of particles
• For PMMH (no correlated particles) we selected the number of particles N such that the variance σ²_N of the log-likelihood satisfies σ²_N ≈ 2 at some fixed parameter value.³
• For CPMMH, N is selected such that σ²_N ≈ 2.16²/(1 − ρ²_l), where ρ_l is the estimated correlation between π̂_{u^i}(y^i | ξ, φ^i) and π̂_{u^{i∗}}(y^i | ξ, φ^i)⁴ (see the worked number below).
• A drawback with the CPMMH algorithm is that we have to store the random numbers u = (u^1, . . . , u^M)^T in memory, which can be problematic if we have a very large number of particles N, many subjects, or long time series.

³ Sherlock, Thiery, Roberts, Rosenthal (2015). AoS.
⁴ Choppala, Gunawan, Chen, M.-N. Tran, Kohn, 2016.
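To make the gain concrete: with an estimated correlation ρ_l = 0.99 the target becomes σ²_N ≈ 2.16²/(1 − 0.99²) ≈ 234, i.e. a far noisier (hence far cheaper, fewer-particle) likelihood estimator is acceptable than the σ²_N ≈ 2 required by plain PMMH.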
We want to show how to improve scalability for increasing M.
But first, some illustrative applications.
Applications
Ornstein-Uhlenbeck SDEMEM: model structure
Let us consider the following Ornstein-Uhlenbeck SDEMEM
    Y^i_t = X^i_t + ε^i_t,    ε^i_t ∼ N(0, σ²_ε) independently,    i = 1, ..., 40,
    dX^i_t = θ^i_1 (θ^i_2 − X^i_t) dt + θ^i_3 dW^i_t.

• The random effects φ^i = (log θ^i_1, log θ^i_2, log θ^i_3) follow

    φ^i_j | η ∼ N(µ_j, τ_j⁻¹) independently,    j = 1, . . . , 3,

  where η = (µ1, µ2, µ3, τ1, τ2, τ3).
• This induces a semi-conjugate prior on η. Thus, we have a tractable Gibbs step when updating η.
Ornstein-Uhlenbeck SDEMEM: simulated data
We have M = 40 individuals.
Figure 3: Simulated data from the OU-SDEMEM model (observed trajectories vs time).
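For concreteness, here is how data of this kind might be simulated (my own sketch; the parameter values are illustrative, not necessarily those used in the paper): draw each unit's log-parameters from N(µ_j, 1/τ_j), integrate the OU SDE with Euler-Maruyama, and add measurement noise Y^i_t = X^i_t + ε^i_t.

```python
# Simulate M = 40 units from an OU SDEMEM with lognormal random effects.
import numpy as np

rng = np.random.default_rng(0)
M, n_obs, h, sigma_eps = 40, 200, 0.05, 0.3
mu  = np.array([-0.7, 2.3, -0.7])        # population means of (log th1, log th2, log th3)
tau = np.array([4.0, 10.0, 4.0])         # population precisions
data = []
for i in range(M):
    th = np.exp(rng.normal(mu, 1.0 / np.sqrt(tau)))       # unit-specific (th1, th2, th3)
    x = np.empty(n_obs + 1)
    x[0] = th[1]                                          # start at the OU mean
    for t in range(n_obs):
        x[t + 1] = x[t] + th[0] * (th[1] - x[t]) * h + th[2] * rng.normal(0, np.sqrt(h))
    data.append(x[1:] + rng.normal(0, sigma_eps, n_obs))  # noisy observations Y^i
```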
Ornstein-Uhlenbeck SDEMEM: different inference methods

We compare the following MCMC methods: we always use the outlined Metropolis-within-Gibbs sampler, with the likelihood computed in several flavours:
• “Kalman”: Computing the data likelihood exactly with the
Kalman filter.
• “PMMH”: Estimating the data likelihood with the
bootstrap filter and no correlated likelihoods.
• “CPMMH-099”: Estimating the data likelihood with the
bootstrap filter with correlated likelihoods, with
correlation ρ = 0.99.
• “CPMMH-0999”: same as above, with correlation
ρ = 0.999.
Ornstein-Uhlenbeck SDEMEM: inference results for η

Figure 4: OU SDEMEM: marginal posterior distributions for η = (µ1, µ2, µ3, τ1, τ2, τ3). Almost overlapping lines are: Kalman, PMMH, CPMMH-099; vertical lines mark the ground truth.
Ornstein-Uhlenbeck SDEMEM: comparing efficiency
The ones below are all MH-within-Gibbs algorithms:
Algorithm     ρ      N     CPU (min)   mESS     mESS/min   Rel.
Kalman        -      -     1.23        488.51   396.37     5684.46
PMMH          0      3000  4076.94     450.13   0.11       1
CPMMH-099     0.99   100   200.92      418.22   2.09       19
CPMMH-0999    0.999  50    110.66      323.77   2.93       26.6

Figure 5: OU SDEMEM. Correlation ρ, number of particles N, CPU time (minutes), minimum ESS (mESS), minimum ESS per minute (mESS/min) and relative minimum ESS per minute (Rel.) as compared to naive PMMH. All results are based on 50k iterations of each scheme, and are medians over 5 independent runs of each algorithm on different data sets. We could only produce 5 runs due to the very high computational cost of PMMH.
Tumor growth simulation study
This example is inspired by another publication: there, P. and Forman analyzed real experimental data of tumor growth in mice, using SDEMEMs. However, here, to illustrate our inference method, we use a slightly simpler model.

Figure 6: Source http://www.nature.com/articles/srep04384
Tumor growth SDEMEM: model structure
Let us now consider the following SDEMEM⁵

    Y^i_t = log V^i_t + ε^i_t,    ε^i_t ∼ N(0, σ²_e) independently,
    dX^i_{1,t} = (β^i + (γ^i)²/2) X^i_{1,t} dt + γ^i X^i_{1,t} dW^i_{1,t},
    dX^i_{2,t} = −(δ^i + (ψ^i)²/2) X^i_{2,t} dt + ψ^i X^i_{2,t} dW^i_{2,t}.

• X^i_{1,t}: the volume of surviving tumor cells.
• X^i_{2,t}: the volume of cells “killed by a treatment”.
• V^i_t = X^i_{1,t} + X^i_{2,t}: the total tumor volume.

⁵ P & Forman (2019). Bayesian inference for stochastic differential equation mixed effects models of a tumor xenography study. JRSS-C.
Tumor growth SDEMEM: random effects model
• The random effects φ^i = (log β^i, log γ^i, log δ^i, log ψ^i) follow

    φ^i_j | η ∼ N(µ_j, τ_j⁻¹) independently,    j = 1, . . . , 4,

  where η = (µ1, . . . , µ4, τ1, . . . , τ4).
Tumor growth SDEMEM: simulated data
We assume M = 10 subjects with n = 20 datapoints each.
Figure 7: Simulated data from the tumour growth model (trajectories vs time).
Tumor growth SDEMEM: different inference methods
We use the following inference methods:
• “PMMH”: Estimating the data likelihood with the
bootstrap filter and no correlation in the likelihoods.
• “CPMMH”: Estimating the data likelihood with the
bootstrap filter and inducing correlation in the
likelihoods, with ρ = 0.999.
The Kalman filter cannot be used here, due to the nonlinear observation equation Y_t = log(X_{1,t} + X_{2,t}) + ε_t.
Tumor growth SDEMEM: inference results for η
Figure 8: Marginal posterior distributions for µi and τi, i = 1, . . . , 4. The dotted line shows results from the LNA scheme, the solid line from the CPMMH scheme and the dashed line from the PMMH scheme.
Tumor growth SDEMEM: comparing efficiency
Algorithm   ρ       N    CPU (m)   mESS   mESS/m   Rel.
PMMH        0       30   2963      2559   0.864    1
CPMMH       0.999   10   957       2311   2.415    3

Figure 9: Tumour model. Correlation ρ, number of particles N, CPU time (in minutes, m), minimum ESS (mESS), minimum ESS per minute (mESS/m) and relative minimum ESS per minute (Rel.) as compared to PMMH. All results are based on 500k iterations of each scheme.
Improving scalability for an increasing number of individuals
Obviously, when the number of individuals M increases,
problems emerge...
Recall the blocked Gibbs steps:
1. π(φ^i, u^i | η, ξ, y^i) ∝ π(φ^i | η) π̂_{u^i}(y^i | ξ, φ^i) g(u^i),    i = 1, . . . , M,
2. π(ξ | η, φ, y, u) = π(ξ | φ, y, u) ∝ π(ξ) ∏_{i=1}^M π̂_{u^i}(y^i | ξ, φ^i),
3. π(η | ξ, φ, y, u) = π(η | φ) ∝ π(η) ∏_{i=1}^M π(φ^i | η).

Steps 1 and 2 are the hard ones, because each requires M runs of a particle filter, one for each likelihood π̂_{u^i}(y^i | ξ, φ^i).

Moreover, step 2 involves the product of the individual likelihoods. To keep the variance of this product low, many particles may be needed for the individual terms.
The “trick”
However, co-author Sebastian Persson had the intuition to borrow a trick from the Monolix⁶ software, which is specialised in inference for mixed-effects models.

Quite simply, consider a “perturbation” of the original SDEMEM, where we allow the constant parameter ξ (and possibly other fixed parameters) to vary slightly between subjects as

    ξ^i ∼ N(ξ_pop, δ),    i = 1, ..., M,

where ξ_pop is the original ξ to be inferred.

⁶ https://lixoft.com/products/monolix/
Gibbs for the unperturbed model:
1. π(φ^i, u^i | η, ξ, y^i) ∝ π(φ^i | η) π̂_{u^i}(y^i | ξ, φ^i) g(u^i),    i = 1, . . . , M,
2. π(ξ | η, φ, y, u) = π(ξ | φ, y, u) ∝ π(ξ) ∏_{i=1}^M π̂_{u^i}(y^i | ξ, φ^i),
3. π(η | ξ, φ, y, u) = π(η | φ) ∝ π(η) ∏_{i=1}^M π(φ^i | η).

Now introduce

    ξ^i ∼ N(ξ_pop, δ),    i = 1, ..., M.

Gibbs for the perturbed model:
1. π(φ^i, ξ^i, u^i | η, ξ_pop, y^i) ∝ π(φ^i | η) π(ξ^i | ξ_pop) π̂_{u^i}(y^i | ξ^i, φ^i) g(u^i),    i = 1, . . . , M,
2. π(ξ_pop | ξ^1, . . . , ξ^M) ∝ π(ξ_pop) ∏_{i=1}^M π(ξ^i | ξ_pop),
3. π(η | ξ, φ, y, u) = π(η | φ) ∝ π(η) ∏_{i=1}^M π(φ^i | η).

The expensive Step 2 has disappeared: updating ξ_pop no longer requires running any particle filter. (A sketch of the resulting ξ_pop update is given below.)
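As an illustration of why the new step 2 is cheap (my own sketch, for a scalar ξ with a Gaussian prior; the paper's actual update may differ in the details): with ξ^i ∼ N(ξ_pop, δ) and ξ_pop ∼ N(m_0, s_0²), the conditional of ξ_pop is Gaussian and can be sampled directly, with no particle filtering involved.

```python
# Hypothetical conjugate draw of xi_pop given the unit-level xi^i; delta is the
# perturbation variance and (m0, s0sq) are illustrative prior hyperparameters.
import numpy as np

def update_xi_pop(xi_units, delta, m0, s0sq, rng):
    M = len(xi_units)
    post_var  = 1.0 / (1.0 / s0sq + M / delta)
    post_mean = post_var * (m0 / s0sq + np.sum(xi_units) / delta)
    return rng.normal(post_mean, np.sqrt(post_var))
```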
Notice that step 1 allows each random effect to be targeted separately from the others.

So controlling the variance of each individual likelihood π̂_{u^i}(y^i | ξ^i, φ^i) is much easier than controlling the variance of the “joint likelihood” ∏_{i=1}^M π̂_{u^i}(y^i | ξ, φ^i).
Perturbed and non-perturbed Ornstein-Uhlenbeck
Gibbs on perturbed vs Gibbs on non-perturbed model (OU
model).
Another model (forgot which one)
Gibbs on perturbed vs Gibbs on non-perturbed model.
The perturbation variance
    ξ^i ∼ N(ξ_pop, δ),    i = 1, ..., M.

δ > 0 is a small tuning parameter specified by the user.

We set δ somewhat arbitrarily, but we found that for parameters having magnitude between 1–10 a value of δ = 0.01 worked well.
Try our PEPSDI package
Everything is coded in Julia for efficient inference at
https://github.com/cvijoviclab/PEPSDI
It includes:
• tutorials and notebooks on how to run the package;
• several adaptive MCMC samplers, benchmarked (tl;dr: the best one is Matti Vihola's RAM sampler);
• “guided” particle filters, better suited for informative observations (low measurement error);
• nontrivial case studies (in the paper);
• not just SDEs with mixed effects! Mixed-effects stochastic kinetic models are implemented, and several numerical integrators typical in systems biology are supported (tau leaping, Gillespie).
Thanks to great co-authors!
Marija Cvijovic, Samuel Wiqvist, Sebastian Persson, Andrew Golightly, Ashleigh McLean, Niek Welkenhuysen, Sviatlana Shashkova, Patrick Reith, Gregor Schmidt.
Thank you
@uPicchini
Appendix
CPMMH: updating step

• When correlating the particles, step 1 in the MH-within-Gibbs scheme becomes:

1: For i = 1, . . . , M:
  (1a) Propose φ^{i∗} ∼ q(· | φ^{i,(j−1)}). Draw ω ∼ N(0, I_d) and put u^{i∗} = ρ u^{i,(j−1)} + √(1 − ρ²) ω.
  (1b) Compute π̂_{u^{i∗}}(y^i | ξ^{(j−1)}, φ^{i∗}) by running the particle filter with u^{i∗}, φ^{i∗}, ξ^{(j−1)} and y^i.
  (1c) With probability

      min{ 1 , [π(φ^{i∗} | ·) / π(φ^i | ·)] × [π̂_{u^{i∗}}(y^i | φ^{i∗}, ·) / π̂_{u^i}(y^i | φ^i, ·)] × [q(φ^i | φ^{i∗}) / q(φ^{i∗} | φ^i)] }

  put φ^{i,(j)} = φ^{i∗} and u^{i,(j)} = u^{i∗}. Otherwise, store the current values φ^{i,(j)} = φ^{i,(j−1)} and u^{i,(j)} = u^{i,(j−1)}.
Using stochastic modelling is important!
[This slide refers to the tumor-growth data]
And what if we produced inference using a deterministic model (ODEMEM) while the observations come from a stochastic model? Here follows the estimation of the measurement error variance σ²_e (truth is log σ²_e = −1.6).

The true value is massively overestimated by the ODE-based approach.
Application: neuronal data with informative observations
Figure 11: Depolarization [mV] vs time [sec].

We may focus on what happens between spikes, the so-called inter-spike-interval data (ISIs).
Inter-spike-interval data (ISIs):

Figure 12: Observations from M = 100 ISIs (depolarization [mV] vs time [msec]).

P., Ditlevsen, De Gaetano and Lansky (2008). Parameters of the diffusion leaky integrate-and-fire neuronal model for a slowly fluctuating signal. Neural Computation, 20(11), 2696-2714.
So we have about 1.6 × 10⁵ measurements of membrane potential, across M = 100 units.

Membrane potential dynamics are assumed to be governed by an Ornstein-Uhlenbeck process, observed with error:

    Y^i_t = X^i_t + ε^i_t,    ε^i_t ∼ N(0, σ²_ε) independently,    i = 1, ..., M,
    dX^i_t = (−λ^i X^i_t + ν^i) dt + σ^i dW^i_t.

• ν^i [mV/msec] is the electrical input into the neuron;
• 1/λ^i [msec] is the spontaneous voltage decay (in the absence of input).
In this example the data are informative: the measurement error term is negligible.

Contrary to intuition, having informative observations complicates things from the computational side.

In short: the “particles” propagated forward by the bootstrap filter will have a hard time, since π(y_t | x_t, ·) now has very narrow support. Hence many particles will receive a tiny weight (→ a poorly approximated likelihood).

Solution: at time t, let the particles be “guided forward” towards the next datapoint y_{t+1}. We used the guided scheme in Golightly, A. and Wilkinson, D. J. (2011). Interface Focus, 1(6), 807-820.

With the “guided” particles, N = 1 is sufficient to get good inference (not reported here).
Algorithm      ρ      N   CPU (m)   mESS   mESS/m   Rel.
Kalman         -      -   56        666    12.0     20.0
PMMH           -      1   481       287    0.6      1.0
CPMMH-09       0.9    1   653       381    0.58     1.0
CPMMH-0999     0.999  1   655       326    0.50     0.8

Figure 13: Neuronal model. Correlation ρ, number of particles N, CPU time (in minutes, m), minimum ESS (mESS), minimum ESS per minute (mESS/m), and relative minimum ESS per minute (Rel.) as compared to PMMH. All results are based on 100k iterations of each scheme.
Several adaptive MCMC samplers
We compare ESS and Wasserstein distance (with respect to the true posterior, when available) across several MCMC samplers.