Targeted Learning of Causal Impacts Based on Real
World Data
Mark van der Laan
Division of Biostatistics, UC Berkeley
December 9, 2019
SAMSI Workshop on Causal Inference, Durham
Joint work with Wilson Cai, David Benkeser, Maya Petersen
This research was supported by NIH grant R01 AI074345-09.
Outline
1 Ingredients of Targeted Learning: Causal Framework, Super-learning,
Targeting
2 Highly Adaptive Lasso (HAL) estimator: candidate for Super-Learner
library
3 Targeted Minimum Loss Based Estimation (TMLE)
4 Universal Least favorable submodels for one-step TMLE
5 Example: One-step TMLE of causal impact of point intervention on
survival
6 Group sequential adaptive RCT to learn optimal rule
7 TMLE of Causal Effects of Multiple Time Point Interventions on
Survival
8 Sequential Regression Representation of Treatment Specific Mean
9 TMLE in complex observational study of diabetes (Neugebauer et al.)
10 Software For Targeted Learning
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Based on Real World Data - Mark van der Laan, December 9, 2019
Highly Adaptive Lasso MLE (HAL-MLE)
• This is a maximum likelihood/minimum loss estimator that estimates functionals (e.g., the outcome regression and the propensity score) by approximating them with a linear model in many (≤ n 2^d) tensor-product indicator basis functions, constraining the L1-norm of the coefficient vector, and selecting that bound with cross-validation (vdL, 2015; Benkeser, vdL, 2017).
• Guaranteed to converge to the truth at rate n^{−1/3}(log n)^{d} in sample size n (Bibaut, vdL, 2019): the only assumption is that the true function is right-continuous with left-hand limits (cadlag) and has finite sectional variation norm.
• When used in the super-learner library (or by itself), TMLE (targeted learning) is guaranteed to be consistent, (double robust) asymptotically normal, and efficient: one only needs to assume the strong positivity assumption.
• Undersmoothed HAL-MLE is efficient for smooth functionals.
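To make the basis expansion concrete, here is a minimal numerical sketch of a zero-order HAL-style fit: tensor-product indicator basis functions with knots at the observed data points, followed by an L1-penalized regression. This is not the hal9001 implementation; `hal_basis`, the knot placement, and the fixed `alpha` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def hal_basis(X, knots):
    """Zero-order HAL basis: tensor products of indicators
    I(x_j >= knot_j) over all nonempty subsets of coordinates,
    with one knot per observed data point."""
    n, d = X.shape
    cols = []
    # enumerate nonempty subsets s of {0, ..., d-1} via bitmasks
    for s in range(1, 2 ** d):
        idx = [j for j in range(d) if (s >> j) & 1]
        for k in knots:
            cols.append(np.all(X[:, idx] >= k[idx], axis=1).astype(float))
    return np.column_stack(cols)

rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.uniform(size=(n, d))
y = np.sin(3 * X[:, 0]) + X[:, 1] + rng.normal(scale=0.1, size=n)

H = hal_basis(X, knots=X)          # n * (2^d - 1) basis functions
fit = Lasso(alpha=0.01).fit(H, y)  # L1-norm bound <-> lasso penalty
print(H.shape)                     # (200, 600)
```

Cross-validating the L1 bound (e.g. with `LassoCV`) would complete the analogy to the HAL-MLE described above.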
Example: HAL-MLE of conditional hazard
• Suppose that O = (W, A, T̃ = min(T, C), Δ = I(T ≤ C)), and that we are interested in estimating the conditional hazard λ(t | A, W).
• Let L(λ) be the log-likelihood loss.
• If T is continuous, we could parametrize
λ(t | A, W ) = exp(ψ(t, A, W )), or, if T is discrete,
Logitλ(t | A, W ) = ψ(t, A, W ).
• We can represent ψ = Σ_{s ⊂ {1,...,d}} Σ_j β_{s,j} φ_{s,j} as a linear combination of indicator basis functions, where the L1-norm of β represents the sectional variation norm of ψ.
• Therefore, we can compute the HAL-MLE of λ with either Cox-Lasso
or logistic Lasso regression (glmnet()).
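The glmnet() recipe for the discrete case can be mimicked in a rough Python sketch: expand each subject into person-time rows and fit an L1-penalized pooled logistic regression of the event indicator on the history. All names and the simple hazard form are hypothetical, and a real HAL fit would use the tensor-product indicator bases above rather than the raw covariates:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 300
W = rng.uniform(size=n)
A = rng.integers(0, 2, size=n)
haz = 0.1 + 0.1 * W - 0.05 * A          # toy discrete-time hazard

def draw_time(h, rng, tmax=10):
    """Draw (failure time, event indicator) on {1, ..., tmax}."""
    for t in range(1, tmax + 1):
        if rng.uniform() < h:
            return t, 1
    return tmax, 0

T, D = map(np.array, zip(*[draw_time(h, rng) for h in haz]))

# long ("person-time") format: one row per subject per period at risk
rows, events = [], []
for i in range(n):
    for t in range(1, T[i] + 1):
        rows.append([t, A[i], W[i]])
        events.append(int(t == T[i] and D[i] == 1))
X_long = np.array(rows, dtype=float)
y_long = np.array(events)

# L1-penalized pooled logistic regression of the event indicator:
# a crude stand-in for the HAL-MLE of logit(lambda(t | A, W))
fit = LogisticRegression(penalty="l1", solver="liblinear").fit(X_long, y_long)
```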
Targeted minimum loss based estimation (TMLE)
TMLE
Let D∗(P) be the canonical gradient/efficient influence curve of the target parameter Ψ : M → IR at P ∈ M.
Initial estimator: P_n^0, an initial estimator of P_0. We recommend super-learning.
Targeting of the initial estimator: Construct a so-called least favorable parametric submodel {P_{n,ε}^0 : ε} ⊂ M through P_n^0 so that

d/dε L(P_{n,ε}^0) |_{ε=0} spans the canonical gradient D∗(P_n^0) at P_n^0,

where (e.g.) L(P)(O) = − log p(O) is the log-likelihood loss. Let

ε_n = arg min_ε Σ_i L(P_{n,ε}^0)(O_i)

be the MLE, and set P_n^∗ = P_{n,ε_n}^0.
TMLE of ψ_0: The TMLE of ψ_0 is the plug-in estimator Ψ(P_n^∗).
Solves the optimal estimating equation: P_n D∗(P_n^∗) ≡ (1/n) Σ_i D∗(P_n^∗)(O_i) ≈ 0.
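As a hedged illustration of the targeting step, consider a hypothetical point-treatment example with target E[Y(1)], where the true nuisance functions stand in for super-learner fits. The logistic fluctuation with clever covariate H = A/g(W), its MLE, and the resulting score equation look like:

```python
import numpy as np

def expit(x): return 1.0 / (1.0 + np.exp(-x))
def logit(p): return np.log(p / (1.0 - p))

rng = np.random.default_rng(2)
n = 2000
W = rng.uniform(-1, 1, size=n)
g = expit(0.5 * W)                      # P(A = 1 | W)
A = rng.binomial(1, g)
Y = rng.binomial(1, expit(W + A))       # E[Y | A, W] = expit(W + A)

# initial nuisance fits (here the truth; in practice super-learner output)
QbarA = expit(W + A)                    # Qbar(A, W)
Qbar1 = expit(W + 1)                    # Qbar(1, W)
gn = g
H = A / gn                              # clever covariate of the LFM

# MLE of the fluctuation parameter eps by Newton-Raphson on the
# logistic submodel: logit Qbar_eps = logit QbarA + eps * H
eps = 0.0
for _ in range(25):
    p = expit(logit(QbarA) + eps * H)
    score = np.mean(H * (Y - p))        # d/d(eps) of the mean log-likelihood
    info = np.mean(H ** 2 * p * (1 - p))
    eps += score / info

Qbar1_star = expit(logit(Qbar1) + eps / gn)   # updated fit at A = 1
psi = Qbar1_star.mean()                       # TMLE plug-in estimate

# the efficient influence curve equation P_n D*(P_n^*) ~ 0 is solved
eic = H * (Y - expit(logit(QbarA) + eps * H)) + Qbar1_star - psi
```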
Local least favorable submodel
Let O ∼ P_0 ∈ M. Let Ψ : M → IR be a one-dimensional target parameter, and let D∗(P) be its canonical gradient at P. A 1-d local least favorable submodel {p_ε^{lfm} : ε} satisfies

d/dε log p_ε^{lfm} |_{ε=0} = D∗(P).

Equivalently, the score of an LFM maximizes the Cramer-Rao lower bound over all 1-d parametric submodels {P_ε : ε} through P:

CR(h | P) = lim_{ε→0} (Ψ(P_{ε,h}) − Ψ(P))² / ( −2 P log dP_{ε,h}/dP ).

That is, an LFM has local behavior that maximizes the squared change in the target parameter per unit increase in information/likelihood.
Universal least favorable submodel
We define a 1-d universal least favorable submodel at P as a submodel {P_ε : ε} so that for all ε

d/dε log dP_ε/dP = D∗(P_ε).   (1)

This acts as a local least favorable submodel at any point on its path.
TMLE based on a ULFM is a one-step TMLE
Let P_n^0 be an initial estimator of P_0. Suppose that, given a P ∈ M, we can construct a universal least favorable parametric model {P_ε^{ulfm} : ε ∈ (−a, a)} ⊂ M. Let

ε_n^0 = arg max_ε P_n log dP_{n,ε}^0 / dP_n^0.

Let P_n^1 = P_{n,ε_n^0}^0. Since ε_n^0 is a local maximum, P_n^1 solves its score equation, given by P_n D∗(P_n^1) = 0. That is, the TMLE is given by Ψ(P_n^1).
Universal least favorable submodel for a 1-d target parameter
For ε ≥ 0, we recursively define

p_ε = p exp( ∫_0^ε D∗(P_x) dx ),   (2)

and, for ε < 0, we recursively define

p_ε = p exp( − ∫_ε^0 D∗(P_x) dx ).
Universal LFM in terms of a local LFM
One can also define it in terms of a given local LFM p^{lfm}: for ε > 0 and dε > 0, we have

p_{ε+dε} = p_{ε,dε}^{lfm}.

That is, p_{ε+dε} equals the local LFM {p_{ε,δ}^{lfm} : δ} through p_ε at local value δ = dε. Similarly, we define it for ε < 0.
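A small numerical check of recursion (2) can be run on a toy discrete distribution with target Ψ(P) = E_P[O] and canonical gradient D∗(P)(o) = o − Ψ(P) in the nonparametric model. The Euler step size and the per-step renormalization are implementation choices, not part of the definition:

```python
import numpy as np

o = np.arange(6, dtype=float)        # support {0, ..., 5}
p0 = np.full(6, 1 / 6)               # starting density p

def universal_lfm_path(p, eps, n_steps=10000):
    """Euler discretization of p_eps = p * exp(int_0^eps D*(P_x) dx)
    with D*(P)(o) = o - E_P[O]; renormalizing each step absorbs the
    O(d_eps^2) discretization error (the exact path stays normalized)."""
    d_eps = eps / n_steps
    p = p.copy()
    for _ in range(n_steps):
        mu = float((o * p).sum())            # Psi(P_x) along the path
        p = p * np.exp(d_eps * (o - mu))     # one universal-LFM step
        p = p / p.sum()
    return p

p1 = universal_lfm_path(p0, 0.3)
mu1 = float((o * p1).sum())   # the target parameter has moved along the path
```

The path stays a probability density at every ε, and moving in the positive ε direction tilts the distribution so that Ψ(P_ε) increases, exactly the "maximal change in target per unit likelihood" behavior described above.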
A universal canonical submodel that targets a multidimensional target parameter
Let Ψ(P) = (Ψ(P)(t) : t) be multidimensional (e.g., infinite dimensional). Let D∗(P) = (D∗_t(P) : t) be the vector-valued efficient influence curve. Consider the following recursively defined submodel: for ε ≥ 0, we define

p_ε = p Π_{[0,ε]} ( 1 + {P_n D∗(P_x)}^⊤ D∗(P_x) / ‖P_n D∗(P_x)‖ dx )
    = p exp( ∫_0^ε {P_n D∗(P_x)}^⊤ D∗(P_x) / ‖P_n D∗(P_x)‖ dx ),   (3)

where Π_{[0,ε]} denotes the product integral over [0, ε].
Score is the Euclidean norm of the empirical mean of the vector efficient influence curve
Theorem:
{p_ε : ε ≥ 0} is a family of probability densities; its score at ε is a linear combination of the D∗_t(P_ε), t ∈ τ, and is thus in the tangent space T(P_ε); and we have

d/dε P_n log p_ε = ‖P_n D∗(P_ε)‖.

As a consequence, d/dε P_n L(P_ε) = 0 implies P_n D∗(P_ε) = 0. Under regularity conditions, we also have {p_ε : ε} ⊂ M.
One-step TMLE of a multi-dimensional target parameter
Let p_n^0 ∈ M be an initial estimator of p_0. Let ε_n = arg max_ε P_n log p_ε. Let p_n^∗ = p_{n,ε_n}^0 and ψ_n^∗ = Ψ(P_n^∗). We have

P_n D∗(P_n^∗) = 0.
One-step TMLE of the treatment specific survival curve
We investigated the performance of the one-step TMLE for the treatment specific survival curve based on O = (W, A, T̃ = min(T, C), Δ = I(T ≤ C)).
Data structure
• dynamic treatment intervention: W → d(W)
• S_d(t) is defined by Ψ(P)(t) = E_P[P(T > t | A = d(W), W)]
• Focus on d(W) = 1.
Efficient influence curve
The efficient influence curve for Ψ(P)(t) is (Hubbard et al., 2000)

D∗_t(P) = Σ_{k ≤ t} h_t(g_A, S_{A^c}, S)(k, A, W) { I(T̃ = k, Δ = 1) − I(T̃ ≥ k) λ(k | A = 1, W) } + S(t | A = 1, W) − Ψ(P)(t)
        ≡ D∗_{1,t}(g_A, S_{A^c}, S) + D∗_{2,t}(P),   (4)

where

h_t(g_A, S_{A^c}, S)(k, A, W) = − [ I(A = 1) I(k ≤ t) / { g_A(A = 1 | W) S_{A^c}(k^− | A, W) } ] S(t | A, W) / S(k | A, W).
From local least favorable submodel to universal least favorable submodel
• A local least favorable submodel (LLFM) for S_d(t) around an initial estimator of the conditional hazard:

logit(λ_{n,ε}(· | A = 1, W)) = logit(λ_n(· | A = 1, W)) + ε h_t.   (5)

• Similarly, we obtain a local least favorable submodel for the vector (S_d(t) : t) by extending to the vector (h_t : t).
• These imply, as above, universal least favorable submodels for the single-time-point and multidimensional survival function.
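For concreteness, the clever covariate h_t in (5) can be evaluated as follows. The function signature and the constant-hazard toy inputs are hypothetical; in practice g_A, S_{A^c} and S would come from fitted models:

```python
import numpy as np

def h_t(t, times, A, g1, Sc_minus, S):
    """Clever covariate h_t(g_A, S_{A^c}, S)(k, A, W) for one subject,
    evaluated on a grid `times` of k values, targeting S_d(t) with
    d(W) = 1. g1 = g_A(1 | W); Sc_minus[j] = S_{A^c}(k_j^- | A, W);
    S[j] = S(k_j | A, W). (Hypothetical signature.)"""
    St = S[times == t][0]                # S(t | A, W)
    return -(A == 1) * (times <= t) / (g1 * Sc_minus) * St / S

# toy inputs: constant failure hazard 0.2, constant censoring hazard 0.1
times = np.arange(1, 6)
S = 0.8 ** times                         # S(k | A, W)
Sc_minus = 0.9 ** (times - 1)            # S_{A^c}(k^- | A, W)
h = h_t(3, times, A=1, g1=0.5, Sc_minus=Sc_minus, S=S)
```

As expected from the formula, the covariate is zero after the target time t and negative (inverse-probability weighted) at times k ≤ t.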
Simulations for one-step TMLE of survival curve
We investigated the performance of the one-step TMLE for the treatment specific survival curve in two simulation settings.
Data structure
• O = (W , A, T) ∼ P0
• A ∈ {0, 1}
• treatment intervention: W → d(W ) = 1
• Sd (t) is defined by
Ψ(P)(t) = EP [P (T > t|A = d(W ), W )]
Candidate estimators
1 Kaplan-Meier
2 Iterative TMLE for each single t separately
3 One-step TMLE targeting the whole survival curve S_d
Results
[Figure: Estimated treatment-specific survival curves over time in Settings I and II, comparing KM(A=0), KM(A=1), the truth, the iterative TMLE, the one-step TMLE of the whole curve, and the initial fit.]
Monte-Carlo results (n = 100)
[Figure: Relative efficiency against the iterative TMLE, as a function of t (Setting I), for KM, the initial fit, the one-step survival-curve TMLE, and the iterative TMLE.]
Optimal intervention allocation: “Learn as you go”
Classic randomized trial: longer implementation, higher cost.
Targeted Learning for adaptive trial designs answers:
• Is the intervention effective?
• For whom?
• How much will they benefit?
Learn faster, with fewer patients.
Contextual multiple-bandit problem in computer science
Consider a sequence (W_n, Y_n(0), Y_n(1))_{n≥1} of i.i.d. random variables with common probability distribution P_0^F:
• W_n, the nth context (possibly high-dimensional)
• Y_n(0), the nth reward under action a = 0 (in (0, 1))
• Y_n(1), the nth reward under action a = 1 (in (0, 1))
We consider a design in which one sequentially
• observes the context W_n,
• carries out a randomized action A_n ∈ {0, 1} based on past observations and W_n,
• gets the corresponding reward Y_n = Y_n(A_n) (the other reward is not revealed),
resulting in an ordered sequence of dependent observations O_n = (W_n, A_n, Y_n).
Goal of experiment
We want to estimate
• the optimal treatment allocation/action rule d_0: d_0(W) = arg max_{a=0,1} E_0{Y(a) | W}, which optimizes EY_d over all possible rules d;
• the mean reward under this optimal rule d_0: Ψ(P_0^F) = E_0{Y(d_0(W))};
and we want to
• obtain maximally narrow valid confidence intervals (primary; the “statistical” part);
• minimize the regret (1/n) Σ_{i=1}^n (Y_i − Y_i(d_n)) (secondary; the “bandits” part).
This general contextual multiple-bandit problem has an enormous range of applications: e.g., on-line marketing, recommender systems, randomized clinical trials.
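The "learn as you go" loop can be caricatured in a few lines. This is a toy two-context design with a positivity floor; the stratified empirical means stand in for super-learning of the rule, and all numbers are made up:

```python
import numpy as np

def run_adaptive_design(n=2000, floor=0.1, seed=3):
    """Toy sequential design: randomize towards the arm whose estimated
    stratum-specific mean reward is higher, keeping each arm's
    probability >= floor so that valid inference remains possible."""
    rng = np.random.default_rng(seed)
    true_mean = {0: (0.4, 0.6), 1: (0.7, 0.5)}  # context -> (E[Y(0)|W], E[Y(1)|W])
    sums = np.zeros((2, 2))
    counts = np.ones((2, 2))                    # smoothed to avoid 0/0
    rewards = []
    for _ in range(n):
        W = int(rng.integers(0, 2))             # observe context
        est = sums[W] / counts[W]               # stratified mean rewards
        best = int(est[1] > est[0])             # current estimated rule d_n(W)
        pA1 = (1 - floor) if best == 1 else floor   # adaptive randomization
        A = int(rng.uniform() < pA1)            # carry out randomized action
        Y = float(rng.uniform() < true_mean[W][A])  # observe reward only for A
        sums[W, A] += Y
        counts[W, A] += 1
        rewards.append(Y)
    return float(np.mean(rewards)), counts

avg_reward, counts = run_adaptive_design()
```

The positivity floor is the design analogue of the strong positivity assumption: it keeps the randomization probabilities bounded away from 0 and 1 so that the TMLE of the optimal reward remains estimable with valid confidence intervals.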
Bibliography (non-exhaustive!)
• Sequential designs
  • Thompson (1933), Robbins (1952)
  • specifically in the context of medical trials:
    - Anscombe (1963), Colton (1963)
    - response-adaptive designs: Cornfield et al. (1969), Zelen (1969), and many more since then
• Covariate-adjusted response-adaptive (CARA) designs
  • Rosenberger et al. (2001), Bandyopadhyay and Biswas (2001), Zhang et al. (2007), Zhang and Hu (2009), Shao et al. (2010) . . . typically study
    - convergence of the design . . . in a correctly specified parametric model
  • van der Laan (2008), Chambaz and van der Laan (2013), Zheng, Chambaz and van der Laan (2015), Bibaut et al. (2019) concern
    - convergence of the design to the optimal rule (!), super-learning and HAL-MLE of the optimal rule, and TMLE of the optimal reward, with inference, without (e.g., parametric) assumptions.
General Longitudinal Data Structure
We observe n i.i.d. copies of a longitudinal data structure

O = (L(0), A(0), . . . , L(K), A(K), Y = L(K + 1)),

where A(t) denotes a discrete-valued intervention node, L(t) is an intermediate covariate realized after A(t − 1) and before A(t), t = 0, . . . , K, and Y is a final outcome of interest.
Survival example: For example,
A(t) = (A1(t), A2(t))
A1(t) = I(treated at time t)
A2(t) = I(min(T, C) ≤ t, Δ = 0), the right-censoring indicator process
Δ = I(T ≤ C), the failure indicator
Y(t) = I(min(T, C) ≤ t, Δ = 1), the observed failure indicator process
Y(t) ⊂ L(t), Y = Y(K + 1).
Likelihood and Statistical Model
The probability distribution P_0 of O can be factorized according to the time-ordering as

p_0(O) = Π_{t=0}^{K+1} p_0(L(t) | Pa(L(t))) Π_{t=0}^{K} p_0(A(t) | Pa(A(t)))
       ≡ Π_{t=0}^{K+1} q_{0,L(t)}(O) Π_{t=0}^{K} g_{0,A(t)}(O)
       ≡ q_0 g_0,

where Pa(L(t)) ≡ (L̄(t − 1), Ā(t − 1)) and Pa(A(t)) ≡ (L̄(t), Ā(t − 1)) denote the parents of L(t) and A(t) in the time-ordered sequence, respectively. The g_0-factor represents the intervention mechanism.
Statistical Model: We make no assumptions on q_0, but could make assumptions on g_0.
Statistical Target Parameter: G-computation Formula for the Post-dynamic-Intervention Distribution
• p_0^{g∗}(o) = q_0(o) g∗(o) is the G-computation formula for the post-intervention distribution of O under the stochastic intervention g∗ = Π_{t=0}^{K} g∗_{A(t)}(o).
• In particular, for a dynamic intervention d = (d_t : t = 0, . . . , K), with d_t(L̄(t), Ā(t − 1)) the treatment at time t, the G-computation formula is given by

p_0^d(l) = Π_{t=0}^{K+1} q^d_{0,L(t)}(l̄(t)),   (6)

where q^d_{L(t)}(l̄(t)) = q_{L(t)}(l(t) | l̄(t − 1), Ā(t − 1) = d̄_{t−1}(l̄(t − 1))).
• Let L^d = (L(0), L^d(1), . . . , Y^d = L^d(K + 1)) denote the random variable with probability distribution P^d.
A Sequential Regression G-computation Formula (Bang, Robins, 2005)
• By the iterative conditional expectation rule (tower rule), we have

E_{P^d} Y^d = E( . . . E( E(Y^d | L̄^d(K)) | L̄^d(K − 1)) . . . | L(0)).

• In addition, conditioning on L̄^d(K) is equivalent to conditioning on L̄(K), Ā(K − 1) = d̄_{K−1}(L̄(K − 1)).
In this manner, one can represent E_{P^d} Y^d as an iterated conditional expectation: first take the conditional expectation given L̄^d(K) (equivalently, L̄(K), Ā(K − 1)), then take the conditional expectation given L̄^d(K − 1) (equivalently, L̄(K − 1), Ā(K − 2)), and so on, until the conditional expectation given L(0); finally, take the mean over L(0).
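A minimal sketch of this iterated regression, for two time points, correctly specified linear regressions, and the static rule (A(0), A(1)) = (1, 1); all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
# time-ordered toy data: L(0) -> A(0) -> L(1) -> A(1) -> Y
L0 = rng.normal(size=n)
A0 = rng.binomial(1, 0.5, size=n)
L1 = L0 + A0 + rng.normal(size=n)
A1 = rng.binomial(1, 0.5, size=n)
Y = L1 + A1 + rng.normal(size=n)

def ols_predict(X, y, Xnew):
    """Least-squares fit with intercept, evaluated at Xnew."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta = np.linalg.lstsq(X1, y, rcond=None)[0]
    return np.column_stack([np.ones(len(Xnew)), Xnew]) @ beta

# step 1: regress Y on the full past, then evaluate at A(1) = 1
Q1 = ols_predict(np.column_stack([L0, A0, L1, A1]), Y,
                 np.column_stack([L0, A0, L1, np.ones(n)]))
# step 2: regress the step-1 fit on (L(0), A(0)), evaluate at A(0) = 1
Q0 = ols_predict(np.column_stack([L0, A0]), Q1,
                 np.column_stack([L0, np.ones(n)]))
psi = Q0.mean()   # estimate of E[Y^d] for d = (1, 1); truth here is 2
```

A TMLE would add a targeting (fluctuation) step after each regression; this sketch stops at the plain sequential-regression G-computation estimator.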
TMLE
• A likelihood-based TMLE was developed in van der Laan, Stitelman (2010).
• A sequential regression TMLE Ψ(Q_n^∗) was developed for EY_d in van der Laan, Gruber (2012).
• The latter builds on Bang and Robins (2005) by putting their innovative double robust efficient estimating equation method, which uses sequential clever-covariate regressions to estimate the nuisance parameters of the estimating equation, into a TMLE framework.
• A TMLE for Euclidean summary measures of (EY_d : d ∈ D) defined by marginal structural working models is developed in Petersen et al. (2013).
• A new TMLE (analogous to sequential regression) allowing for continuous-valued monitoring and time-to-event is forthcoming (Rytgaard, van der Laan, 2019).
A real-world CER study comparing different rules for treatment intensification for diabetes
• Data extracted from the diabetes registries of 7 HMO research network sites:
  • Kaiser Permanente
  • Group Health Cooperative
  • HealthPartners
• Enrollment period: Jan 1st 2001 to Jun 30th 2009
Enrollment criteria:
• past A1c < 7% (glucose level) while on 2+ oral agents or basal insulin
• 7% ≤ latest A1c ≤ 8.5% (study entry when glycemia was no longer reined in)
Longitudinal data
• Follow-up until the earliest of Jun 30th 2010, death, health plan disenrollment, or the failure date
• Failure defined as onset/progression of albuminuria (a microvascular complication)
• Treatment is the indicator of being on “treatment intensification” (TI)
• n ≈ 51,000 with a median follow-up of 2.5 years
Impact of SL on IPTW
Back to the TI study...
Impact of machine learning on inference with IPW:
[Figure: IPW-estimated survival P(T > t) (no weight truncation) under the rules d7, d7.5, d8, d8.5, as a function of the number t of 90-day intervals since study entry. With a parametric model for the treatment mechanism: no/weak evidence of a protective effect. With Super Learning (SL-IPTW/SL-TMLE): strong, significant evidence.]
Practical performance
[Figure: Estimated survival P(T_d > t) by quarter t under the rules d7, d7.5, d8, d8.5, for IPW estimator 3 + SL (hazard-based) and for TMLE + SL. The standard errors satisfy 1.07 ≤ σ_{IPW3}/σ_{TMLE} ≤ 1.11.]
tlverse - Targeted Learning software ecosystem in R
• A curated collection of R packages for Targeted Learning
• Shares a consistent underlying philosophy, grammar, and set of data
structures
• Open source
• Designed for generality, usability, and extensibility
• Microwave dinners for machine learning
Targeted Learning
van der Laan & Rose, Targeted Learning: Causal Inference for
Observational and Experimental Data. New York: Springer, 2011.

More Related Content

PPTX
Definition of automation,finite automata,transition system
PPT
Minimization of DFA
PDF
Data Communication & Computer network: Shanon fano coding
PDF
Lect2-SignalProcessing (1).pdf
PPTX
Hamiltonian path
PPTX
graph 2.pptx
KEY
PyCon AU 2012 - Debugging Live Python Web Applications
PPT
Properties of cfg
Definition of automation,finite automata,transition system
Minimization of DFA
Data Communication & Computer network: Shanon fano coding
Lect2-SignalProcessing (1).pdf
Hamiltonian path
graph 2.pptx
PyCon AU 2012 - Debugging Live Python Web Applications
Properties of cfg

What's hot (20)

PPTX
strassen matrix multiplication algorithm
PPTX
Theory of Computation Unit 3
PDF
Operating System : Ch14.tertiary storage structure
DOCX
Palliser Furniture
PPTX
Automata theory -Conversion of ε nfa to nfa
PPT
Ll(1) Parser in Compilers
PDF
Syntax directed translation
PDF
PR422_hyper-deep ensembles.pdf
PPT
System Generator-Tutorial
PPTX
Depth-First Search
PPTX
data structures and its importance
PPT
Introduction to fa and dfa
PPT
Introduction iii
PPT
Z transfrm ppt
PPT
Bode plot & System type
PDF
Las sectas Destructivas y Demoníacas en ESPAÑA.
PPTX
Traveling salesman problem
DOCX
MASM -UNIT-III
PDF
Cs6503 theory of computation book notes
strassen matrix multiplication algorithm
Theory of Computation Unit 3
Operating System : Ch14.tertiary storage structure
Palliser Furniture
Automata theory -Conversion of ε nfa to nfa
Ll(1) Parser in Compilers
Syntax directed translation
PR422_hyper-deep ensembles.pdf
System Generator-Tutorial
Depth-First Search
data structures and its importance
Introduction to fa and dfa
Introduction iii
Z transfrm ppt
Bode plot & System type
Las sectas Destructivas y Demoníacas en ESPAÑA.
Traveling salesman problem
MASM -UNIT-III
Cs6503 theory of computation book notes
Ad

Similar to Causal Inference Opening Workshop - Targeted Learning for Causal Inference Based on Real World Data - Mark van der Laan, December 9, 2019 (20)

PPTX
Lecture 3 for Machine learning in IITIJ
PDF
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
PDF
2019 PMED Spring Course - Introduction to Nonsmooth Inference - Eric Laber, A...
PPT
3 descritive statistics measure of central tendency variatio
PPTX
CABT Math 8 measures of central tendency and dispersion
PPTX
Data analysis
PPTX
Probability distribution Function & Decision Trees in machine learning
PDF
Learning for exploration-exploitation in reinforcement learning. The dusk of ...
PDF
Regression Analysis of Panel Data using Estimation.pdf
PPTX
ders 3.3 Unit root testing section 3 .pptx
PDF
hierarchical-slides random effect with winbugs.pdf
PPTX
Statistics78 (2)
PPT
lecture8.ppt
PPT
Lecture8
PPT
the t test
PPT
Analytic Methods and Issues in CER from Observational Data
PPT
tps5e_Ch10_2.ppt
PDF
Lg ph d_slides_vfinal
PDF
Meta-learning of exploration-exploitation strategies in reinforcement learning
PPTX
panel regression.pptx
Lecture 3 for Machine learning in IITIJ
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
2019 PMED Spring Course - Introduction to Nonsmooth Inference - Eric Laber, A...
3 descritive statistics measure of central tendency variatio
CABT Math 8 measures of central tendency and dispersion
Data analysis
Probability distribution Function & Decision Trees in machine learning
Learning for exploration-exploitation in reinforcement learning. The dusk of ...
Regression Analysis of Panel Data using Estimation.pdf
ders 3.3 Unit root testing section 3 .pptx
hierarchical-slides random effect with winbugs.pdf
Statistics78 (2)
lecture8.ppt
Lecture8
the t test
Analytic Methods and Issues in CER from Observational Data
tps5e_Ch10_2.ppt
Lg ph d_slides_vfinal
Meta-learning of exploration-exploitation strategies in reinforcement learning
panel regression.pptx
Ad

More from The Statistical and Applied Mathematical Sciences Institute (20)

PDF
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
PDF
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
PDF
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
PDF
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
PDF
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
PDF
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
PPTX
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
PDF
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
PPTX
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
PDF
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
PDF
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
PDF
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
PDF
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
PDF
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
PPTX
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
PPTX
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
PDF
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
PDF
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
PDF
2019 GDRR: Blockchain Data Analytics - Modeling Cryptocurrency Markets with T...
PDF
2019 GDRR: Blockchain Data Analytics - Tracking Criminals by Following the Mo...
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...

Causal Inference Opening Workshop - Targeted Learning for Causal Inference Based on Real World Data - Mark van der Laan, December 9, 2019

  • 1. Targeted Learning of Causal Impacts Based on Real World Data
    Mark van der Laan, Division of Biostatistics, UC Berkeley
    December 9, 2019, SAMSI Workshop on Causal Inference, Durham
    Joint work with Wilson Cai, David Benkeser, Maya Petersen
    This research was supported by NIH grant R01 AI074345-09.
  • 2. Outline
    1 Ingredients of Targeted Learning: Causal Framework, Super-learning, Targeting
    2 Highly Adaptive Lasso (HAL) estimator: candidate for Super-Learner library
    3 Targeted Minimum Loss Based Estimation (TMLE)
    4 Universal least favorable submodels for one-step TMLE
    5 Example: One-step TMLE of causal impact of point intervention on survival
    6 Group sequential adaptive RCT to learn optimal rule
    7 TMLE of Causal Effects of Multiple Time Point Interventions on Survival
    8 Sequential Regression Representation of Treatment Specific Mean
    9 TMLE in complex observational study of diabetes (Neugebauer et al.)
    10 Software For Targeted Learning
  • 5. Highly Adaptive Lasso MLE (HAL-MLE)
    • A maximum likelihood / minimum loss estimator that estimates functionals (e.g., the outcome regression and the propensity score) by approximating them with a linear model in many (≤ n·2^d) tensor products of indicator basis functions, constraining the L1-norm of the coefficient vector, and choosing that bound with cross-validation (vdL, 2015; Benkeser, vdL, 2017).
    • Guaranteed to converge to the truth at rate n^{-1/3} (log n)^d in sample size n (Bibaut, vdL, 2019), assuming only that the true function is right-continuous with left-hand limits and has finite sectional variation norm.
    • When used in the super-learner library (or by itself), the TMLE (targeted learning) is guaranteed to be consistent, (doubly robust) asymptotically normal and efficient; one only needs the strong positivity assumption.
    • An undersmoothed HAL-MLE is efficient for smooth functionals.
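The basis expansion on this slide can be made concrete. Below is a minimal sketch, in Python/numpy rather than the R glmnet() workflow the talk uses, of the zero-order tensor-product indicator basis that HAL regresses on; the function name hal_basis and the knot-per-observation placement are illustrative assumptions, not an official implementation. With d = 3 covariates and n = 50 observations the design has n(2^3 − 1) = 350 columns, matching the ≤ n·2^d count above. Fitting is then an ordinary L1-penalized regression on this design matrix, with the L1 bound chosen by cross-validation.

```python
import itertools
import numpy as np

def hal_basis(X):
    """Zero-order HAL design matrix: one indicator basis function
    phi_{s,j}(x) = prod_{k in s} 1{x_k >= X[j, k]} per non-empty
    subset s of covariates and per observation j (one knot per row)."""
    n, d = X.shape
    cols = []
    for r in range(1, d + 1):
        for s in itertools.combinations(range(d), r):
            # (n, n) block: entry [i, j] is phi_{s,j} evaluated at X[i]
            block = np.all(X[:, None, list(s)] >= X[None, :, list(s)], axis=2)
            cols.append(block)
    return np.hstack(cols).astype(float)  # shape (n, n * (2^d - 1))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Phi = hal_basis(X)
```

In practice one would hand Phi to an L1-penalized regression (glmnet in R, or an equivalent lasso solver), which is exactly the HAL-MLE fit the slide describes.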
  • 6. Example: HAL-MLE of conditional hazard
    • Suppose that O = (W, A, T̃ = min(T, C), Δ = I(T ≤ C)), and that we are interested in estimating the conditional hazard λ(t | A, W).
    • Let L(λ) be the log-likelihood loss.
    • If T is continuous, we could parametrize λ(t | A, W) = exp(ψ(t, A, W)); if T is discrete, logit λ(t | A, W) = ψ(t, A, W).
    • We can represent ψ = Σ_{s ⊂ {1,...,d}} Σ_j β_{s,j} φ_{u_{s,j}} as a linear combination of indicator basis functions, where the L1-norm of β represents the sectional variation norm of ψ.
    • Therefore, we can compute the HAL-MLE of λ with either Cox-Lasso or logistic Lasso regression (glmnet()).
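For the discrete-time case, the logistic-Lasso fit operates on a "person-period" long-format dataset: each subject contributes one row per time point k = 1, ..., T̃, with binary outcome I(T̃ = k, Δ = 1). A minimal sketch in plain Python/numpy (an assumption of mine for illustration: covariates here are W only, and the treatment A would simply be one more column; in practice one would hand the expanded design, augmented with the HAL basis, to a lasso-penalized logistic regression such as glmnet()):

```python
import numpy as np

def person_period(t_tilde, delta, W):
    """Expand (T~, Delta, W) into one row per subject-period k = 1..T~,
    with binary outcome I(T~ = k, Delta = 1); a logistic regression of
    this outcome on (k, W, ...) then estimates the discrete hazard."""
    rows, ys = [], []
    for ti, di, wi in zip(t_tilde, delta, W):
        for k in range(1, ti + 1):
            rows.append([k, *np.atleast_1d(wi)])
            ys.append(int(k == ti and di == 1))
    return np.array(rows, dtype=float), np.array(ys)

# three subjects: event at t=3, censored at t=1, event at t=2
X, y = person_period([3, 1, 2], [1, 0, 1], [[0.5], [1.2], [-0.3]])
```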
  • 9. Targeted minimum loss based estimation (TMLE)
  • 10. TMLE
    Let D*(P) be the canonical gradient / efficient influence curve of the target parameter Ψ : M → IR at P ∈ M.
    Initial estimator: P^0_n, an initial estimator of P_0. We recommend super-learning.
    Targeting of initial estimator: Construct a so-called least favorable parametric submodel {P^0_{n,ε} : ε} ⊂ M through P^0_n so that d/dε L(P^0_{n,ε}) |_{ε=0} spans the canonical gradient D*(P^0_n) at P^0_n, where (e.g.) L(P)(O) = −log p(O) is the log-likelihood loss. Let ε_n = arg min_ε Σ_i L(P^0_{n,ε})(O_i) be the MLE, and P*_n = P^0_{n,ε_n}.
    TMLE of ψ_0: The TMLE of ψ_0 is the plug-in estimator Ψ(P*_n).
    Solves optimal estimating equation: P_n D*(P*_n) ≡ (1/n) Σ_i D*(P*_n)(O_i) ≈ 0.
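A minimal numeric sketch of this targeting step for the point-treatment mean Ψ(P) = E[Y_1], a special case of the recipe above (Python, simulated data; the treatment mechanism g is taken as known and the initial Q̄ is deliberately crude, both assumptions made for the illustration). The least favorable submodel is the logistic fluctuation with clever covariate H = A/g(W), and the MLE ε_n is found by Newton steps; at the fitted ε the efficient-influence-curve equation P_n H (Y − Q̄*) = 0 holds, and the TMLE is the plug-in mean of the targeted Q̄*(1, W).

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
n = 5000
W = rng.normal(size=n)
g = expit(0.5 * W)                      # treatment mechanism P(A=1|W), known here
A = rng.binomial(1, g)
Y = rng.binomial(1, expit(A + W))

Qbar = np.full(n, Y[A == 1].mean())     # deliberately crude initial fit of E[Y|A=1,W]
H = A / g                               # clever covariate for Psi(P) = E[Y_1]

# targeting: logit(Qbar_eps) = logit(Qbar) + eps * H, with eps the MLE (Newton)
eps = 0.0
for _ in range(50):
    Qe = expit(np.log(Qbar / (1 - Qbar)) + eps * H)
    score = np.sum(H * (Y - Qe))        # derivative of the log-likelihood in eps
    if abs(score) < 1e-8:
        break
    eps += score / np.sum(H**2 * Qe * (1 - Qe))

Qobs = expit(np.log(Qbar / (1 - Qbar)) + eps * H)   # targeted fit at observed A
Q1 = expit(np.log(Qbar / (1 - Qbar)) + eps / g)     # targeted Qbar(1, W)
psi = Q1.mean()                                      # plug-in TMLE of E[Y_1]
```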
  • 11. Local least favorable submodel
    Let O ~ P_0 ∈ M. Let Ψ : M → IR be a one-dimensional target parameter, and let D*(P) be its canonical gradient at P.
    A 1-d local least favorable submodel (LFM) {p^{lfm}_ε : ε} satisfies
      d/dε log p^{lfm}_ε |_{ε=0} = D*(P).
    Equivalently, the score of an LFM maximizes the Cramer-Rao lower bound over all 1-d parametric submodels {P_{ε,h} : ε} through P:
      CR(h | P) = lim_{ε→0} (Ψ(P_{ε,h}) − Ψ(P))^2 / (−2 P log dP_{ε,h}/dP).
    That is, an LFM has a local behavior that maximizes the square change in the target parameter per unit increase in information/likelihood.
  • 13. Universal least favorable submodel
    We define a 1-d universal least favorable submodel at P as a submodel {P_ε : ε} such that for all ε,
      d/dε log p_ε = D*(P_ε).   (1)
    This acts as a local least favorable submodel at any point on its path.
  • 14. TMLE based on ULFM is a one-step TMLE
    Let P^0_n be an initial estimator of P_0. Suppose that, given a P ∈ M, we can construct a universal least favorable parametric model {P^{ulfm}_ε : ε ∈ (−a, a)} ⊂ M.
    Let ε^0_n = arg max_ε P_n log (dP^0_{n,ε}/dP^0_n), and let P^1_n = P^0_{n,ε^0_n}.
    Since ε^0_n is a local maximum, P^1_n solves its score equation, given by P_n D*(P^1_n) = 0. That is, the TMLE is given by Ψ(P^1_n).
  • 15. Universal least favorable submodel for 1-d target parameter
    For ε ≥ 0, we recursively define
      p_ε = p exp( ∫_0^ε D*(P_x) dx ),   (2)
    and, for ε < 0, we recursively define
      p_ε = p exp( −∫_ε^0 D*(P_x) dx ).
  • 16. Universal LFM in terms of local LFM
    One can also define it in terms of a given local LFM p^{lfm}: for ε > 0 and dε > 0, we have p_{ε+dε} = p^{lfm}_{ε,dε}. That is, p_{ε+dε} equals the local LFM {p^{lfm}_{ε,δ} : δ} through p_ε, evaluated at the local value δ = dε. Similarly, we define it for ε < 0.
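A toy numeric illustration (Python; not part of the slides) of following the least favorable path by repeated small local tilts. Take the simple target Ψ(P) = E_P Y on a discrete support, where D*(P)(y) = y − Ψ(P). Each update tilts p_ε by exp(δ·D*(P_ε)) with a Newton-chosen step δ, and the path is followed until the empirical score P_n D*(P_ε) vanishes; at that point the plug-in equals the empirical mean, the solution of the efficient-influence-curve equation, as the one-step TMLE theory says it should. The support, initial density, and data are all made-up toy values.

```python
import numpy as np

y_grid = np.arange(10.0)                    # discrete support for Y
p = np.full(10, 0.1)                        # initial density estimate p
data = np.array([6., 7., 5., 8., 6., 7.])  # observations on the grid

def psi(q):
    return float(y_grid @ q)                # target Psi(P) = E_P Y

# follow the least favorable path by small tilts exp(delta * D*(P_eps)),
# D*(P)(y) = y - Psi(P), until the empirical score Pn D*(P_eps) is zero
for _ in range(200):
    score = data.mean() - psi(p)            # Pn D*(P_eps)
    if abs(score) < 1e-10:
        break
    Dstar = y_grid - psi(p)
    var = float(Dstar**2 @ p)               # Var_p(Y): Newton step size
    p = p * np.exp((score / var) * Dstar)
    p = p / p.sum()                         # keep p a density
```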
  • 17. A universal canonical submodel that targets a multidimensional target parameter
    Let Ψ(P) = (Ψ(P)(t) : t) be multidimensional (e.g., infinite dimensional). Let D*(P) = (D*_t(P) : t) be the vector-valued efficient influence curve.
    Consider the following recursively defined submodel: for ε ≥ 0, we define
      p_ε = p Π_{[0,ε]} ( 1 + {P_n D*(P_x)}^⊤ D*(P_x) / ‖P_n D*(P_x)‖ dx )
          = p exp( ∫_0^ε {P_n D*(P_x)}^⊤ D*(P_x) / ‖P_n D*(P_x)‖ dx ).   (3)
  • 18. Score is Euclidean norm of empirical mean of vector efficient influence curve
    Theorem: {p_ε : ε ≥ 0} is a family of probability densities; its score at ε is a linear combination of D*_t(P_ε) for t ∈ τ, and is thus in the tangent space T(P_ε); and we have
      d/dε P_n log p_ε = ‖ P_n D*(P_ε) ‖.
    As a consequence, d/dε P_n L(P_ε) = 0 implies P_n D*(P_ε) = 0.
    Under regularity conditions, we also have {p_ε : ε} ⊂ M.
  • 19. One-step TMLE of multi-dimensional target parameter
    Let p^0_n ∈ M be an initial estimator of p_0. Let ε_n = arg max_ε P_n log p^0_{n,ε}. Let p*_n = p^0_{n,ε_n} and ψ*_n = Ψ(P*_n). We have P_n D*(P*_n) = 0.
  • 21. One-step TMLE of treatment specific survival curve
    We investigated the performance of the one-step TMLE for the treatment-specific survival curve based on O = (W, A, T̃ = min(T, C), Δ = I(T ≤ C)).
    Data structure:
    • dynamic treatment intervention: W → d(W)
    • S_d(t) is defined by Ψ(P)(t) = E_P[P(T > t | A = d(W), W)]
    • Focus on d(W) = 1.
  • 22. Efficient influence curve
    The efficient influence curve for Ψ(P)(t) is (Hubbard et al., 2000)
      D*_t(P) = Σ_{k ≤ t} h_t(g_A, S_{Ac}, S)(k, A, W) { I(T̃ = k, Δ = 1) − I(T̃ ≥ k) λ(k | A = 1, W) } + S(t | A = 1, W) − Ψ(P)(t)
              ≡ D*_{1,t}(g_A, S_{Ac}, S) + D*_{2,t}(P),   (4)
    where
      h_t(g_A, S_{Ac}, S)(k, A, W) = − I(A = 1) I(k ≤ t) / { g_A(A = 1 | W) S_{Ac}(k− | A, W) } × S(t | A, W) / S(k | A, W).
  • 23. From local least favorable submodel to universal least favorable submodel
    • A local least favorable submodel (LLFM) for S_d(t) around an initial estimator of the conditional hazard:
      logit(λ_{n,ε}(· | A = 1, W)) = logit(λ_n(· | A = 1, W)) + ε h_t.   (5)
    • Similarly, we obtain a local least favorable submodel for the vector (S_d(t) : t) by using the vector of clever covariates (h_t : t).
    • These imply, as above, universal least favorable submodels for the single- and multidimensional survival function.
  • 24. Simulations for one-step TMLE of survival curve
    We investigated the performance of the one-step TMLE for the treatment-specific survival curve in two simulation settings.
    Data structure:
    • O = (W, A, T) ~ P_0
    • A ∈ {0, 1}
    • treatment intervention: W → d(W) = 1
    • S_d(t) is defined by Ψ(P)(t) = E_P[P(T > t | A = d(W), W)]
  • 25. Candidate estimators
    1 Kaplan-Meier
    2 Iterative TMLE for each single t separately
    3 One-step TMLE targeting the whole survival curve S_d
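The Kaplan-Meier comparator in this list is simple enough to sketch in a few lines of Python/numpy (a minimal product-limit implementation over the distinct observed event times; function name is illustrative):

```python
import numpy as np

def kaplan_meier(t_tilde, delta):
    """Kaplan-Meier: S(t) = prod over event times k <= t of (1 - d_k / n_k),
    with d_k events and n_k subjects at risk at time k."""
    t_tilde = np.asarray(t_tilde)
    delta = np.asarray(delta)
    times = np.unique(t_tilde[delta == 1])  # distinct observed event times
    s, surv = 1.0, []
    for t in times:
        n_k = np.sum(t_tilde >= t)                   # at risk just before t
        d_k = np.sum((t_tilde == t) & (delta == 1))  # events at t
        s *= 1 - d_k / n_k
        surv.append(s)
    return times, np.array(surv)

# without censoring, KM reduces to the empirical survival function
times, surv = kaplan_meier([1, 2, 3, 4], [1, 1, 1, 1])
```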
  • 26. Results
    Figure: estimated treatment-specific survival curves over time in Settings I and II, comparing KM (A=0), KM (A=1), the truth, the iterative TMLE, the one-step TMLE of the whole curve, and the initial fit.
  • 27. Monte-Carlo results (n = 100)
    Figure: relative efficiency against the iterative TMLE, as a function of t (Setting I), for KM, the initial fit, the one-step survival-curve TMLE, and the iterative TMLE.
  • 29. Optimal intervention allocation: “Learn as you go”
    Classic randomized trial: longer implementation, higher cost.
    Targeted Learning for adaptive trial designs answers:
    ✓ Is the intervention effective?
    ✓ For whom?
    ✓ How much will they benefit?
    Learn faster, with fewer patients.
  • 30. Contextual multiple-bandit problem in computer science
    Consider a sequence (Wn, Yn(0), Yn(1))_{n≥1} of i.i.d. random variables with common probability distribution P^F_0:
    • Wn, the nth context (possibly high-dimensional)
    • Yn(0), the nth reward under action a = 0 (in (0, 1))
    • Yn(1), the nth reward under action a = 1 (in (0, 1))
    We consider a design in which one sequentially
    • observes the context Wn,
    • carries out a randomized action An ∈ {0, 1} based on past observations and Wn,
    • gets the corresponding reward Yn = Yn(An) (the other reward is not revealed),
    resulting in an ordered sequence of dependent observations On = (Wn, An, Yn).
  • 31. Goal of experiment
    We want to estimate
    • the optimal treatment allocation/action rule d_0: d_0(W) = arg max_{a=0,1} E_0{Y(a) | W}, which optimizes E Y_d over all possible rules d;
    • the mean reward under this optimal rule d_0: Ψ(P^F_0) = E_0{Y(d_0(W))};
    and we want to
    • obtain maximally narrow valid confidence intervals (primary),
    • minimize the regret (1/n) Σ_{i=1}^n (Y_i − Y_i(d_n)) (secondary):
    together, “statistical bandits.”
    This general contextual multiple-bandit problem has an enormous range of applications: e.g., on-line marketing, recommender systems, randomized clinical trials.
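The sequential structure above can be sketched with a deliberately simple design (Python). This is an ε-greedy allocation with a binary context, NOT the CARA/TMLE design of the cited work: the arm-mean table, exploration rate, and horizon are all illustrative assumptions, and the point is only the observe-context / randomize-action / reveal-one-reward loop and the regret bookkeeping.

```python
import numpy as np

rng = np.random.default_rng(2)
true_mean = [(0.3, 0.6), (0.7, 0.4)]    # hypothetical E[Y(a)|W=w]; d0 = (1, 0)
counts = np.zeros((2, 2))
sums = np.zeros((2, 2))
explore, n, regret = 0.2, 5000, 0.0

for i in range(n):
    w = int(rng.integers(2))                      # observe context W_i
    est = sums[w] / np.maximum(counts[w], 1)      # current arm-mean estimates
    if rng.random() < explore:
        a = int(rng.integers(2))                  # randomize (exploration)
    else:
        a = int(np.argmax(est))                   # follow the estimated rule
    y = float(rng.binomial(1, true_mean[w][a]))   # reward; Y_i(1-a) stays hidden
    counts[w, a] += 1
    sums[w, a] += y
    regret += max(true_mean[w]) - true_mean[w][a]

# learned rule: the empirically best arm per context
d_hat = [int(np.argmax(sums[w] / np.maximum(counts[w], 1))) for w in (0, 1)]
```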
  • 32. Bibliography (non-exhaustive!)
    • Sequential designs: Thompson (1933), Robbins (1952); specifically in the context of medical trials: Anscombe (1963), Colton (1963); response-adaptive designs: Cornfield et al. (1969), Zelen (1969), and many more since then.
    • Covariate-adjusted response-adaptive (CARA) designs: Rosenberger et al. (2001), Bandyopadhyay and Biswas (2001), Zhang et al. (2007), Zhang and Hu (2009), Shao et al. (2010), ... These typically study convergence of the design in a correctly specified parametric model.
    • van der Laan (2008), Chambaz and van der Laan (2013), Zheng, Chambaz and van der Laan (2015), and Bibaut et al. (2019) concern convergence of the design to the optimal rule (!), super-learning and HAL-MLE of the optimal rule, and TMLE of the optimal reward, with inference, without (e.g., parametric) assumptions.
  • 34. General Longitudinal Data Structure
    We observe n i.i.d. copies of a longitudinal data structure
      O = (L(0), A(0), ..., L(K), A(K), Y = L(K + 1)),
    where A(t) denotes a discrete-valued intervention node, L(t) is an intermediate covariate realized after A(t − 1) and before A(t), t = 0, ..., K, and Y is a final outcome of interest.
    Survival example: A(t) = (A1(t), A2(t)), with
      A1(t) = I(treated at time t)
      A2(t) = I(min(T, C) ≤ t, Δ = 0), the right-censoring indicator process
      Δ = I(T ≤ C), the failure indicator
      Y(t) = I(min(T, C) ≤ t, Δ = 1), the observed-failure indicator process, Y(t) ⊂ L(t)
      Y = Y(K + 1).
  • 35. Likelihood and Statistical Model
    The probability distribution P_0 of O can be factorized according to the time-ordering as
      p_0(O) = Π_{t=0}^{K+1} p_0(L(t) | Pa(L(t))) × Π_{t=0}^{K} p_0(A(t) | Pa(A(t)))
             ≡ Π_{t=0}^{K+1} q_{0,L(t)}(O) × Π_{t=0}^{K} g_{0,A(t)}(O)
             ≡ q_0 g_0,
    where Pa(L(t)) ≡ (L̄(t − 1), Ā(t − 1)) and Pa(A(t)) ≡ (L̄(t), Ā(t − 1)) denote the parents of L(t) and A(t) in the time-ordered sequence, respectively.
    The g_0-factor represents the intervention mechanism.
    Statistical model: we make no assumptions on q_0, but could make assumptions on g_0.
  • 36. Statistical Target Parameter: G-computation Formula for Post-dynamic-Intervention Distribution
    • p^{g*}_0(o) = q_0(o) g*(o) is the G-computation formula for the post-intervention distribution of O under the stochastic intervention g* = Π_{t=0}^{K} g*_{A(t)}(O).
    • In particular, for a dynamic intervention d = (d_t : t = 0, ..., K), with d_t(L̄(t), Ā(t − 1)) the treatment at time t, the G-computation formula is given by
        p^d_0(l) = Π_{t=0}^{K+1} q^d_{0,L(t)}(l̄(t)),   (6)
      where q^d_{L(t)}(l̄(t)) = q_{L(t)}(l(t) | l̄(t − 1), Ā(t − 1) = d̄_{t−1}(l̄(t − 1))).
    • Let L^d = (L(0), L^d(1), ..., Y^d = L^d(K + 1)) denote the random variable with probability distribution P^d.
  • 38. A Sequential Regression G-computation Formula (Bang, Robins, 2005)
    • By the iterative conditional expectation rule (tower rule), we have
        E_{P^d} Y^d = E( ... E( E(Y^d | L̄^d(K)) | L̄^d(K − 1)) ... | L(0)).
    • In addition, conditioning on L̄^d(K) is equivalent to conditioning on L̄(K), Ā(K − 1) = d̄_{K−1}(L̄(K − 1)). In this manner, one can represent E_{P^d} Y^d as an iterated conditional expectation: first take the conditional expectation given L̄^d(K) (equivalently, L̄(K), Ā(K − 1)); then take the conditional expectation given L̄^d(K − 1) (equivalently, L̄(K − 1), Ā(K − 2)); and so on, until the conditional expectation given L(0); finally, take the mean over L(0).
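The iterated-expectation recipe can be sketched numerically for K = 1 (two binary treatment nodes, both randomized here so the estimand is easy to check; the simulated data-generating constants are hypothetical). Each "regression" below is a saturated stratum-mean fit among the rows that followed the static rule d = (1, 1), then predicted for everyone, exactly mirroring the backward recursion above.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# hypothetical two-time-point data; both treatments randomized
L0 = rng.binomial(1, 0.5, n)
A0 = rng.binomial(1, 0.5, n)
L1 = rng.binomial(1, 0.3 + 0.4 * A0, n)
A1 = rng.binomial(1, 0.5, n)
Y = rng.binomial(1, 0.1 + 0.3 * A0 + 0.2 * A1 + 0.2 * L1, n)

# step 1: regress Y on (L0, L1) among rows with A0 = A1 = 1, predict for all
on_rule = (A0 == 1) & (A1 == 1)
q2 = {(l0, l1): Y[on_rule & (L0 == l0) & (L1 == l1)].mean()
      for l0 in (0, 1) for l1 in (0, 1)}
Q2 = np.array([q2[(l0, l1)] for l0, l1 in zip(L0, L1)])

# step 2: regress Q2 on L0 among rows with A0 = 1, predict for all
q1 = {l0: Q2[(A0 == 1) & (L0 == l0)].mean() for l0 in (0, 1)}
Q1 = np.array([q1[l0] for l0 in L0])

psi = Q1.mean()  # sequential-regression estimate of E Y^{(1,1)}
# truth in this simulation: 0.1 + 0.3 + 0.2 + 0.2 * 0.7 = 0.74
```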
  • 39. TMLE
    • A likelihood-based TMLE was developed in van der Laan, Stitelman (2010).
    • A sequential regression TMLE Ψ(Q*_n) for E Y_d was developed in van der Laan, Gruber (2012).
    • The latter builds on Bang and Robins (2005) by putting their innovative double robust efficient estimating equation method, which uses sequential clever-covariate regressions to estimate the nuisance parameters of the estimating equation, into a TMLE framework.
    • A TMLE for Euclidean summary measures of (E Y_d : d ∈ D) defined by marginal structural working models is developed in Petersen et al. (2013).
    • A new (analogous to sequential regression) TMLE allowing for continuous-valued monitoring and time till event is coming (Rytgaard, van der Laan, 2019).
  • 41. A real-world CER study comparing different rules for treatment intensification for diabetes
    • Data extracted from the diabetes registries of 7 HMO research network sites, including Kaiser Permanente, Group Health Cooperative, and HealthPartners.
    • Enrollment period: Jan 1st, 2001 to Jun 30th, 2009.
    • Enrollment criteria:
      - past A1c < 7% (glucose level) while on 2+ oral agents or basal insulin
      - 7% ≤ latest A1c ≤ 8.5% (study entry when glycemia was no longer reined in)
  • 42. Longitudinal data
    • Follow-up until the earliest of Jun 30th, 2010, death, health plan disenrollment, or the failure date.
    • Failure defined as onset/progression of albuminuria (a microvascular complication).
    • Treatment is the indicator of being on “treatment intensification” (TI).
    • n ≈ 51,000, with a median follow-up of 2.5 years.
  • 43. Impact of SL on IPTW
    Back to the TI study: the impact of machine learning on inference with IPW.
    Figure: IPW estimates of survival P(T > t) (no weight truncation) by the number t of 90-day intervals since study entry, under rules d7, d7.5, d8, d8.5. With a parametric model for the treatment mechanism: no/weak evidence of a protective effect. With Super Learning: strong significant evidence.
  • 44. SL-IPTW/SL-TMLE practical performance
    Figure: P(T_d > t) by quarter t under rules d7, d7.5, d8, d8.5, for the hazard-based IPW estimator 3 with SL and the TMLE with SL. The standard errors satisfy 1.07 ≤ σ_{IPW3}/σ_{TMLE} ≤ 1.11.
  • 46. tlverse - Targeted Learning software ecosystem in R
    • A curated collection of R packages for Targeted Learning
    • Shares a consistent underlying philosophy, grammar, and set of data structures
    • Open source
    • Designed for generality, usability, and extensibility
    • Microwave dinners for machine learning
  • 47. Targeted Learning
    van der Laan & Rose, Targeted Learning: Causal Inference for Observational and Experimental Data. New York: Springer, 2011.