Causal inference for complex exposures: asking questions that matter, getting answers that help.

@EpiEllie
Causal inference for complex
data: Asking questions that
matter, getting answers that
help
Eleanor Murray, ScD, MPH
Department of
Epidemiology
University of Minnesota
December 3, 2021

Epidemiology is:
Learning who has which health
problems where, and
Figuring out what to do to
change that
We can’t achieve these goals
unless we ASK the right
questions, and ESTIMATE
useful answers
Changing the public’s health requires
understanding and estimating causal
effects!

So how do we estimate causal
effects?
Miguel Hernán’s two-step causal algorithm:
1. Ask good questions
2. Answer them with appropriate methods

How do we ask good questions?
Start with a clear causal question:
What is the exact exposure(s) of interest?
What is the exact comparison group(s) of
interest?
What is the exact outcome?
Who do we want to learn about?

Relationships are complicated—
especially in epidemiology

Defining exposure is often hard…
Jiang et al 2020

…and measuring exposure can be even
harder
Measurement error for depression
Measurement error for PTSD
Measurement error for suicid
Jiang et al 2020

Complex exposures give us complicated answers, even if we can intervene…
• Do we always need to give all parts of the intervention?
• Does the timing of the intervention or pieces matter?
• How would the intervention work in groups with other types of usual care/comparators?

What makes an exposure complex?
Multiple components that make it hard to define or measure
• e.g. race, socio-economic position, cognitive behavioral therapy
Interference between individuals
• e.g. infections, behaviors & habits, education
Exposures that vary over space
• e.g. air pollution, access to goods & services
Exposures that vary over time
• e.g. medication usage, unhealthy habits
Simple exposures that could occur at any time

Feedback loops exist, but they aren’t
really loops

Instead, they represent a recurring
sequence over time

Often, we have to decide where in the
loop to start

… so what about when we can’t* intervene?
We need to be even more careful about the questions we ask!
• We need well-defined causal questions!
* whether because of ethical, logistical, financial, or even time constraints

Why are well-defined causal questions
important for complex exposures?
When there are multiple possible ‘interventions’
and we don’t specify one, our answer is a
weighted average of all ‘interventions’ but we
don’t know the weights
Murray, 2016. Agent-based models for causal inference. Harvard
University.
Murray, et al. 2019. Medical Decision Making
We call this the “WACE”:
weighted average causal effect
Useful for estimating effects of
biomarkers in a defined
population
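A minimal numeric sketch of why the unknown weights matter. The version names, effect sizes, and weights below are invented for illustration and do not come from the cited papers; the point is only that the same version-specific effects can average to very different answers.

```python
# Hypothetical effects (risk differences) of three versions of an
# ill-defined "intervention". All numbers are made up for illustration.
version_effects = {"version_a": -0.10, "version_b": 0.00, "version_c": 0.05}


def weighted_average_causal_effect(effects, weights):
    """Average the version-specific effects using the given weights."""
    total = sum(weights.values())
    return sum(effects[v] * weights[v] / total for v in effects)


# Two possible mixes of versions; in practice the analyst does not know
# which mix (if either) generated the data.
mix_1 = {"version_a": 0.7, "version_b": 0.2, "version_c": 0.1}
mix_2 = {"version_a": 0.1, "version_b": 0.2, "version_c": 0.7}

wace_1 = weighted_average_causal_effect(version_effects, mix_1)
wace_2 = weighted_average_causal_effect(version_effects, mix_2)
print(round(wace_1, 3), round(wace_2, 3))
```

With these made-up numbers the two mixes give averages of opposite sign, so an estimate with unknown weights cannot tell us which version to implement.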

Why are well-defined causal questions
important for complex exposures?
Worse, if the ‘intervention’ is ill-defined, the
confounding is probably also ill-defined!
Murray, 2016. Agent-based models for causal inference. Harvard University.
Murray, et al. 2019. Medical Decision Making [in press]

Asking questions that matter for complex exposures: target trial framework
• What is the intervention you would do if you could do an experiment?
• How would that experiment help you make treatment, policy, or other decisions?
Are we asking a specific enough question to get an answer we can understand and act upon?

The best questions lead to
action
What would the difference in total
income be for all tipped workers:
◦ if all tipped workers had been women
versus
◦ if all those same tipped workers had
been men?
How do we act upon the answer to
that question?
Does gender
cause poverty
for tipped
workers?

The best questions lead to action
What would the difference in total
income be for women who work for
tips:
◦ if all women received the amount of tips
they typically receive versus
◦ if those same women received the
amount of tips that men typically
receive?
This question could help us plan
policy interventions
If women who worked
for tipped wages,
received the same
amount of tips as their
male colleagues, would
their poverty risk
decrease?
Example inspired by research by Dr Sarah
Andrea & colleagues

The best questions lead to action
What would the death rates be from
COVID-19:
◦ if all individuals who became infected
had been people of color, versus
◦ if all those same individuals had been
white people?
How do we act upon the answer to
that question?
Does race cause
death from
COVID-19?

The best questions lead to action
What would the difference in
deaths from COVID-19 have been:
◦ if all people of color who were
infected had the probability of dying
observed in 2020-21, versus
◦ if those same people of color had
instead had the same probability of
dying as white people who were
infected?
This question could help us plan
policy interventions
If people of color in
America had experienced
the same rate of death
from COVID-19 as white
Americans, how many
more people would be
alive?
Example inspired by research by Dr Justin
Feldman & colleagues

What do we want to know?
How would things have changed if the
world had been slightly different?
Treat now
Treat later

If we can’t have a time machine, we’d like to
have a randomized trial.
Treat now
Treat later

Many decisions need to be made
NOW
A randomized trial would, in principle, answer
these questions …
… But we don’t always have randomized trials for
many reasons
Deferring a decision is not an option
No decision = decision: “Keep status quo”
Worse: in reality, even perfect randomized
trials are hard!

Luckily, we have a solution that’s as old
as epi!

If we can’t have a randomized trial, we’d like
to emulate what would have happened if we
could have done one.
Treat now
Treat later

Emulate with randomized trial data
The target trial framework helps us avoid
biases caused by:
Informative censoring
Non-random non-adherence
Competing events
Poorly defined causal questions
Generalizability & transportability problems
Incorrect interpretations of trial results

What do we mean by “Target Trial”?

Target trial framework also helps us identify our causal estimand (i.e. the target parameter)
1. Intention-to-treat effects
• Effect of randomization to treatment
2. Per-protocol effects: effect of treatment
• Effect of initiating treatment
• Effect of adhering to treatment protocol
• Effect of receiving point intervention, among the ‘compliers’ (not necessarily all adherers!)
Hernán & Robins, 2016. N
Available in randomized trials only
Available in randomized trials *and* observational studies

The target trial framework
clarifies that these studies aren’t
asking the same question
because the causal estimands are
different!
Why does the estimand matter?

Emulate with observational data
But if we don’t have a trial, we can also emulate the target trial with observational data
Problem: observational data is hard to analyze
• Who should we include?
• When did baseline start?
• Why did exposure happen (or not)?

Observational studies have potential for some (relatively) unique biases
• Baseline confounding
• Time zero (immortal time bias)
• Structural positivity violations
• Ill-defined causal questions

Target trial framework addresses all these biases & more
• Explicit definition of baseline
• Explicit description of target population & inclusion/exclusion criteria
• Explicit description of well-defined interventions
• Clarifies the causal question
What do we really want to know?

The target trial helps us ask better questions
• Are we asking a specific enough question to get an answer we can understand?
• Will our question guide decision making?

Emulate with observational data
Problem: observational data is hard to analyze
• Who should we include?
• When did baseline start?
• Why did exposure happen (or not)?
Solution: Target trial concept with g-methods
Parametric g-formula
Inverse probability weighting
G-estimation

A quick handshake intro to g-methods
1. Inverse probability weighting (aka IPW) of
marginal structural models
2. (Parametric) G-formula
3. Doubly-robust estimation (aka targeted
maximum likelihood estimation or TMLE if
estimated using machine learning)
4. G-estimation of structural nested models

G-methods generalize estimation to
treatment-confounder feedback

G-methods are roughly similar to …
Inverse probability weighting ≈ Propensity scores
(Parametric) G-formula ≈ Standardization
G-estimation ≈ Instrumental variables
Use the methods on the left when you have treatment-confounder feedback and would normally use the methods on the right.

Why use inverse probability weights?
Inverse probability weighting is a way of
correcting for missing information.
We can correct for:
Loss to follow-up: Missing outcome under
assigned treatment
Non-adherence: Missing counterfactual
outcome, had they received assigned treatment
Other missingness:
E.g. when visits are missed but we wanted to look at info
collected at those visits.
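A minimal sketch of inverse probability weighting for a single binary treatment A and a single binary confounder L. The counts are invented, and the function name `ipw_risk` is hypothetical; real analyses would usually estimate the treatment probabilities with a model rather than from cell counts.

```python
from collections import defaultdict

# Invented counts, not real data: (L, A, Y, number of people).
COUNTS = [
    (0, 0, 1, 6), (0, 0, 0, 54),
    (0, 1, 1, 4), (0, 1, 0, 16),
    (1, 0, 1, 12), (1, 0, 0, 18),
    (1, 1, 1, 54), (1, 1, 0, 36),
]


def ipw_risk(counts, a):
    """Inverse-probability-weighted estimate of Pr[Y^a = 1]."""
    n_l = defaultdict(int)   # people in each stratum of L
    n_la = defaultdict(int)  # people in each (L, A) cell
    for l, trt, _, n in counts:
        n_l[l] += n
        n_la[(l, trt)] += n

    weighted_events = 0.0
    weighted_total = 0.0
    for l, trt, y, n in counts:
        if trt != a:
            continue  # only people who actually received level a contribute
        weight = n_l[l] / n_la[(l, trt)]  # 1 / Pr[A = a | L = l]
        weighted_events += weight * y * n
        weighted_total += weight * n
    return weighted_events / weighted_total


risk_treated = ipw_risk(COUNTS, a=1)
risk_untreated = ipw_risk(COUNTS, a=0)
print(risk_treated - risk_untreated)
```

Each person who received level a stands in for the people like them (same L) whose outcome under a is missing, which is the sense in which the weights correct for missing information.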

What is the parametric g-formula?
• A generalization (g) of standardization to time-varying settings
• An equation (formula) that relates the observational data to the counterfactual data
• Solved using Monte-Carlo simulation, which relies on (parametric) modeling assumptions

The general formula for the parametric g-formula
For a single time point of exposure:
The probability of the counterfactual outcome (Y^a) if everyone received exposure level A=a is the average of the stratum-specific observed outcome probabilities among people who received exposure a (Pr[Y=1|A=a, L=l]), weighted by the probability of being in each stratum (Pr[L=l]):
\[ \Pr[Y^{a} = 1] = \sum_{l} \Pr[Y = 1 \mid A = a, L = l] \, \Pr[L = l] \]
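A minimal sketch of the single-time-point formula as plain standardization, using invented counts for a binary confounder L, treatment A, and outcome Y. The function name `standardized_risk` is hypothetical, and with many covariates the stratum-specific risks would come from a fitted model instead of cell counts.

```python
# Invented counts, not real data: (L, A, Y, number of people).
COUNTS = [
    (0, 0, 1, 6), (0, 0, 0, 54),
    (0, 1, 1, 4), (0, 1, 0, 16),
    (1, 0, 1, 12), (1, 0, 0, 18),
    (1, 1, 1, 54), (1, 1, 0, 36),
]


def standardized_risk(counts, a):
    """Sum over l of Pr[Y=1 | A=a, L=l] * Pr[L=l]."""
    n_total = sum(n for _, _, _, n in counts)
    strata = sorted({l for l, _, _, _ in counts})
    risk = 0.0
    for l in strata:
        n_l = sum(n for s, _, _, n in counts if s == l)
        n_la = sum(n for s, t, _, n in counts if s == l and t == a)
        events = sum(n for s, t, y, n in counts
                     if s == l and t == a and y == 1)
        # Stratum-specific risk among the treated-at-a, weighted by Pr[L=l].
        risk += (events / n_la) * (n_l / n_total)
    return risk


def crude_risk(counts, a):
    """Unadjusted Pr[Y=1 | A=a], shown only for comparison."""
    n_a = sum(n for _, t, _, n in counts if t == a)
    events = sum(n for _, t, y, n in counts if t == a and y == 1)
    return events / n_a


print(standardized_risk(COUNTS, 1), crude_risk(COUNTS, 1))
```

In these made-up data the crude risk among the treated is higher than the standardized risk because treated people are concentrated in the higher-risk stratum.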

The general formula for the parametric g-formula
For a single time point:
\[ \Pr[Y^{a} = 1] = \sum_{l} \Pr[Y = 1 \mid A = a, L = l] \, \Pr[L = l] \]
For multiple time points:
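For reference, the g-formula for a treatment that varies over time is commonly written as below for a fixed treatment strategy and an outcome measured at the end of follow-up. This is the standard textbook form rather than a transcription of the slide. The overbar notation (\bar{a}_k for treatment history through time k) and the final time index K are assumptions introduced here, and history terms with index -1 are taken to be empty.

```latex
\[
\Pr[Y^{\bar{a}} = 1]
  = \sum_{\bar{l}}
    \Pr[Y = 1 \mid \bar{A}_K = \bar{a}_K, \bar{L}_K = \bar{l}_K]
    \prod_{k=0}^{K}
    \Pr[L_k = l_k \mid \bar{A}_{k-1} = \bar{a}_{k-1}, \bar{L}_{k-1} = \bar{l}_{k-1}]
\]
```

With K = 0 the product has a single term, Pr[L_0 = l_0], and the expression reduces to the single-time-point formula.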

Ways of solving the g-formula
For a (very) small number of covariates &
time points, we can calculate by hand
In practice, we almost always have too many
variables and/or time points.
Instead, we can use Monte Carlo simulation
Packages exist in SAS and R (https://github.com/CausalInference)
We can also use an iterative approach
Wen et al. 2020. Biometrics.
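A minimal sketch of the Monte Carlo idea for two time points, assuming the conditional probabilities are already known. In a real analysis they would come from parametric models fitted to the observed data; here the functions `p_l0`, `p_l1`, and `p_y` and all their coefficients are invented, and this is not the algorithm used by the SAS or R packages.

```python
import random


# Hypothetical conditional probabilities (invented numbers).
def p_l0():
    return 0.4  # Pr[L0 = 1]


def p_l1(l0, a0):
    return 0.2 + 0.4 * l0 - 0.1 * a0  # Pr[L1 = 1 | L0, A0]


def p_y(l0, a0, l1, a1):
    # Pr[Y = 1 | L0, A0, L1, A1]
    return 0.10 + 0.10 * l0 + 0.20 * l1 - 0.02 * a0 - 0.03 * a1


def g_formula_monte_carlo(a0, a1, n_sim=100_000, seed=2021):
    """Simulate covariate histories with treatment fixed at (a0, a1)."""
    rng = random.Random(seed)
    events = 0
    for _ in range(n_sim):
        l0 = int(rng.random() < p_l0())
        l1 = int(rng.random() < p_l1(l0, a0))  # L1 responds to the set a0
        events += int(rng.random() < p_y(l0, a0, l1, a1))
    return events / n_sim


def g_formula_exact(a0, a1):
    """The same quantity by direct summation over (l0, l1)."""
    total = 0.0
    for l0 in (0, 1):
        pr_l0 = p_l0() if l0 else 1 - p_l0()
        for l1 in (0, 1):
            pr_l1 = p_l1(l0, a0) if l1 else 1 - p_l1(l0, a0)
            total += p_y(l0, a0, l1, a1) * pr_l0 * pr_l1
    return total


always = g_formula_monte_carlo(1, 1)
never = g_formula_monte_carlo(0, 0)
print(always - never)
```

Direct summation is feasible here only because there are four covariate histories; simulation scales to settings where the sum has too many terms to enumerate.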

But often we don’t have observational data
either
Not enough data available
Existing treatments in new populations
Novel indications for existing treatments
Hypothetical exposures or treatments
What else can we do?

Emulate with simulation
modeling
Simulation-based approaches give us faster decisions and require less data
A tool for combining our subject matter
knowledge, available data, and best
guesses into a quantitative prediction

What types of simulation modeling are
available?
Group-level:
Compartmental models
Markov cohorts
SIR models
Unit-level:
Individual-level simulation models
Agent-based models
Microsimulations

Group-level models lump people
together by characteristics
Susceptible → Infected → Recovered
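A minimal sketch of a group-level (compartmental) SIR model in discrete time. The parameter names `beta` (transmission rate) and `gamma` (recovery rate) and the values used are assumptions for illustration; everyone in a compartment is treated as identical.

```python
def sir_step(s, i, r, beta, gamma):
    """Advance the three compartments by one time step."""
    n = s + i + r
    new_infections = beta * s * i / n  # depends only on compartment sizes
    new_recoveries = gamma * i
    return (s - new_infections,
            i + new_infections - new_recoveries,
            r + new_recoveries)


def run_sir(s0, i0, r0, beta, gamma, n_steps):
    """Return the list of (S, I, R) sizes at each time step."""
    history = [(s0, i0, r0)]
    for _ in range(n_steps):
        history.append(sir_step(*history[-1], beta, gamma))
    return history


# Illustrative values only.
history = run_sir(s0=990, i0=10, r0=0, beta=0.3, gamma=0.1, n_steps=100)
print(history[-1])
```

The model tracks three numbers per time step and nothing about any individual, which is what lumping people together by characteristics means in practice.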

Individual-level models allow people to
live different (virtual) lives

Group-level models make assumptions about effect modification
• What types of people are sufficiently the same from the perspective of your causal question?
• What information is necessary to track and which information can be discarded?
• Which strata of effect modifiers are important?

Individual-level models make assumptions about “risk factors”
• What types of people are in your population?
• What determines who comes in contact with whom?
• What determines who gets sick?
• What determines how long until someone recovers or dies?
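A minimal sketch of an individual-level simulation, in which each of the questions above becomes an explicit modeling choice. The function name `simulate_individuals`, the susceptibility multipliers, the random-mixing contact rule, and all parameter values are invented for illustration.

```python
import random


def simulate_individuals(n_people=200, n_days=60, contacts_per_day=3,
                         p_transmit=0.05, p_recover=0.1, seed=1):
    """Simulate infection spreading between individually tracked people."""
    rng = random.Random(seed)
    # Who is in the population: each person has their own susceptibility.
    people = [{"state": "S", "susceptibility": rng.choice([0.5, 1.0, 2.0])}
              for _ in range(n_people)]
    people[0]["state"] = "I"  # a single initial infection

    daily_infected = []
    for _ in range(n_days):
        infectious = [p for p in people if p["state"] == "I"]
        for person in infectious:
            # Who meets whom: contacts drawn uniformly at random.
            for contact in rng.sample(people, contacts_per_day):
                # Who gets sick: chance scales with personal susceptibility.
                p_infect = p_transmit * contact["susceptibility"]
                if contact["state"] == "S" and rng.random() < p_infect:
                    contact["state"] = "I"
            # How long until recovery: a constant daily recovery chance.
            if rng.random() < p_recover:
                person["state"] = "R"
        daily_infected.append(sum(p["state"] == "I" for p in people))
    return people, daily_infected


people, daily_infected = simulate_individuals()
print(max(daily_infected))
```

Every commented line is an assumption that a group-level model would have made implicitly; changing any of them gives each virtual person a different set of possible lives.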

Individual-level simulation models can
be built like layer cakes
Individual life history
layer
Environmental exposures
layer
Contact pattern layer

The more layers, the more assumptions
• Each layer has its own assumptions
• The choice of layers is an assumption about what is important
• The flow of information between layers has assumptions
• Together, these are the “structure” assumptions

But that’s not all!
If we want our model to tell us about decisions we can take in the real world, then we need our model to replicate the real world.
Let’s consider the simplest model
• Individuals in a single layer
• Cannot interact
• Have no agency
• Have no environment or neighborhood

Assumption – Data trade-off
Goal: causal inference in one population
• More data required
• Fewer assumptions
• Observational emulation
Goal: causal inference in many populations
• Less data required
• More assumptions
• Agent-based models
Murray, 2016. Agent-based models for causal inference. Harvard University.

All causal inference requires
assumptions
Simulation models can be thought of as a
way to emulate what we would have learned
if we had conducted an observational or
randomized study
So, we need to also make all the
assumptions we would have made in those
studies when making causal inference!

What assumptions do we need?
No unmeasured confounding: all common causes of
the treatment and outcome are known and measured in
the data
No open colliders: all common effects of the treatment
and outcome are known and not conditioned on in the
data or analysis

Positivity: there is a non-zero probability of
all levels of treatment for all types of
individuals in our population
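A minimal sketch of an empirical positivity check: list the combinations of covariate stratum and treatment level that nobody in the data occupies. The function name `positivity_violations` and the toy records are hypothetical, and an empty cell found this way may be a structural violation or just a small-sample artifact; the data alone cannot tell the two apart.

```python
from collections import Counter


def positivity_violations(records, treatment_levels):
    """Return (stratum, treatment) pairs with no observed individuals.

    records is an iterable of (stratum, treatment) pairs, one per person.
    """
    records = list(records)
    seen = Counter(records)
    strata = {stratum for stratum, _ in records}
    return sorted((s, a) for s in strata for a in treatment_levels
                  if seen[(s, a)] == 0)


# Toy data: no one in the "old" stratum received treatment 1.
toy_records = [("young", 1), ("young", 0), ("old", 0), ("old", 0)]
empty_cells = positivity_violations(toy_records, treatment_levels=(0, 1))
print(empty_cells)
```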

Consistency: our treatment levels are clearly specified, aka:
• Well-defined interventions
• Well-defined causal questions

Why is target trial emulation using simulation models harder than observational studies or trials?
• We need these 3 assumptions for all trial emulation methods
• but for simulation models we need these assumptions to hold for every pair of variables in our model!
• This starts to get extra tricky!
Murray, 2016. Agent-based models for causal inference. Harvard University.
Murray et al, Am J Epidemiol 2017; 186(2):

Special challenge: well-defined
mediators
To answer our question with observational data,
we need a well-defined question about treatment
To answer our question with a simulation model,
we also need well-defined questions about CD4
cell count.
Murray et al, Med Dec Making 2020; 40 (1),

Parameterizing mediators
We need a value for: Effect of antiretroviral
therapy initiation time on mortality when
CD4 count is held fixed at some value
How do we “fix” it?
What value do we choose? Does it matter?
Murray et al, Med Dec Making 2020; 40 (1),

Parameters must be externally valid
If we use more than one data source for our
model parameters, we need every parameter
to be externally valid.
This requires more assumptions!

External validity requires:
• No unmeasured outcome causes: all causes of outcome are known and either modeled or identically distributed between populations
• (for every variable in the model that looks like an ‘outcome’)

So, does it work?
Can we change our assumptions into
knowledge & make causal inference using
simulation-based approaches?
Yes, but only if:
All our assumptions are correct!

But:
Even in the simplest case, it is hard to
get the assumptions right!
Even if we get them right, we never
know for sure they are right!
When we add in disease transmission,
things get even harder!

What about transmission?
Reminder, we started with models where individuals
• Cannot interact
• Have no agency
• Have no environment or neighborhood
The target trial framework can help us understand how we could relax our assumptions and still estimate causal effects that we can define and understand

Agent-based model results can be hard to
interpret
What do we really
want to know?
Control Intervention
Connections: Shared HIV risk
Index: shaded brown or red nodes
Nearest neighbors: outlined nodes

How does the target trial help?
Problem: Causal inference typically requires the assumption of no interference, but many topics violate this assumption
• infectious diseases
• behavioral interventions
• educational interventions
Solution: design our simulations to emulate cluster-randomized, two-stage, ring-vaccine, or other interference-friendly trial designs

Interference makes decision-making hard…
…but we have frameworks for understanding effects under interference in randomized trials
Adapted from: Halloran and Struchiner (
Buchanan, et al. AJE 2021

Buchanan, et al. AJE 2021
Murray, et al. AJE 2021
understand results of agent-based models
and network analyses

Summary 1: Asking good questions is hard
But the clearer we are about what we are
asking, the easier it is to make use of the
answer

Summary 2: Answering our questions well is hard too
[Figure: a spectrum from “Unbiased” to “Intractably biased”. Study types shown: ideal randomized controlled trial, explanatory randomized controlled trials, pragmatic randomized trials, observational studies, simulation studies, and relying on “gut feeling”. Sources of bias labeled: loss to follow-up, non-adherence, baseline confounding, lack of generalizability, ill-defined uncertainty.]

How do we estimate causal
effects?
Miguel Hernán’s two-step causal algorithm:
1. Ask good questions
2. Answer them with appropriate methods

Contact me:
@EpiEllie
ejmurray@bu.edu
https://github.com/eleanormurray
https://sites.bu.edu/causal/
QUESTIONS!
