@EpiEllie
Causal inference for complex
data: Asking questions that
matter, getting answers that
help
Eleanor Murray, ScD, MPH
Department of
Epidemiology
University of Minnesota
December 3, 2021
Epidemiology is:
Learning who has which health
problems where, and
Figuring out what to do to
change that
We can’t achieve these goals
unless we ASK the right
questions, and ESTIMATE
useful answers
Changing the public’s health requires
understanding and estimating causal
effects!
So how do we estimate causal
effects?
Miguel Hernán’s two-step causal algorithm:
1. Ask good questions
2. Answer them with appropriate methods
How do we ask good questions?
Start with a clear causal question:
What is the exact exposure(s) of interest?
What is the exact comparison group(s) of
interest?
What is the exact outcome?
Who do we want to learn about?
Relationships are complicated—
especially in epidemiology
Defining exposure is often hard…
Jiang et al 2020
…and measuring exposure can be even
harder
Measurement error for depression
Measurement error for PTSD
Measurement error for suicid
Jiang et al 2020
Complex exposures give us complicated
answers, even if we can intervene…
• Do we always need to give all parts of the intervention?
• Does the timing of the intervention or its pieces matter?
• How would the intervention work in groups with other
types of usual care or comparators?
What makes an exposure complex?
Multiple components that make it hard to define or
measure
• e.g. race, socio-economic position, cognitive
behavioral therapy
Interference between individuals
• e.g. infections, behaviors & habits, education
Exposures that vary over space
• e.g. air pollution, access to goods & services
Exposures that vary over time
• e.g. medication usage, unhealthy habits
Simple exposures that could occur at any time
Feedback loops exist, but they aren’t
really loops
Instead, they represent a recurring
sequence over time
Often, we have to decide where in the
loop to start
… so what about when we can’t*
intervene?
We need to be even more careful about the
questions we ask!
* whether because of ethical, logistical, financial, or even time
constraints
• We need well-defined causal questions!
Why are well-defined causal questions
important for complex exposures?
When there are multiple possible ‘interventions’
and we don’t specify one, our answer is a
weighted average of all ‘interventions’ but we
don’t know the weights
Murray, 2016. Agent-based models for causal inference. Harvard
University.
Murray, et al. 2019. Medical Decision Making
We call this the “WACE”:
weighted average causal effect
Useful for estimating effects of
biomarkers in a defined
population
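A small numeric sketch of the weighted-average idea may help here. It assumes three hypothetical ways of "lowering a biomarker", each with its own effect. The intervention names, effect sizes, and weights are all made up for illustration and do not come from the cited work.

```python
# Hypothetical effects (risk differences) of three versions of the "intervention".
# All numbers are illustrative assumptions, not estimates from any study.
effects = {"diet": -0.02, "exercise": -0.05, "drug": -0.10}


def weighted_average_causal_effect(effects, weights):
    """Average the version-specific effects using the given weights."""
    total = sum(weights.values())
    return sum(effects[k] * weights[k] / total for k in effects)


# Two populations in which the versions occur in different proportions.
# In a real analysis these weights are usually unknown.
weights_pop_a = {"diet": 0.6, "exercise": 0.3, "drug": 0.1}
weights_pop_b = {"diet": 0.1, "exercise": 0.3, "drug": 0.6}

wace_a = weighted_average_causal_effect(effects, weights_pop_a)
wace_b = weighted_average_causal_effect(effects, weights_pop_b)
print(round(wace_a, 3), round(wace_b, 3))
```

The same version-specific effects give different answers in the two populations, which is why an answer with unknown weights is hard to act on or to transport elsewhere.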
Why are well-defined causal questions
important for complex exposures?
Worse, if the ‘intervention’ is ill-defined, the
confounding is probably also ill-defined!
Murray, 2016. Agent-based models for causal inference. Harvard
University.
Murray, et al. 2019. Medical Decision Making [in press]
Asking questions that matter for
complex exposures: target trial
framework
• What is the intervention you would do if you
could do an experiment?
• How would that experiment help you make
treatment, policy, or other decisions?
Are we asking a specific enough question to get
an answer we can understand and act upon?
The best questions lead to
action
What would the difference in total
income be for all tipped workers:
◦ if all tipped workers had been women
versus
◦ if all those same tipped workers had
been men?
How do we act upon the answer to
that question?
Does gender
cause poverty
for tipped
workers?
The best questions lead to
action
What would the difference in total
income be for women who work for
tips:
◦ if all women received the amount of tips
they typically receive versus
◦ if those same women received the
amount of tips that men typically
receive?
This question could help us plan
policy interventions
If women who worked
for tipped wages,
received the same
amount of tips as their
male colleagues, would
their poverty risk
decrease?
Example inspired by research by Dr Sarah
Andrea & colleagues
The best questions lead to
action
What would the death rates be from
COVID-19:
◦ if all individuals who became infected
had been people of color, versus
◦ if all those same individuals had been
white people?
How do we act upon the answer to
that question?
Does race cause
death from
COVID-19?
The best questions lead to
action
What would the difference in
deaths from COVID-19 have been:
◦ if all people of color who were
infected had the probability of dying
observed in 2020-21, versus
◦ if those same people of color had
instead had the same probability of
dying as white people who were
infected?
This question could help us plan
policy interventions
If people of color in
America had experienced
the same rate of death
from COVID-19 as white
Americans, how many
more people would be
alive?
Example inspired by research by Dr Justin
Feldman & colleagues
What do we want to know?
How would things have changed if the
world had been slightly different?
Treat now
Treat later
What do we want to know?
If we can’t have a time machine, we’d like to
have a randomized trial.
Treat now
Treat later
Many decisions need to be made
NOW
A randomized trial would, in principle, answer
these questions …
… But we don’t always have randomized trials for
many reasons
Deferring a decision is not an option
No decision = decision: “Keep status quo”
Worse: in reality, even perfect randomized
trials are hard!
Luckily, we have a solution that’s as old
as epi!
What do we want to know?
If we can’t have a randomized trial, we’d like
to emulate what would have happened if we
could have done one.
Treat now
Treat later
Emulate with randomized trial data
The target trial framework helps us avoid
biases caused by:
Informative censoring
Non-random non-adherence
Competing events
Poorly defined causal questions
Generalizability & transportability problems
Incorrect interpretations of trial results
What do we mean by “Target Trial”?
Target trial framework also helps us
identify our causal estimand (i.e. the target
parameter)
1. Intention-to-treat effects (available in randomized trials only)
• Effect of randomization to treatment
2. Per-protocol effects: effect of treatment (available in randomized trials *and* observational studies)
• Effect of initiating treatment
• Effect of adhering to treatment protocol
• Effect of receiving point intervention, among
the ‘compliers’ (not necessarily all adherers!)
Hernán & Robins, 2016. N
The target trial framework
clarifies that these studies aren’t
asking the same question
because the causal estimands are
different!
Why does the estimand matter?
Emulate with observational
data
But if we don’t have a trial, we can also emulate
the target trial with observational data
Problem: observational data is hard to analyze
• Who should we include?
• When did baseline start?
• Why did exposure happen (or not)?
Observational studies have potential for
some (relatively) unique biases
• Baseline confounding
• Time zero (immortal time bias)
• Structural positivity violations
• Ill-defined causal questions
Target trial framework addresses all
these biases & more
• Explicit definition of baseline
• Explicit description of target population &
inclusion/exclusion criteria
• Explicit description of well-defined
interventions
• Clarifies the causal question
What do we really
want to know?
The target trial helps us ask better
questions
• Are we asking a specific enough question to get an
answer we can understand?
• Will our question guide decision making?
Emulate with observational
data
Problem: observational data is hard to analyze
• Who should we include?
• When did baseline start?
• Why did exposure happen (or not)?
Solution: Target trial concept with g-methods
Parametric g-formula
Inverse probability weighting
G-estimation
A quick handshake intro to g-methods
1. Inverse probability weighting (aka IPW) of
marginal structural models
2. (Parametric) G-formula
3. Doubly-robust estimation (aka targeted
maximum likelihood estimation or TMLE if
estimated using machine learning)
4. G-estimation of structural nested models
G-methods generalize estimation to
treatment-confounder feedback
G-methods are roughly similar to …
Inverse probability weighting ≈ Propensity scores
(Parametric) G-formula ≈ Standardization
G-estimation ≈ Instrumental variables
Use the g-methods (left) when you have treatment-confounder feedback and would normally use the methods on the right.
Why use inverse probability weights?
Inverse probability weighting is a way of
correcting for missing information.
We can correct for:
Loss to follow-up: Missing outcome under
assigned treatment
Non-adherence: Missing counterfactual
outcome, had they received assigned treatment
Other missingness:
E.g. when visits are missed but we wanted to look at info
collected at those visits.
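The loss-to-follow-up case can be sketched with a toy dataset. The sketch assumes that missingness depends only on one measured binary covariate L, so each person with an observed outcome is weighted by 1 / Pr[observed | L]. The records and the function name `ipw_mean_outcome` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Toy records: (L, observed, Y). Y is None when the outcome is missing
# (lost to follow-up). All counts are made up for illustration.
records = (
    [(0, True, 1)] * 10 + [(0, True, 0)] * 30 + [(0, False, None)] * 10
    + [(1, True, 1)] * 15 + [(1, True, 0)] * 10 + [(1, False, None)] * 25
)


def ipw_mean_outcome(records):
    """Estimate Pr[Y=1] in the full population by inverse probability weighting."""
    # Step 1: estimate Pr[observed | L] within each stratum of L.
    n_total = defaultdict(int)
    n_obs = defaultdict(int)
    for l, observed, _ in records:
        n_total[l] += 1
        n_obs[l] += observed
    # Step 2: weight each observed person by 1 / Pr[observed | L].
    num = den = 0.0
    for l, observed, y in records:
        if observed:
            w = n_total[l] / n_obs[l]
            num += w * y
            den += w
    return num / den


# Complete-case risk ignores that people with L=1 were lost more often.
naive = sum(y for _, o, y in records if o) / sum(1 for _, o, _ in records if o)
weighted = ipw_mean_outcome(records)
print(round(naive, 3), round(weighted, 3))
```

Each observed person stands in for themselves plus the similar people (same L) whose outcomes are missing. This only corrects the bias if the measured covariates really explain who was lost.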
What is the parametric g-formula?
• A generalization (g) of standardization to
time-varying settings
• An equation (formula) that relates the
observational data to the counterfactual
data
• Solved using Monte-Carlo simulation,
which relies on (parametric) modeling
assumptions
The general formula for the
parametric g-formula
For a single time point of exposure:
The probability of the counterfactual outcome (Y^a) if
everyone received exposure level A=a
equals the average of stratum-specific observed outcome
probabilities among people who received exposure a
(Pr[Y=1|A=a, L=l]),
weighted by the probability of being in each stratum
(Pr[L=l]):
\Pr[Y^{a}=1] = \sum_{l} \Pr[Y=1 \mid A=a, L=l] \, \Pr[L=l]
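For a single time point, the standardization described above can be computed directly from counts. The sketch below assumes one binary confounder L, a binary exposure A, and a binary outcome Y. The counts and the function name `standardized_risk` are made up for illustration.

```python
from collections import Counter

# Toy records: (L, A, Y), all binary. Counts are illustrative assumptions.
data = (
    [(0, 0, 1)] * 5 + [(0, 0, 0)] * 35 + [(0, 1, 1)] * 1 + [(0, 1, 0)] * 9
    + [(1, 0, 1)] * 4 + [(1, 0, 0)] * 6 + [(1, 1, 1)] * 12 + [(1, 1, 0)] * 28
)


def standardized_risk(data, a):
    """Pr[Y^a = 1] = sum over l of Pr[Y=1 | A=a, L=l] * Pr[L=l]."""
    n = len(data)
    n_l = Counter(l for l, _, _ in data)
    risk = 0.0
    for l, count_l in n_l.items():
        stratum = [y for (li, ai, y) in data if li == l and ai == a]
        if not stratum:
            # No one with this exposure level in this stratum: positivity fails.
            raise ValueError(f"no one with A={a} in stratum L={l}")
        risk += (sum(stratum) / len(stratum)) * (count_l / n)
    return risk


crude_1 = sum(y for _, a, y in data if a == 1) / sum(1 for _, a, _ in data if a == 1)
crude_0 = sum(y for _, a, y in data if a == 0) / sum(1 for _, a, _ in data if a == 0)
print(round(crude_1 - crude_0, 4))
print(round(standardized_risk(data, 1) - standardized_risk(data, 0), 4))
```

In this toy dataset the crude comparison and the standardized comparison point in opposite directions, because exposed people are concentrated in the higher-risk stratum.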
The general formula for the
parametric g-formula
For a single time point:
\Pr[Y^{a}=1] = \sum_{l} \Pr[Y=1 \mid A=a, L=l] \, \Pr[L=l]
For multiple time points:
[equation not preserved in the slide text]
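The multiple-time-point equation did not survive in the slide text. As a point of reference, a standard textbook form of the g-formula for a static treatment strategy over time points k = 0, …, K is sketched below. The overbar (history) notation and the time indices are assumptions about notation, not a reconstruction of the slide.

```latex
% Two time points (K = 1), outcome measured at the end of follow-up:
\Pr[Y^{a_0,a_1}=1]
  = \sum_{l_0,\,l_1}
    \Pr[Y=1 \mid A_0=a_0, A_1=a_1, L_0=l_0, L_1=l_1]\,
    \Pr[L_1=l_1 \mid A_0=a_0, L_0=l_0]\,
    \Pr[L_0=l_0]

% General case, with \bar{a}_k = (a_0,\dots,a_k) and \bar{l}_k = (l_0,\dots,l_k):
\Pr[Y^{\bar{a}}=1]
  = \sum_{\bar{l}}
    \Pr[Y=1 \mid \bar{A}=\bar{a}, \bar{L}=\bar{l}]
    \prod_{k=0}^{K} \Pr[L_k=l_k \mid \bar{A}_{k-1}=\bar{a}_{k-1}, \bar{L}_{k-1}=\bar{l}_{k-1}]
```

With a single time point the product reduces to Pr[L=l], which recovers the standardization formula for one exposure.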
Ways of solving the g-formula
For a (very) small number of covariates &
time points, we can calculate by hand
In practice, we almost always have too many
variables and/or time points.
Instead, we can use Monte Carlo simulation
Packages exist in SAS and R
(https://github.com/CausalInference)
We can also use an iterative approach
Wen et al. 2020. Biometrics.
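The Monte Carlo idea can be sketched for two time points. In practice the conditional probabilities would come from parametric models fitted to data. Here they are written as fixed, made-up functions so the simulated answer can be checked against the exact sum. This is not the interface of the SAS or R packages. Every name and number below is a hypothetical stand-in.

```python
import random


# Hypothetical "fitted models", written as fixed probabilities for illustration.
def p_l0():
    """Pr[L0 = 1]."""
    return 0.4


def p_l1(a0, l0):
    """Pr[L1 = 1 | A0 = a0, L0 = l0]: the confounder is affected by past treatment."""
    return 0.2 + 0.4 * l0 - 0.1 * a0


def p_y(a0, a1, l0, l1):
    """Pr[Y = 1 | A0, A1, L0, L1]."""
    return 0.15 + 0.2 * l0 + 0.3 * l1 - 0.05 * a0 - 0.05 * a1


def g_formula_monte_carlo(a0, a1, n_sim=100_000, seed=1):
    """Simulate covariate histories under the strategy (a0, a1) and average the risk."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sim):
        l0 = int(rng.random() < p_l0())
        l1 = int(rng.random() < p_l1(a0, l0))  # treatment is set, not drawn
        total += p_y(a0, a1, l0, l1)
    return total / n_sim


def g_formula_exact(a0, a1):
    """Sum over all covariate histories; feasible only for tiny problems."""
    risk = 0.0
    for l0 in (0, 1):
        pr_l0 = p_l0() if l0 else 1 - p_l0()
        for l1 in (0, 1):
            pr_l1 = p_l1(a0, l0) if l1 else 1 - p_l1(a0, l0)
            risk += p_y(a0, a1, l0, l1) * pr_l1 * pr_l0
    return risk


print(round(g_formula_exact(1, 1), 3), round(g_formula_monte_carlo(1, 1), 3))
```

The exact sum needs one term per covariate history, so it grows quickly with more variables or time points. The simulation only needs the ability to draw each variable given its past.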
But often we don’t have observational data
either
Not enough data available
Existing treatments in new populations
Novel indications for existing treatments
Hypothetical exposures or treatments
What else can we do?
Emulate with simulation
modeling
Simulation-based approaches give us
faster decisions and require less data
A tool for combining our subject matter
knowledge, available data, and best
guesses into a quantitative prediction
What types of simulation modeling are
available?
Group-level:
Compartmental models
Markov cohorts
SIR models
Unit-level:
Individual-level simulation models
Agent-based models
Microsimulations
Group-level models lump people
together by characteristics
Susceptible Infected Recovered
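A minimal sketch of a group-level model is the classic SIR compartmental model, stepped forward with small time increments. The parameter values (beta, gamma, population size) are arbitrary assumptions for illustration, and `simulate_sir` is a hypothetical helper name.

```python
def simulate_sir(s0, i0, r0, beta, gamma, days, dt=0.1):
    """Euler steps for dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I."""
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r
    history = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for k in range(1, steps + 1):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s = s - new_infections
        i = i + new_infections - new_recoveries
        r = r + new_recoveries
        history.append((k * dt, s, i, r))
    return history


# Everyone in a compartment is treated as interchangeable.
history = simulate_sir(s0=990, i0=10, r0=0, beta=0.3, gamma=0.1, days=160)
peak_time, _, peak_infected, _ = max(history, key=lambda row: row[2])
print(round(peak_time, 1), round(peak_infected, 1))
```

The model tracks only three numbers over time. Any characteristic not used to define a compartment is assumed not to matter for the question being asked.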
Individual-level models allow people to
live different (virtual) lives
Group-level models make assumptions
about effect modification
• What types of people are
sufficiently the same from
the perspective of your
causal question?
• What information is
necessary to track and which
information can be
discarded?
• Which strata of effect
modifiers are important?
Individual-level models make
assumptions about “risk factors”
• What types of people
are in your population?
• What determines who
comes in contact with
whom?
• What determines who
gets sick?
• What determines how
long until someone
recovers or dies?
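The questions above map onto explicit choices in even a very small individual-level model. The sketch below assumes uniform random mixing, a fixed per-contact transmission probability, and a constant daily recovery probability. Each of these is an assumption made for illustration, and `run_abm` is a hypothetical name.

```python
import random


def run_abm(n_people=200, n_initial=5, contacts_per_day=4,
            p_transmit=0.05, p_recover=0.1, days=100, seed=42):
    """Track each person's state ("S", "I", or "R") day by day."""
    rng = random.Random(seed)
    state = ["I"] * n_initial + ["S"] * (n_people - n_initial)
    counts = []
    for _ in range(days):
        new_state = list(state)
        for person, status in enumerate(state):
            if status != "I":
                continue
            # Who comes in contact with whom: uniform random mixing (assumption).
            for _ in range(contacts_per_day):
                other = rng.randrange(n_people)
                # Who gets sick: fixed per-contact probability (assumption).
                if state[other] == "S" and rng.random() < p_transmit:
                    new_state[other] = "I"
            # How long until recovery: constant daily probability (assumption).
            if rng.random() < p_recover:
                new_state[person] = "R"
        state = new_state
        counts.append((state.count("S"), state.count("I"), state.count("R")))
    return counts


counts = run_abm()
print(counts[-1])
```

Replacing any one of these rules, for example drawing contacts from a network instead of at random, changes the model's structure and therefore the assumptions behind its results.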
Individual-level simulation models can
be built like layer cakes
Individual life history
layer
Environmental exposures
layer
Contact pattern layer
The more layers, the more
assumptions
• Each layer has its own
assumptions
• The choice of layers is an
assumption about what is
important
• The flow of information
between layers has
assumptions
• Together, these are the
“structure” assumptions
But that’s not all!
If we want our model to tell us about
decisions we can take in the real world, then
we need our model to replicate the real world.
Let’s consider the simplest model
• Individuals in a single layer
• Cannot interact
• Have no agency
• Have no environment or neighborhood
Assumption – Data trade-off
Observational emulation: goal is causal inference in one population; more data required; fewer assumptions
Agent-based models: goal is causal inference in many populations; less data required; more assumptions
Murray, 2016. Agent-based models for
causal inference. Harvard University.
All causal inference requires
assumptions
Simulation models can be thought of as a
way to emulate what we would have learned
if we had conducted an observational or
randomized study
So, we need to also make all the
assumptions we would have made in those
studies when making causal inference!
What assumptions do we need?
No unmeasured confounding: all common causes of
the treatment and outcome are known and measured in
the data
No open colliders: all common effects of the treatment
and outcome are known and not conditioned on in the
data or analysis
What assumptions do we need?
Positivity: there is a non-zero probability of
all levels of treatment for all types of
individuals in our population
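A rough data check for positivity is to look for covariate strata in which some treatment level never appears. The sketch below uses made-up records and a hypothetical helper, `positivity_violations`. An empty cell in a finite sample may reflect chance rather than a structural violation, so the output is a prompt for subject-matter judgment, not a verdict.

```python
from collections import defaultdict


def positivity_violations(records, treatment_levels):
    """Return (stratum, treatment) pairs that never occur in the data."""
    seen = defaultdict(set)
    for stratum, treatment in records:
        seen[stratum].add(treatment)
    return sorted(
        (stratum, a)
        for stratum, observed in seen.items()
        for a in treatment_levels
        if a not in observed
    )


# Toy records: (covariate stratum, treatment received). Values are illustrative.
records = [("young", 0), ("young", 1), ("old", 0), ("old", 0)]
print(positivity_violations(records, treatment_levels=(0, 1)))
```

Here no one in the "old" stratum received treatment 1, so the data alone cannot say what would happen to that group under treatment.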
What assumptions do we need?
• Consistency: our treatment levels are clearly
specified, aka:
• Well-defined interventions
• Well-defined causal questions
Why is target trial emulation using
simulation models harder than
observational studies or trials?
• We need these 3 assumptions for all trial
emulation methods
• but for simulation models we need these
assumptions to hold for every pair of
variables in our model!
• This starts to get extra tricky!
Murray, 2016. Agent-based models for causal inference. Harvard
University.
Murray et al, Am J Epidemiol 2017 ; 186(2):
Special challenge: well-defined
mediators
To answer our question with observational data,
we need a well-defined question about treatment
To answer our question with a simulation model,
we also need well-defined questions about CD4
cell count.
Murray et al, Med Dec Making 2020; 40 (1),
Parameterizing mediators
We need a value for: Effect of antiretroviral
therapy initiation time on mortality when
CD4 count is held fixed at some value
How do we “fix” it?
What value do we choose? Does it matter?
Murray et al, Med Dec Making 2020; 40 (1),
Parameters must be externally valid
If we use more than one data source for our
model parameters, we need every parameter
to be externally valid.
This requires more assumptions!
External validity requires:
• No unmeasured outcome causes: all causes
of outcome are known and either modeled
or identically distributed between
populations
• (for every variable in the model that looks like an
‘outcome’)
So, does it work?
Can we change our assumptions into
knowledge & make causal inference using
simulation-based approaches?
Yes, but only if:
All our assumptions are correct!
But:
Even in the simplest case, it is hard to
get the assumptions right!
Even if we get them right, we never
know for sure they are right!
When we add in disease transmission,
things get even harder!
What about transmission?
Reminder, we started with models where
individuals
• Cannot interact
• Have no agency
• Have no environment or neighborhood
The target trial framework can help us
understand how we could relax our
assumptions and still estimate causal effects
that we can define and understand
Agent-based model results can be hard to
interpret
What do we really
want to know?
Control Intervention
Connections: Shared HIV risk
Index: shaded brown or red nodes
Nearest neighbors: outlined nodes
How does the target trial help?
Problem: Causal inference typically requires the
assumption of no interference, but many topics
violate this assumption
• infectious diseases
• behavioral interventions
• educational interventions
Solution: design our simulations to emulate
cluster-randomized, two-stage, ring-vaccine,
or other interference-friendly trial designs
Adapted from: Halloran and Struchiner (
Buchanan, et al. AJE 2021
Interference makes decision-making
hard…
…but we have frameworks for understanding effects under
interference in randomized trials, and these can help us
understand results of agent-based models
and network analyses
Buchanan, et al. AJE 2021
Murray, et al. AJE 2021
Summary 1: Asking good questions is
hard
But the clearer we are about what we are
asking, the easier it is to make use of the
answer
Summary 2: Answering our questions
well is hard too
[Figure: a spectrum from “Unbiased” to “Intractably biased”. Study types shown: ideal randomized controlled trial; explanatory randomized controlled trials; pragmatic randomized trials; observational studies; simulation studies; relying on “gut feeling”. Sources of bias labeled: loss to follow-up; non-adherence; baseline confounding; lack of generalizability; ill-defined uncertainty.]
How do we estimate causal
effects?
Miguel Hernán’s two-step causal algorithm:
1. Ask good questions
2. Answer them with appropriate methods
Contact me:
@EpiEllie
ejmurray@bu.edu
https://github.com/eleanormurray
https://sites.bu.edu/causal/
QUESTIONS!

Causal inference for complex exposures: asking questions that matter, getting answers that help.

  • 1. @EpiEllie Causal inference for complex data: Asking questions that matter, getting answers that help Eleanor Murray, ScD, MPH Department of Epidemiology University of Minnesota December 3, 2021
  • 2. Epidemiology is: Learning who has which health problems where, and Figuring out what to do to change that We can’t achieve these goals unless we ASK the right questions, and ESTIMATE useful answers Changing the public’s health requires understanding and estimating causal effects!
  • 3. So how do we estimate causal effects? Miguel Hernàn’s two-step causal algorithm: 1. Ask good questions 2. Answer them with appropriate methods ? !
  • 4. How do we ask good questions? Start with a clear causal question: What is the exact exposure(s) of interest? What is the exact comparison group(s) of interest? What is the exact outcome? Who do we want to learn about?
  • 6. Defining exposure is often hard… Jiang et al 2020
  • 7. …and measuring exposure can be even harder Measurement error for depression Measurement error for PTSD Measurement error for suicid Jiang et al 2020
  • 8. Complex exposures give us complicated answers, even if we can intervene…  Do we always need to give all parts of the intervention?  Does the timing of the intervention or pieces matter?  How would the intervention work in groups with other types of usual care/ comparators?
  • 9. What makes an exposure complex? Multiple components that make it hard to define or measure  e.g. race, socio-economic position, cognitive behavioral therapy Interference between individuals  e.g. infections, behaviors & habits, education Exposures that vary over space  e.g. air pollution, access to goods & services Exposures that vary over time  e.g. medication usage, unhealthy habits Simple exposures that could occur at any time
  • 10. Feedback loops exist, but they aren’t really loops
  • 11. Instead, they represent a recurring sequence over time
  • 12. Often, we have to decide where in the loop to start
  • 13. … so what about when we can’t* intervene? We need to be even more careful about the questions we ask! * whether because of ethical, logistical, financial, or even time constraints  We need well-defined causal questions!
  • 14. Why are well-defined causal questions important for complex exposures? When there are multiple possible ‘interventions’ and we don’t specify one, our answer is a weighted average of all ‘interventions’ but we don’t know the weights Murray, 2016. Agent-based models for causal inference. Harvard University. Murray, et al. 2019. Medical Decision Making We call this the “WACE”: weighted average causal effect Useful for estimating effects of biomarkers in a defined population
  • 15. Why are well-defined causal questions important for complex exposures? Worse, if the ‘intervention’ is ill-defined, the confounding is probably also ill-defined! Murray, 2016. Agent-based models for causal inference. Harvard University. Murray, et al. 2019. Medical Decision Making [in press]
  • 16. Asking questions that matter for complex exposures: target trial framework  What is the intervention you would do if you could do an experiment?  How would that experiment help you make treatment, policy, or other decisions? Are we asking a specific enough question to get an answer we can understand and act upon?
  • 17. The best questions lead to action What would the difference in total income be for all tipped workers: ◦ if all tipped workers had been women versus ◦ if all those same tipped workers had been men? How do we act upon the answer to that question? Does gender cause poverty for tipped workers?
  • 18. The best questions lead to action What would the difference in total income be for women who work for tips: ◦ if all women received the amount of tips they typically receive versus ◦ if those same women received the amount of tips that men typically receive? This question could help us plan policy interventions If women who worked for tipped wages, received the same amount of tips as their male colleagues, would their poverty risk decrease? Example inspired by research by Dr Sarah Andrea & colleagues
  • 19. The best questions lead to action What would the death rates be from COVID-19: ◦ if all individuals who became infected had been people of color, versus ◦ if all those same individuals had been white people? How do we act upon the answer to that question? Does race cause death from COVID-19?
  • 20. The best questions lead to action What would the difference in deaths from COVID-19 have been: ◦ if all people of color who were infected had the probability of dying observed in 2020-21, versus ◦ if those same people of color had instead had the same probability of dying as white people who were infected? This question could help us plan policy interventions If people of color in America had experienced the same rate of death from COVID-19 as white Americans, how many more people would be alive? Example inspired by research by Dr Justin Feldman & colleagues
  • 21. What do we want to know? How would things have changed if the world had been slightly different? Treat now Treat later
  • 22. What do we want to know? If we can’t have a time machine, we’d like to have a randomized trial. Treat now Treat later
  • 23. Many decisions need to be made NOW A randomized trial would, in principle, answer these questions … … But we don’t always have randomized trials for many reasons Deferring a decision is not an option No decision = decision: “Keep status quo” Worse: in reality, even perfect randomized trials are hard!
  • 24. Luckily, we have a solution that’s as old as epi!
  • 25. What do we want to know? If we can’t have a randomized trial, we’d like to emulate what would have happened if we could have done one. Treat now Treat later
  • 26. Emulate with randomized trial data The target trial framework helps us avoid biases caused by: Informative censoring Non-random non-adherence Competing events Poorly defined causal questions Generalizability & transportability problems Incorrect interpretations of trial results
  • 27. What do we mean by “Target Trial”?
  • 28. Target trial framework also helps us identify our causal estimand (i.e. the target parameter) 1. Intention-to-treat effects  Effect of randomization to treatment 2.Per-protocol effects: effect of treatment  Effect of initiating treatment  Effect of adhering to treatment protocol  Effect of receiving point intervention, among the ‘compliers’ (not necessarily all adherers!) Hernan & Robins, 2016. N Available in Randomized Trials Only Available in Randomized Trials *and* Observational Studies
  • 29. The target trial framework clarifies that these studies aren’t asking the same question because the causal estimands are different! Why does the estimand matter?
  • 30. Emulate with observational data But if we don’t have a trial, we can also emulate the target trial with observational data Problem: observational data is hard to analyze  Who should we include?  When did baseline start?  Why did exposure happen (or not)?
  • 31. Observational studies have potential for some (relatively) unique biases  Baseline confounding  Time zero (immortal time bias)  Structural positivity violations  Ill-defined causal questions
  • 32. Target trial framework addresses all these biases & more  Explicit definition of baseline  Explicit description of target population & inclusion/ exclusion criteria  Explicit description of well-defined interventions  Clarifies the causal question What do we really want to know?
  • 33. The target trial helps us ask better questions  Are we asking a specific enough question to get an answer we can understand?  Will our question guide decision making?
  • 34. Emulate with observational data Problem: observational data is hard to analyze  Who should we include?  When did baseline start?  Why did exposure happen (or not)? Solution: Target trial concept with g-methods Parametric g-formula Inverse probability weighting G-estimation
  • 35. A quick handshake intro to g-methods 1. Inverse probability weighting (aka IPW) of marginal structural models 2. (Parametric) G-formula 3. Doubly-robust estimation (aka targeted maximum likelihood estimation or TMLE if estimated using machine learning) 4. G-estimation of structural nested models
  • 36. G-methods generalize estimation to treatment-confounder feedback
  • 37. G-methods are roughly similar to … Inverse probability weighting (Parametric) G-formula G-estimation Propensity scores Standardization Instrumental variables ≈ ≈ ≈ Use when you have treatment - confound er feedback, and …. … you would normally use these.
  • 38. Why use inverse probability weights? Inverse probability weighting is a way of correcting for missing information. We can correct for: Loss to follow-up: Missing outcome under assigned treatment Non-adherence: Missing counterfactual outcome, had they received assigned treatment Other missingness: E.g. when visits are missed but we wanted to look at info collected at those visits.
  • 39. What is the parametric g-formula?  A generalization (g) of standardization to time-varying settings  An equation (formula) that relates the observational data to the counterfactual data  Solved using Monte-Carlo simulation, which relies on (parametric) modeling assumptions
  • 40. The general formula for the parametric g-formula For a single time point of exposure: The probability of the counterfactual outcome (Y^a) if everyone received exposure level A=a is the average of the stratum-specific observed outcome probabilities among people who received exposure a (Pr[Y=1|A=a, L=l]), weighted by the probability of being in each stratum (Pr[L=l]): $\Pr[Y^a = 1] = \sum_l \Pr[Y = 1 \mid A = a, L = l]\,\Pr[L = l]$
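The single-time-point formula can be computed directly when there are only a few strata. The sketch below assumes a hypothetical toy dataset of (L, A, Y) records with binary variables; the data and the function name `standardized_risk` are illustrative and not taken from the talk. It averages the stratum-specific risks among people with A = a, weighting each stratum by Pr[L = l].

```python
from collections import defaultdict

# Hypothetical toy data: (L, A, Y) = (stratum, exposure, outcome), all binary.
records = (
    [(0, 0, 1)] * 1 + [(0, 0, 0)] * 5    # L=0, unexposed: 6 people, 1 event
    + [(0, 1, 1)] * 2 + [(0, 1, 0)] * 2  # L=0, exposed:   4 people, 2 events
    + [(1, 0, 1)] * 1 + [(1, 0, 0)] * 1  # L=1, unexposed: 2 people, 1 event
    + [(1, 1, 1)] * 6 + [(1, 1, 0)] * 2  # L=1, exposed:   8 people, 6 events
)


def standardized_risk(records, a):
    """Sum over l of Pr[Y=1 | A=a, L=l] * Pr[L=l], estimated from counts."""
    n_total = len(records)
    n_l = defaultdict(int)        # people in each stratum of L
    n_la = defaultdict(int)       # people in stratum l with exposure a
    events_la = defaultdict(int)  # events among people in stratum l with exposure a
    for l, exposure, y in records:
        n_l[l] += 1
        if exposure == a:
            n_la[l] += 1
            events_la[l] += y

    risk = 0.0
    for l, count in n_l.items():
        if n_la[l] == 0:
            # Nobody in this stratum received exposure a: positivity fails here.
            raise ValueError(f"no one with A={a} in stratum L={l}")
        risk += (events_la[l] / n_la[l]) * (count / n_total)
    return risk


if __name__ == "__main__":
    risk_1 = standardized_risk(records, 1)
    risk_0 = standardized_risk(records, 0)
    print(risk_1, risk_0, risk_1 - risk_0)
```

For this toy dataset the standardized risks are 0.625 under exposure and 1/3 under no exposure. The function raises an error when a stratum contains nobody at the requested exposure level, since the stratum-specific risk is then undefined.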
  • 41. The general formula for the parametric g-formula For a single time point: $\Pr[Y^a = 1] = \sum_l \Pr[Y = 1 \mid A = a, L = l]\,\Pr[L = l]$ For multiple time points: [equation not preserved in the transcript]
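For reference, a standard textbook form of the g-formula for a static treatment strategy with an outcome measured at the end of follow-up is sketched below. It is an assumption that the slide showed this form; the notation (overbars for histories through time k, f for the conditional density of the covariates, K for the last time point) is introduced here for illustration.

```latex
\Pr\left[Y^{\bar{a}} = 1\right]
  = \sum_{\bar{l}}
    \Pr\left[Y = 1 \mid \bar{A}_K = \bar{a}_K,\ \bar{L}_K = \bar{l}_K\right]
    \prod_{k=0}^{K} f\left(l_k \mid \bar{a}_{k-1},\ \bar{l}_{k-1}\right)
```

With K = 0 the product reduces to Pr[L = l] and the expression collapses to the single-time-point standardization formula.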
  • 42. Ways of solving the g-formula For a (very) small number of covariates & time points, we can calculate by hand. In practice, we almost always have too many variables and/or time points. Instead, we can use Monte Carlo simulation. Packages exist in SAS and R (https://github.com/CausalInference). We can also use an iterative approach (Wen et al. 2020. Biometrics).
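A rough sketch of the Monte Carlo approach for two time points, compared against the by-hand sum. All conditional probabilities below (`P_L0`, `p_l1`, `p_y`) are hypothetical numbers chosen for illustration; in a real analysis they would be parametric models fitted to the observational data, and this is not the algorithm used by the SAS or R packages. Covariates are simulated forward in time while treatment is set by the strategy of interest rather than drawn from the data.

```python
import random
from itertools import product

# Hypothetical conditional distributions (stand-ins for fitted models).
P_L0 = 0.4  # Pr[L0 = 1]


def p_l1(l0, a0):
    """Pr[L1 = 1 | L0 = l0, A0 = a0]."""
    return 0.2 + 0.4 * l0 - 0.1 * a0


def p_y(l0, a0, l1, a1):
    """Pr[Y = 1 | L0 = l0, A0 = a0, L1 = l1, A1 = a1]."""
    return 0.2 + 0.1 * l0 + 0.3 * l1 - 0.05 * a0 - 0.1 * a1


def exact_risk(a0, a1):
    """By-hand g-formula: sum over every covariate history (l0, l1)."""
    total = 0.0
    for l0, l1 in product((0, 1), repeat=2):
        pr_l0 = P_L0 if l0 else 1 - P_L0
        pr_l1 = p_l1(l0, a0) if l1 else 1 - p_l1(l0, a0)
        total += p_y(l0, a0, l1, a1) * pr_l1 * pr_l0
    return total


def monte_carlo_risk(a0, a1, n_sim=200_000, seed=2021):
    """Simulate covariates forward in time with treatment fixed at (a0, a1)."""
    rng = random.Random(seed)
    events = 0
    for _ in range(n_sim):
        l0 = 1 if rng.random() < P_L0 else 0
        l1 = 1 if rng.random() < p_l1(l0, a0) else 0
        y = 1 if rng.random() < p_y(l0, a0, l1, a1) else 0
        events += y
    return events / n_sim


if __name__ == "__main__":
    for strategy in ((1, 1), (0, 0)):
        print(strategy, exact_risk(*strategy), monte_carlo_risk(*strategy))
```

With two binary covariates the exact sum has only four terms, so the simulation is unnecessary here; it is shown because the same loop still works when the number of covariate histories is far too large to enumerate.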
  • 43. But often we don’t have observational data either Not enough data available Existing treatments in new populations Novel indications for existing treatments Hypothetical exposures or treatments What else can we do?
  • 44. Emulate with simulation modeling Simulation-based approaches give us faster decisions and require less data A tool for combining our subject matter knowledge, available data, and best guesses into a quantitative prediction
  • 45. What types of simulation modeling are available? Group-level: Compartmental models Markov cohorts SIR models Unit-level: Individual-level simulation models Agent-based models Microsimulations
  • 46. Group-level models lump people together by characteristics Susceptible Infected Recovered
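As an illustration of a group-level model, the sketch below steps a basic SIR compartmental model forward in small time increments. The parameter values (`beta`, `gamma`), the step size, and the function name are assumptions made for this example, not values from the talk. Everyone within a compartment is treated as interchangeable, which is exactly the "lumping" described on the slide.

```python
def simulate_sir(s0, i0, r0, beta, gamma, days, dt=0.1):
    """Forward-Euler simulation of a closed-population SIR model.

    beta  : transmission rate per day (hypothetical value supplied by caller)
    gamma : recovery rate per day (hypothetical value supplied by caller)
    Returns a list of (time, S, I, R) tuples.
    """
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r
    history = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for step in range(1, steps + 1):
        new_infections = beta * s * i / n * dt  # flow from S to I
        new_recoveries = gamma * i * dt         # flow from I to R
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((step * dt, s, i, r))
    return history


if __name__ == "__main__":
    trajectory = simulate_sir(s0=990, i0=10, r0=0, beta=0.3, gamma=0.1, days=100)
    peak_time, _, peak_infected, _ = max(trajectory, key=lambda row: row[2])
    print(f"peak of {peak_infected:.0f} infected at day {peak_time:.1f}")
```

The model tracks only three numbers per time step, so any characteristic that is not a compartment (age, contacts, behavior) is assumed not to matter for the question being asked.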
  • 47. Individual-level models allow people to live different (virtual) lives
  • 48. Group-level models make assumptions about effect modification  What types of people are sufficiently the same from the perspective of your causal question?  What information is necessary to track and which information can be discarded?  Which strata of effect modifiers are important?
  • 49. Individual-level models make assumptions about “risk factors”  What types of people are in your population?  What determines who comes in contact with whom?  What determines who gets sick?  What determines how long until someone recovers or dies?
  • 50. Individual-level simulation models can be built like layer cakes Individual life history layer Environmental exposures layer Contact pattern layer
  • 51. The more layers, the more assumptions  Each layer has its own assumptions  The choice of layers is an assumption about what is important  The flow of information between layers has assumptions  Together, these are the “structure” assumptions
  • 52. But that’s not all! If we want our model to tell us about decisions we can take in the real world, then we need our model to replicate the real world. Let’s consider the simplest model  Individuals in a single layer  Cannot interact  Have no agency  Have no environment or neighborhood
  • 53. Assumption – Data trade-off Observational emulation: Goal: causal inference in one population; more data required, fewer assumptions. Agent-based models: Goal: causal inference in many populations; less data required, more assumptions. Murray, 2016. Agent-based models for causal inference. Harvard University.
  • 54. All causal inference requires assumptions Simulation models can be thought of as a way to emulate what we would have learned if we had conducted an observational or randomized study So, we need to also make all the assumptions we would have made in those studies when making causal inference!
  • 55. What assumptions do we need? No unmeasured confounding: all common causes of the treatment and outcome are known and measured in the data No open colliders: all common effects of the treatment and outcome are known and not conditioned on in the data or analysis
  • 56. What assumptions do we need? Positivity: there is a non-zero probability of all levels of treatment for all types of individuals in our population
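One simple empirical check related to positivity is to look for covariate strata in which some treatment level never occurs. The sketch below assumes the data are available as hypothetical (stratum, treatment) pairs; the function name and example values are illustrative. An empty cell in the data can reflect chance or a structural violation, and the code cannot tell these apart.

```python
from collections import defaultdict


def positivity_violations(records, treatment_levels):
    """Return {stratum: [missing treatment levels]} for strata with empty cells.

    records          : iterable of (stratum, treatment) pairs
    treatment_levels : every treatment level that should be possible
    """
    required = set(treatment_levels)
    seen = defaultdict(set)
    for stratum, treatment in records:
        seen[stratum].add(treatment)

    violations = {}
    for stratum, observed in seen.items():
        missing = required - observed
        if missing:
            violations[stratum] = sorted(missing)
    return violations


if __name__ == "__main__":
    # Hypothetical example: nobody in the "old" stratum was treated.
    data = [
        ("young", "treated"),
        ("young", "untreated"),
        ("old", "untreated"),
        ("old", "untreated"),
    ]
    print(positivity_violations(data, ["treated", "untreated"]))
```

The check only covers strata that appear in the data and only the covariates used to define the strata, so passing it does not establish that positivity holds.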
  • 57. What assumptions do we need?  Consistency: our treatment levels are clearly specified, aka:  Well-defined interventions  Well-defined causal questions
  • 58. Why is target trial emulation using simulation models harder than observational studies or trials?  We need these 3 assumptions for all trial emulation methods  but for simulation models we need these assumptions to hold for every pair of variables in our model!  This starts to get extra tricky! Murray, 2016. Agent-based models for causal inference. Harvard University. Murray et al, Am J Epidemiol 2017 ; 186(2):
  • 59. Special challenge: well-defined mediators To answer our question with observational data, we need a well-defined question about treatment To answer our question with a simulation model, we also need well-defined questions about CD4 cell count. Murray et al, Med Dec Making 2020; 40 (1),
  • 60. Parameterizing mediators We need a value for: Effect of antiretroviral therapy initiation time on mortality when CD4 count is held fixed at some value How do we “fix” it? What value do we choose? Does it matter? Murray et al, Med Dec Making 2020; 40 (1),
  • 61. Parameters must be externally valid If we use more than one data source for our model parameters, we need every parameter to be externally valid. This requires more assumptions!
  • 62. External validity requires:  No unmeasured outcome causes: all causes of outcome are known and either modeled or identically distributed between populations  (for every variable in the model that looks like an ‘outcome’)
  • 63. So, does it work? Can we change our assumptions into knowledge & make causal inference using simulation-based approaches? Yes, but only if: All our assumptions are correct!
  • 64. But: Even in the simplest case, it is hard to get the assumptions right! Even if we get them right, we never know for sure they are right! When we add in disease transmission, things get even harder!
  • 65. What about transmission? Reminder, we started with models where individuals  Cannot interact  Have no agency  Have no environment or neighborhood The target trial framework can help us understand how we could relax our assumptions and still estimate causal effects that we can define and understand
  • 66. Agent-based model results can be hard to interpret What do we really want to know? Control Intervention Connections: Shared HIV risk Index: shaded brown or red nodes Nearest neighbors: outlined nodes
  • 67. How does the target trial help? Problem: Causal inference typically requires the assumption of no interference, but many topics violate this assumption  infectious diseases  behavioral interventions  educational interventions Solution: design our simulations to emulate cluster-randomized, two-stage, ring-vaccine, or other interference-friendly trial designs
  • 68. Adapted from: Halloran and Struchiner ( Buchanan, et al. AJE 2021 Interference makes decision-making hard… …but we have frameworks for understanding effects under interference in randomized trials
  • 69. Buchanan, et al. AJE 2021 Murray, et al. AJE 2021 understand results of agent-based models and network analyses
  • 70. Summary 1: Asking good questions is hard But the clearer we are about what we are asking, the easier it is to make use of the answer
  • 71. Summary 2: Answering our questions well is also hard [Figure: spectrum from “Unbiased” to “Intractably biased”. Approaches shown: ideal randomized controlled trial, explanatory randomized controlled trials, pragmatic randomized trials, observational studies, simulation studies, relying on “gut feeling”. Problems shown: loss to follow-up, non-adherence, baseline confounding, lack of generalizability, ill-defined uncertainty]
  • 72. How do we estimate causal effects? Miguel Hernán’s two-step causal algorithm: 1. Ask good questions 2. Answer them with appropriate methods ? !

Editor's Notes

  • #9: Even when we can actually intervene
  • #10: Even when we can actually intervene
  • #17: The solution is well-defined interventions – i.e. well-defined causal questions
  • #29: Note with these definitions, we don’t need to distinguish between per-protocol versus as-treated
  • #44: “we may be interested in”
  • #54: Both use Monte-Carlo simulation to estimate counterfactual outcome distributions. Microsimulation requires knowledge about mechanisms; the g-formula requires data about individuals.
  • #59: No unmeasured confounding for any pair of variables. Positivity for every variable in our model, or rules that dictate who can & can’t have it. Consistency (well-defined intervention) for every variable in our model.
  • #69: Indirect effect == disseminated effect Total effect == composite effect