CS592 Presentation #20
Causal Effect Inference
with Deep Latent-Variable Models
20173586 Jeongmin Cha
20193168 Hyunsu Kim
20183239 Jongjin Park
C. Louizos, U. Shalit, J. Mooij, D. Sontag, R. Zemel and M. Welling
Contents
1. Introduction
2. Identification of causal effect
3. Causal effect variational autoencoder
4. Experiments
5. Group Discussion Point
Introduction
[Figure: causal graph over a latent confounder Z, its proxy X, the treatment t, and the outcome y]
Introduction
[Figure: the same graph with the running example filled in: Z = socio-economic status, X = average income, t = medicine, y = # of deaths]
Introduction
[Figure: a candidate causal graph over Z (socio-economic status), X (average income), t (medicine), y (# of deaths)]
Q] Is this reasonable?
Introduction
[Figure: an alternative candidate graph over the same variables]
Q] Isn’t this more general?
Introduction
[Figure: the same alternative graph]
Q] Isn’t this more general?
Could be, but X is just a noisy view on Z. → We can say “only Z can cause y and t”.
Introduction
[Figure: another candidate graph over the same variables]
Q] How about this case?
Introduction
[Figure: a candidate graph with X = average consumption instead of average income]
Q] How about this case?
Introduction
[Figure: the graph finally adopted: Z → X, Z → t, Z → y, t → y, with Z = socio-economic status, X = average income, t = medicine, y = # of deaths]
Identification of Causal Effect
● Individual Treatment Effect (ITE) — reconstructed below
● Example) “How will the number of deaths (y) vary if we do (t=1) or do not (t=0) treat the poor (Z), whose annual salary (X) is about 10,000 dollars?”
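The ITE definition on this slide was an equation image; a reconstruction consistent with the paper's setup:

$$ \mathrm{ITE}(x) := \mathbb{E}\left[y \mid X = x,\, do(t=1)\right] - \mathbb{E}\left[y \mid X = x,\, do(t=0)\right] $$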
Do-Calculus
● What if we set X to x?
● Interventional probability is generally not the same as conditional probability.
● Example) (illustrated below)
[Figure: two two-node causal graphs over X and Y]
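A worked contrast (my illustration of the point above, since the slide's example was an image): if the true graph is X → Y, intervening on X is equivalent to conditioning on it, whereas if the graph is X ← Y, intervening on X leaves Y untouched:

$$ X \to Y:\;\; p(y \mid do(x)) = p(y \mid x), \qquad X \leftarrow Y:\;\; p(y \mid do(x)) = p(y) $$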
Problem Setup
[Figure: the adopted causal graph with t=1, and the identification derivation with steps labeled “Bayes’ rule” and “Do-calculus”; reconstructed below]
( The case for t=0 is identical )
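The derivation itself was an image; a reconstruction following the paper's identification argument. The first equality marginalizes over z (Bayes' rule); the second applies do-calculus on the graph, since z blocks every back-door path from t to y and the intervention on t does not affect p(z | X):

$$ p(y \mid X, do(t=1)) = \int_z p(y \mid X, do(t=1), z)\, p(z \mid X, do(t=1))\, dz = \int_z p(y \mid t=1, z)\, p(z \mid X)\, dz $$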
When ‘t’ causes ‘X’
[Figure: the derivation repeated for a graph in which t also causes X, again with “Bayes’ rule” and “Do-calculus” steps]
( The case for t=0 is identical )
Causal effect variational autoencoder
Parametrize the causal graph as a latent-variable model with neural networks:
● Encoder (inference network): model q(z|x,t,y)
● Decoder (model network): model p(x|z)
Causal effect variational autoencoder
● Assume the observations factorize conditioned on the latent variables (model network; the factorization is reconstructed below)
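The factorization shown on the slide, reconstructed in LaTeX with the standard-normal prior on z the paper uses:

$$ p(z, x, t, y) = p(z)\, p(x \mid z)\, p(t \mid z)\, p(y \mid t, z), \qquad p(z) = \prod_{j} \mathcal{N}(z_j \mid 0, 1) $$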
Causal effect variational autoencoder (model network)
● First, compute the distribution p(t|z) and sample t
● Next, compute p(y|t,z) and sample y
○ For a continuous outcome: Gaussian distribution
○ For a discrete outcome: Bernoulli distribution
● Then compute p(x|z) for reconstruction (a sketch of this network follows)
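A minimal PyTorch-style sketch of the sampling path just described, assuming a binary outcome and binary proxies; the class name, hidden sizes, and likelihood choices are illustrative assumptions, not the authors' released implementation:

```python
import torch
import torch.nn as nn
import torch.distributions as dist

class ModelNetwork(nn.Module):
    """Generative model p(t|z) p(y|t,z) p(x|z) over a latent confounder z."""
    def __init__(self, z_dim=20, x_dim=25, h=200):
        super().__init__()
        self.t_logit = nn.Sequential(nn.Linear(z_dim, h), nn.ELU(), nn.Linear(h, 1))
        # Separate outcome heads for t=0 and t=1 (TARnet-style split).
        self.y0 = nn.Sequential(nn.Linear(z_dim, h), nn.ELU(), nn.Linear(h, 1))
        self.y1 = nn.Sequential(nn.Linear(z_dim, h), nn.ELU(), nn.Linear(h, 1))
        self.x_logit = nn.Sequential(nn.Linear(z_dim, h), nn.ELU(), nn.Linear(h, x_dim))

    def forward(self, z):
        # 1) p(t|z): Bernoulli treatment assignment.
        t = dist.Bernoulli(logits=self.t_logit(z)).sample()
        # 2) p(y|t,z): Bernoulli for a binary outcome (Gaussian if continuous).
        y = dist.Bernoulli(logits=torch.where(t.bool(), self.y1(z), self.y0(z))).sample()
        # 3) p(x|z): logits for reconstructing the binary proxies x.
        return t, y, self.x_logit(z)
```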
Causal effect variational autoencoder
A neural network outputs the parameters of a posterior approximation over the latent variables z, e.g. a Gaussian (a sketch follows).
[Figure: inference network taking (x, t, y) as input, with separate heads q(z|x,t=0,y) and q(z|x,t=1,y) for the mean and variance parameters]
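A matching sketch of the encoder: a shared representation of (x, y) feeds two treatment-specific heads for the Gaussian mean and log-variance, mirroring the q(z|x,t=0,y) / q(z|x,t=1,y) split on the slide; hidden sizes and names are again illustrative assumptions:

```python
import torch
import torch.nn as nn

class InferenceNetwork(nn.Module):
    """Posterior approximation q(z|x,t,y) with treatment-specific heads."""
    def __init__(self, x_dim=25, z_dim=20, h=200):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(x_dim + 1, h), nn.ELU())  # input (x, y)
        self.mu = nn.ModuleList([nn.Linear(h, z_dim) for _ in range(2)])
        self.log_var = nn.ModuleList([nn.Linear(h, z_dim) for _ in range(2)])

    def forward(self, x, t, y):
        rep = self.shared(torch.cat([x, y], dim=-1))
        # Select the head matching each sample's observed treatment.
        mu = torch.where(t.bool(), self.mu[1](rep), self.mu[0](rep))
        log_var = torch.where(t.bool(), self.log_var[1](rep), self.log_var[0](rep))
        # Reparameterized sample z ~ N(mu, sigma^2) for the ELBO.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)
        return z, mu, log_var
```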
Causal effect variational autoencoder
The objective is the ELBO from the original VAE (reconstructed below).
[Figure: the model network and inference network side by side, with heads q(z|x,t=0,y) and q(z|x,t=1,y)]
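The ELBO from the slide, reconstructed for the model above (per-sample latent z_i, approximate posterior q):

$$ \mathcal{L} = \sum_{i=1}^{N} \mathbb{E}_{q(z_i \mid x_i, t_i, y_i)}\!\left[ \log p(x_i, t_i \mid z_i) + \log p(y_i \mid t_i, z_i) + \log p(z_i) - \log q(z_i \mid x_i, t_i, y_i) \right] $$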
Causal effect variational autoencoder
● We need two auxiliary distributions that predict t and y for new samples (inference network)
○ For a continuous outcome: Gaussian distribution
○ For a discrete outcome: Bernoulli distribution
● ELBO term (reconstructed above)
● The objective of the Causal Effect Variational Autoencoder (CEVAE): the ELBO plus the two auxiliary terms (reconstructed below)
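The full objective, reconstructed: the ELBO plus the auxiliary log-likelihoods of the observed treatment and outcome (starred values denote the observed ones):

$$ \mathcal{F}_{\mathrm{CEVAE}} = \mathcal{L} + \sum_{i=1}^{N} \left( \log q(t_i = t_i^{*} \mid x_i^{*}) + \log q(y_i = y_i^{*} \mid x_i^{*}, t_i^{*}) \right) $$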
Causal effect variational autoencoder
Q: Why are these two extra terms needed?
● The objective of the Causal Effect Variational Autoencoder (CEVAE)
Causal effect variational autoencoder
Q: Why are these two extra terms needed?
● For unseen data x, we need to know the treatment assignment t and the outcome y before inferring the distribution over z.
● So we need two auxiliary distributions that predict t and y for new samples; the two extra terms estimate the parameters of these distributions.
Experiment - Dataset
Three main experiments in the paper:
1. Two existing benchmark datasets
● IHDP (Infant Health and Development Program), Jobs
2. A synthetic toy dataset
3. A new benchmark
● based on twin births and deaths in the USA
Experiment - Implementation
● Neural network architecture
○ NNs with 3 hidden layers and ELU nonlinearities for:
■ the approximate posterior over the latent variables q(Z|X,t,y)
■ the generative model p(X|Z)
■ the outcome models p(y|t, Z), q(y|t, X)
○ NNs with a single hidden layer and ELU nonlinearities for:
■ the treatment models p(t|Z), q(t|X)
● Latent variable
○ 20-dimensional latent variable z
○ weight decay term for all of the parameters
Experiment - Baseline models
● LR1 = a single logistic regression
● LR2 = two separate logistic regressions, fit
○ to the treated (t=1)
○ to the control (t=0)
● TARnet
○ a feed-forward NN for causal inference
Experiment 1 - Benchmark datasets
● IHDP (Infant Health and Development Program)
● the effect of home visits by specialists on future cognitive test scores
● Metrics (written out below)
○ PEHE (Precision in Estimation of Heterogeneous Effect)
○ absolute error on ATE (Average Treatment Effect)
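For reference, standard definitions of these metrics (written with per-unit potential outcomes y_{i1}, y_{i0} and model estimates ŷ; PEHE is often reported as its square root):

$$ \epsilon_{\mathrm{PEHE}} = \frac{1}{N} \sum_{i=1}^{N} \left( (y_{i1} - y_{i0}) - (\hat{y}_{i1} - \hat{y}_{i0}) \right)^2, \qquad \epsilon_{\mathrm{ATE}} = \left| \frac{1}{N} \sum_{i=1}^{N} (y_{i1} - y_{i0}) - \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_{i1} - \hat{y}_{i0}) \right| $$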
Experiment 1 - Benchmark datasets
[Table: IHDP results]
Experiment 1 - Benchmark datasets
● Jobs: the effect of job training (treatment) on employment after training (outcome)
● Metrics (written out below)
○ absolute error on Average Treatment effect on the Treated (ATT)
○ Policy risk
■ acts as a proxy to the individual treatment effect.
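One common formulation of these two metrics (my paraphrase of the literature, not copied from the slide; π(x) denotes the policy that treats when the estimated effect is positive):

$$ \mathrm{ATT} = \mathbb{E}\left[ y_1 - y_0 \mid t = 1 \right], \qquad R_{\mathrm{pol}}(\pi) = 1 - \mathbb{E}\left[ y_1 \mid \pi(x) = 1 \right] p(\pi = 1) - \mathbb{E}\left[ y_0 \mid \pi(x) = 0 \right] p(\pi = 0) $$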
Experiment 1 - Benchmark datasets
[Table: Jobs results]
Experiment 2 - Synthetic experiment on toy data
● the marginal distribution of X is a mixture of Gaussians
● the hidden variable Z determines the mixture component
● the true latent variable z is binary
● Question: how well can models recover the true latent variable? (a sketch of such a generator follows)
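A hypothetical generator in this spirit: a binary confounder selects the Gaussian mixture component of X and also drives t and y. The specific constants are illustrative assumptions, not necessarily the paper's:

```python
import numpy as np

def simulate_toy(n, sigma0=3.0, sigma1=5.0, rng=None):
    """Toy data: a binary confounder z picks the mixture component of x
    and also drives treatment t and outcome y (illustrative parameters)."""
    rng = rng or np.random.default_rng(0)
    z = rng.binomial(1, 0.5, size=n)                               # hidden confounder
    x = rng.normal(loc=z, scale=np.where(z == 1, sigma1, sigma0))  # Gaussian-mixture proxy
    t = rng.binomial(1, 0.75 * z + 0.25 * (1 - z))                 # confounded treatment
    p_y = 1.0 / (1.0 + np.exp(-3.0 * (z + 2.0 * (2 * t - 1))))     # outcome probability
    y = rng.binomial(1, p_y)
    return x, t, y, z
```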
Experiment 2 - Synthetic experiment on toy data
● CEVAE bin: 5-dim binary latent z
○ the latent model is correctly specified
○ better at every sample size
Experiment 2 - Synthetic experiment on toy data
● CEVAE cont: 5-dim continuous latent z
○ the latent model is not correctly specified
● More samples are required for the latent space to imitate the true latent variable closely
Experiment 3 - Binary treatment outcome on Twins
● Introduce a new benchmark dataset about twin births in the USA
● t = 1
○ being born the heavier of the two
● y (outcome)
○ mortality of each of the twins in their first year of life
● Z (latent variable)
○ the number of gestation weeks, binned into 10 categories (20 weeks, 20-27, 27-34, …)
● X (proxy variables)
○ a noisy view of Z
○ the encoded vector for Z is flipped with probability 0.05-0.5 (a sketch follows)
○ a flip probability of 0.5 means X carries no direct information about Z
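A hypothetical sketch of the proxy construction described above: one-hot encode the hidden gestation category and flip each bit independently with the given probability. The function name and shapes are assumptions for illustration:

```python
import numpy as np

def noisy_proxy(z, n_categories=10, flip_prob=0.05, rng=None):
    """One-hot encode the hidden category z, then flip each bit with
    probability flip_prob to produce the noisy proxy X (illustrative)."""
    rng = rng or np.random.default_rng(0)
    onehot = np.eye(n_categories, dtype=int)[z]    # shape (n, n_categories)
    flips = rng.random(onehot.shape) < flip_prob   # which bits to corrupt
    return np.where(flips, 1 - onehot, onehot)
```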
Experiment 3 - Binary treatment outcome on Twins
● inferring the mortality of the unobserved twin (counterfactual)
Experiment 3 - Binary treatment outcome on Twins
● inferring the mortality of the unobserved twin (counterfactual)
Q: Why do all methods perform similarly when the proxy noise is small?
Experiment 3 - Binary treatment outcome on Twins
● inferring the mortality of the unobserved twin (counterfactual)
Q: Why do all methods perform similarly when the proxy noise is small?
● X ≃ Z, so X is nearly as informative as Z itself
● Only CEVAE models Z; the others use only X
● Z (the gestation-length feature) is very informative
Experiment 3 - Binary treatment outcome on Twins
● inferring the mortality of the unobserved twin (counterfactual)
● nh = number of hidden layers
● for CEVAE, larger nh gives better AUC
Experiment 3 - Binary treatment outcome on Twins
● inferring the mortality of the unobserved twin (counterfactual)
● nh = number of hidden layers
● for CEVAE, larger nh gives better AUC
Q: “TARnet (nh=0) == LR2”, but why do [CEVAE nh=0] and [LR2] perform differently at higher proxy-noise levels?
Experiment 3 - Binary treatment outcome on Twins
● inferring the mortality of the unobserved twin (counterfactual)
● nh = number of hidden layers
● for CEVAE, larger nh gives better AUC
Q: “TARnet (nh=0) == LR2”, but why do [CEVAE nh=0] and [LR2] perform differently at higher proxy-noise levels?
● LR2 relies directly on the noisy proxies instead of the inferred latent state.
Experiment 3 - Binary treatment outcome on Twins
● inferring the average treatment effect
● CEVAE nh=0: not so good
● CEVAE is robust to increasing proxy noise
Group Discussion Point
Balancing Neural Network (BNN) vs. CEVAE
Group Discussion Point
Balancing Neural Network (BNN) vs. CEVAE
- Model choice
- Discriminative model vs. Generative model
- CEVAE learns a latent variable z (i.e. the unobserved confounder), which BNN does not
- Architecture
- CEVAE splits its outputs for each treatment group in t after a shared representation, so it can learn the effect of t more explicitly
[Figure: BNN architecture vs. CEVAE (~ TARnet architecture)]
Thank you