A whirlwind tour of Causal Inference
The Path From Cause To Effect
By
Viswanath Gangavaram
Senior Data Scientist
Topics
• Introduction to Causal Inference
• Simpson's Paradox: let's understand the damage done by lurking variables (selection bias at play)
• Ceteris Paribus: Other Things Equal Comparison
• Rubin Causal Model: Potential Outcomes, Observed Outcomes, Counterfactuals
• Average Treatment Effects ( ATE )
• Five important theorems from the world of propensity score theory
• IPTW: Slaying the Lurking Variable
• Doubly Robust Estimation
• Instrumental Variables: 2SLS, Complier Average Causal Effect
• Heterogeneous Treatment Effects
• Single Model Approach (SMA), Two Model Approach (TMA)
• Transformed Outcome Approach
• Applications of Causal Inference
• Counterfactual Inference framework for Learning To Rank
• Churn reasons through a Counterfactual Inference framework
• Causal Explanations on Quasi-Experiments
Simpson's Paradox
Y=1 Y=0 Row Sums Success Rate
T=1 350 3,650 4,000 0.088
T=0 500 6,500 7,000 0.071
Col Sums 850 10,150 11,000
Naive difference in observed success rates:
E[Y1 | T=1] − E[Y0 | T=0] = 0.088 − 0.071 = 0.017
X=1
Y=1 Y=0 Row Sums Success Rate
T=1 300 2,700 3,000 0.100
T=0 300 2,700 3,000 0.100
Col Sums 600 5,400 6,000
X=0
Y=1 Y=0 Row Sums Success Rate
T=1 50 950 1,000 0.050
T=0 200 3,800 4,000 0.050
Col Sums 250 4,750 5,000
• Standardization: stratification followed by normalization
E[Y1] = (6000/11000)*0.1 + (5000/11000)*0.05 = 0.077
E[Y0] = (6000/11000)*0.1 + (5000/11000)*0.05 = 0.077
In later slides, we will resolve Simpson's Paradox with Inverse Propensity Treatment Weighting (IPTW), a causal inference technique.
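As a minimal sketch (plain Python, using only the counts from the stratified tables above), the standardized means can be computed directly:

```python
# Standardization on the stratified tables: weight each stratum's
# success rate by the stratum's share of the whole population, P(X=x).
n_total = 11_000
strata = [
    # (stratum size, success rate under T=1, success rate under T=0)
    (6_000, 300 / 3_000, 300 / 3_000),   # X=1
    (5_000, 50 / 1_000, 200 / 4_000),    # X=0
]

ey1 = sum(n / n_total * r1 for n, r1, _ in strata)
ey0 = sum(n / n_total * r0 for n, _, r0 in strata)
print(f"E[Y1] = {ey1:.4f}, E[Y0] = {ey0:.4f}, ATE = {ey1 - ey0:.4f}")
# E[Y1] = 0.0773, E[Y0] = 0.0773, ATE = 0.0000
```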
Standardization
Pitfalls:
• Stratification: some strata might have no representation from one of the treatment groups (a positivity violation), leaving that stratum's comparison undefined
Ceteris Paribus: Other Things Equal Comparison
“The notion of Ideal Experiment disciplines our approach to causal inference”
• Comparisons made under ceteris paribus conditions have a causal interpretation
• The craft of causal inference uses data to get to "other things equal" in spite of the obstacles, called selection bias or omitted variables, found on the path running from raw numbers to causal knowledge
• Random assignment, together with the Law of Large Numbers, ensures ceteris paribus in Randomized Controlled Trials
• Random assignment isn't the same as holding everything else fixed, but it has the same effect
• The notion of an ideal experiment disciplines our approach to causal inference: it is nothing but ensuring ceteris paribus before making a causal statement
Rubin Causal Model or Potential Outcome Framework
• Think of the potential outcomes Y0 and Y1 as the outcomes we would see under each possible treatment option
• Notation: Ya is the outcome that would be observed if treatment were set to A=a
• Counterfactual outcomes are the ones that would have been observed, had the treatment been different
• Average Causal Effect: E[Y1 – Y0]
Average Causal Effect through Potential Outcome Framework
Population Of Interest
World 1: everyone gets A=0 → mean(Y)
World 2: everyone gets A=1 → mean(Y)
The difference between the two means is the Average Causal Effect: E[Y1 – Y0]
• E[Y1 – Y0] ≠ E[Y|A=1] – E[Y|A=0]
• Conditioning vs. setting: the right-hand side compares two different populations
• E[Y1/Y0]: causal relative risk
• E[Y1 – Y0 | A=1]: causal effect of treatment on the treated
• Heterogeneity of Treatment Effects
• Fundamental Problem Of Causal Inference
• How do we use observed data to link observed outcomes to potential outcomes?
• What assumptions are necessary to estimate causal effects from observed data?
Untestable Causal Assumptions to link Observed and Potential Outcomes
• Stable Unit Treatment Value Assumption ( SUTVA )
• No Interference
• One Version Of Treatment
• Consistency Assumption: Y = Ya if A=a for all a
• Positivity Assumption: P(A=a | X=x) > 0 for all a and x
• Ensures variability in treatment assignment, which is needed to identify causal effects
• Ignorability Assumption: (Y0, Y1) ⊥ A | X
• Ensures treatment is as good as randomly assigned given the covariates
• In the Simpson's Paradox example, the ignorability assumption is violated unless we condition on X
• Linking observed data and potential outcomes:
• E[Y | A=a, X=x] = E[Ya | A=a, X=x], by the consistency assumption
• = E[Ya | X=x], by the ignorability assumption
• Averaging over X then identifies E[Ya] = Σx E[Y | A=a, X=x] P(X=x), which is exactly standardization
Difference in group means = Average Causal Effect + Selection Bias

Assume a constant causal effect: Y1i = Y0i + k. Then

E[Yi | Ti=1] − E[Yi | Ti=0]
  = E[Y1i | Ti=1] − E[Y0i | Ti=0]
  = E[Y0i + k | Ti=1] − E[Y0i | Ti=0]
  = k + ( E[Y0i | Ti=1] − E[Y0i | Ti=0] )
  = k + Selection Bias

Random assignment eliminates selection bias: it makes E[Y0i | Ti=1] = E[Y0i | Ti=0], so the same chain collapses to

E[Yi | Ti=1] − E[Yi | Ti=0] = k + ( E[Y0i | Ti=1] − E[Y0i | Ti=0] ) = k
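A small simulation (synthetic data, illustrative only) makes the decomposition concrete: under self-selection the naive difference in group means overstates k, while under random assignment it recovers k:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200_000, 2.0
y0 = rng.normal(size=n)          # potential outcome without treatment
y1 = y0 + k                      # constant causal effect: Y1 = Y0 + k

# Self-selection: units with larger Y0 opt into treatment more often,
# so E[Y0 | T=1] > E[Y0 | T=0] and the selection bias term is positive.
t_self = rng.random(n) < 1 / (1 + np.exp(-2 * y0))
# Random assignment: T is independent of (Y0, Y1), so the bias term is zero.
t_rand = rng.random(n) < 0.5

for label, t in [("self-selected", t_self), ("randomized", t_rand)]:
    y = np.where(t, y1, y0)                 # observed outcome
    diff = y[t].mean() - y[~t].mean()       # difference in group means
    bias = y0[t].mean() - y0[~t].mean()     # E[Y0|T=1] - E[Y0|T=0]
    print(f"{label:13s}: diff = {diff:.2f} = k + bias ({k} + {bias:.2f})")
```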
Propensity Score & Propensity Score Matching
• Propensity score: e(X) = P(A=1 | X), the probability of receiving treatment given the covariates
• The propensity score is a balancing score: conditional on e(X), the distribution of X is the same in the treated and control groups
• Propensity score matching: pair each treated unit with a control unit whose propensity score is close
• Shortcomings: matching discards some data (unmatched units)
Propensity Score Matching
[Figure: distribution of propensity scores before & after matching]
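A hedged sketch of 1-nearest-neighbour matching on an estimated propensity score (the function name psm_att and the caliper value are illustrative, not from the slides); note how unmatched treated units are simply dropped, the shortcoming listed above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def psm_att(X, a, y, caliper=0.05):
    """Effect of treatment on the treated via 1-NN propensity score matching."""
    e = LogisticRegression().fit(X, a).predict_proba(X)[:, 1]  # e(x)=P(A=1|X=x)
    treated = np.flatnonzero(a == 1)
    controls = np.flatnonzero(a == 0)
    diffs = []
    for i in treated:
        j = controls[np.argmin(np.abs(e[controls] - e[i]))]  # closest control
        if abs(e[j] - e[i]) <= caliper:      # drop poor matches: data discarded
            diffs.append(y[i] - y[j])
    return float(np.mean(diffs))
```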
Five important theorems from the world of propensity score theory (Rosenbaum & Rubin, 1983)
1. The propensity score is a balancing score.
2. Any score that is finer than the propensity score is a balancing score; moreover, x is the finest balancing score and the propensity score is the coarsest.
3. If treatment assignment is strongly ignorable given x, then it is strongly ignorable given any balancing score.
4. At any value of a balancing score, the difference between the treatment and control means is an unbiased estimate of the average treatment effect at that value of the balancing score, if treatment assignment is strongly ignorable. Consequently, with strongly ignorable treatment assignment, pair matching on a balancing score, subclassification on a balancing score, and covariance adjustment on a balancing score can all produce unbiased estimates of treatment effects.
5. Using sample estimates of balancing scores can produce sample balance on x.
Inverse Propensity Treatment Weighting
• Rather than match, we could use all of the data, but down-weight some and up-weight
others
• This is accomplished by weighting by the inverse of the probability of treatment received
• For treated subjects, weight by the inverse of P(A=1 | X )
• For control subjects, weight by the inverse of P(A=0 | X)
This is known as Inverse Probability Of Treatment Weighting
• There is confounding in the original population
• IPTW creates a pseudo-population in which treatment assignment no longer depends on X
• There is no confounding in the pseudo-population
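A minimal sketch of the resulting Horvitz–Thompson style estimator, assuming a logistic-regression propensity model (the function name iptw_ate is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def iptw_ate(X, a, y):
    """ATE via inverse probability of treatment weighting: treated units get
    weight 1/e(X), controls 1/(1 - e(X)), forming the pseudo-population."""
    e = LogisticRegression().fit(X, a).predict_proba(X)[:, 1]
    n = len(y)
    ey1 = np.sum(a * y / e) / n                # weighted mean under treatment
    ey0 = np.sum((1 - a) * y / (1 - e)) / n    # weighted mean under control
    return ey1 - ey0
```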
IPTW
Resolving Simpson's Paradox through IPTW
• E[Y1] = ( 300*(1/0.5) + 50*(1/0.2) ) / 11,000 = 850/11,000 ≈ 0.077
• E[Y0] = ( 300*(1/0.5) + 200*(1/0.8) ) / 11,000 = 850/11,000 ≈ 0.077
• E[Y1] − E[Y0] = 0
• Note:
• Propensity scores in the X=1 stratum: P(T=1|X=1) = 0.5, P(T=0|X=1) = 0.5
• Propensity scores in the X=0 stratum: P(T=1|X=0) = 0.2, P(T=0|X=0) = 0.8
X=1
Y=1 Y=0 Row Sums Success Rate
T=1 300 2,700 3,000 0.100
T=0 300 2,700 3,000 0.100
Col Sums 600 5,400 6,000

X=0
Y=1 Y=0 Row Sums Success Rate
T=1 50 950 1,000 0.050
T=0 200 3,800 4,000 0.050
Col Sums 250 4,750 5,000
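The calculation above can be reproduced mechanically from the two tables (a sketch; here the propensity scores are read directly off the strata rather than estimated):

```python
import numpy as np

# Expand the two stratified tables into unit records: (x, t, y, count).
cells = [(1, 1, 1, 300), (1, 1, 0, 2_700), (1, 0, 1, 300), (1, 0, 0, 2_700),
         (0, 1, 1, 50),  (0, 1, 0, 950),   (0, 0, 1, 200), (0, 0, 0, 3_800)]
counts = [c[3] for c in cells]
x = np.repeat([c[0] for c in cells], counts)
t = np.repeat([c[1] for c in cells], counts)
y = np.repeat([c[2] for c in cells], counts)

e = np.where(x == 1, 0.5, 0.2)      # P(T=1|X): 0.5 if X=1, 0.2 if X=0
n = len(y)                          # 11,000
ey1 = np.sum(t * y / e) / n         # (300/0.5 + 50/0.2) / 11,000
ey0 = np.sum((1 - t) * y / (1 - e)) / n
print(ey1, ey0, ey1 - ey0)          # ≈ 0.0773, 0.0773, 0.0
```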
Some Terminology
• Marginal treatment probability: P(Wi = 1)
• Conditional treatment probability (propensity score): e(x) = P(Wi = 1 | Xi = x)
• The unit-level causal effect: τi = Yi(1) − Yi(0)
• Observed outcome: Yi^obs = Wi·Yi(1) + (1 − Wi)·Yi(0)
• Conditional Average Treatment Effect (CATE): τ(x) = E[Yi(1) − Yi(0) | Xi = x]
• Population Average Treatment Effect (ATE): τ = E[Yi(1) − Yi(0)]
Heterogeneous Treatment Effects: SMA & TMA
Single Model Approach: use conventional ML techniques with the observed outcome Yi^obs as the outcome and both the treatment Wi and the covariates Xi as features; the estimated CATE is the difference between the model's predictions at Wi=1 and Wi=0.
Two Model Approach: fit separate models (e.g., trees) for the observed outcome in each treatment group and difference their predictions.
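A hedged sketch of both approaches with scikit-learn regressors (model choice and function names are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def single_model_cate(X, w, y, X_new):
    """SMA: one model f(x, w); CATE(x) = f(x, 1) - f(x, 0)."""
    f = GradientBoostingRegressor().fit(np.column_stack([X, w]), y)
    one = np.ones(len(X_new))
    return (f.predict(np.column_stack([X_new, one]))
            - f.predict(np.column_stack([X_new, 1 - one])))

def two_model_cate(X, w, y, X_new):
    """TMA: separate models per treatment group; CATE(x) = f1(x) - f0(x)."""
    f1 = GradientBoostingRegressor().fit(X[w == 1], y[w == 1])
    f0 = GradientBoostingRegressor().fit(X[w == 0], y[w == 0])
    return f1.predict(X_new) - f0.predict(X_new)
```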
Heterogeneous Treatment Effects: Transformed Outcome Approach
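The standard transformed outcome for a binary treatment is Yi* = Yi^obs · (Wi − e(Xi)) / ( e(Xi)·(1 − e(Xi)) ), which satisfies E[Y* | X=x] = τ(x) under ignorability, so any off-the-shelf regression of Y* on X estimates the CATE. A minimal sketch (the regressor choice is illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def transformed_outcome_cate(X, w, y, e):
    """Regress the transformed outcome Y* on X; E[Y*|X=x] equals the CATE.
    e is the (known or estimated) propensity score P(W=1|X)."""
    y_star = y * (w - e) / (e * (1 - e))
    return GradientBoostingRegressor().fit(X, y_star)  # .predict(X_new) -> CATE
```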
Use Cases of Causal Inference
• Unbiased Learning-to-Rank with Biased Feedback: optimizing a propensity-weighted ERM objective
• Position bias in search rankings strongly influences how many clicks a result receives, so directly using click data as a training signal in Learning-to-Rank methods yields sub-optimal results
• To overcome this bias problem, the paper presents a Counterfactual Inference Framework that provides the theoretical basis for unbiased LTR via Empirical Risk Minimization despite biased data
• A Propensity-Weighted Ranking SVM enables discriminative learning from implicit feedback data, with click models taking the role of the propensity estimators
• Other quasi-experiment designs
• Use IPTW
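A sketch of the inverse-propensity-scoring correction at the heart of that framework: each click is reweighted by the inverse of its examination propensity, so position bias cancels in expectation (array names are illustrative, and this is a self-normalized variant, not the paper's exact objective):

```python
import numpy as np

def ips_ranking_risk(ranks, clicked, exam_prop):
    """IPS estimate of the average rank of relevant results: a click observed
    at a rarely-examined position counts for more, de-biasing raw click data.
    ranks: rank shown to the user; clicked: 0/1; exam_prop: P(examined)."""
    w = clicked / exam_prop                 # inverse-propensity weights
    return np.sum(w * ranks) / np.sum(w)    # self-normalized IPS average
```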
Instrumental Variables
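A minimal sketch of the simplest IV estimator: the Wald ratio for a binary instrument, which coincides with 2SLS when there is one instrument and no covariates, and estimates the Complier Average Causal Effect under the usual IV assumptions (the function name is illustrative):

```python
import numpy as np

def wald_late(z, t, y):
    """Complier (local) average treatment effect with a binary instrument z:
    reduced form (effect of z on y) divided by first stage (effect of z on t)."""
    reduced_form = y[z == 1].mean() - y[z == 0].mean()
    first_stage = t[z == 1].mean() - t[z == 0].mean()
    return reduced_form / first_stage
```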
References
1. A Crash Course in Causality (Jason A. Roy), Coursera
2. Causal inference at Uber: https://eng.uber.com/causal-inference-at-uber/
3. Judea Pearl and Dana Mackenzie, The Book of Why
4. Joshua Angrist and Jörn-Steffen Pischke, Mastering Metrics: The Path from Cause to Effect
5. Joshua Angrist and Jörn-Steffen Pischke, Mostly Harmless Econometrics