A Tree-Based Approach for Addressing Self-selection in Impact Studies with Big Data
Inbal Yahav, Galit Shmueli, Deepa Mani
Bar Ilan University (Israel); Indian School of Business (India)
Talk at HKUST Business School, Dept. of ISBSOM, May 16, 2017
PART A (BACKGROUND): EXPERIMENTS (& FRIENDS), RANDOMIZATION, AND CAUSAL INFERENCE
PART B: DEALING WITH SELF SELECTION (FOR CAUSAL INFERENCE)
PART C: OUR NEW TREE APPROACH (AN ALTERNATIVE TO PSM)
PART A (BACKGROUND):
EXPERIMENTS (& FRIENDS),
RANDOMIZATION, AND CAUSAL INFERENCE
Experimental Studies
• Goal: Causal inference
• Effects of causes (causal description) vs.
causes of effects (causal explanation)
• Manipulable cause
Randomization vs. self-selection
• RCT: manipulation + random assignment
• Quasi-experiment: manipulation + self-selection (or administrator selection)
Alternative explanations
• Random assignment → balanced groups
• Confound (third variable)
• Counterfactual (always?)
Experiments & Variations
• Randomized experiment (RCT), natural experiment,
quasi-experiment
• Lab vs. field experiments
Validity
External validity: generalization
Internal validity: alternative explanations,
heterogeneous treatment effect
PART B:
DEALING WITH SELF SELECTION
(FOR CAUSAL INFERENCE)
Self-selection: the challenge
• Large impact studies of an intervention
• Individuals/firms self-select intervention group/duration
• Even in RCT, some variables might remain unbalanced
How to identify and adjust for self-selection?
Three Applications
1. Impact of training on earnings (RCT): field experiment by the US govt
   • LaLonde (1986) compared to observational control
   • Re-analysis by PSM (Dehejia & Wahba, 1999, 2002)
2. Impact of e-Gov service in India (quasi-experiment): new online passport service
   • Survey of online + offline users
   • Bribes, travel time, etc.
3. Impact of outsourcing contract features on financial performance (observational)
   • Pricing mechanism
   • Contract duration
Common Approaches
• Heckman-type modeling
• Propensity Score Approach (Rubin & Rosenbaum)
Two steps:
1. Selection model: T = f(X)
2. Performance analysis on matched samples
Y = performance measure(s)
T = intervention
X = pre-intervention variables
Propensity Scores Approach
Y = performance measure(s); T = intervention; X = pre-intervention variables
Self-selection: P(T|X) ≠ P(T)
Step 1: Estimate selection model logit(T) = f(X) to compute propensity scores P(T|X)
Step 2: Use the scores to create matched samples
   PSM = use a matching algorithm
   PSS = divide the scores into bins
Step 3: Estimate the effect on Y (compare groups), e.g., a t-test or Y = b0 + b1·T + b2·X + b3·PS + e
(A minimal code sketch of these three steps follows.)
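To make the steps concrete, here is a minimal Python sketch of the generic propensity-score workflow; it is an illustration, not the exact procedure of any study cited here, and the column names (`T`, `Y`, `X_cols`) are hypothetical.

```python
# Minimal sketch of the three propensity-score steps (hypothetical column names).
import pandas as pd
import statsmodels.api as sm
from sklearn.neighbors import NearestNeighbors

def ps_effect(df, X_cols):
    # Step 1: logistic selection model -> propensity scores P(T=1|X)
    X = sm.add_constant(df[X_cols])
    fit = sm.Logit(df["T"], X).fit(disp=0)
    ps = pd.Series(fit.predict(X), index=df.index)

    # Step 2 (PSM): match each treated unit to its nearest control on the score
    treated = df[df["T"] == 1]
    control = df[df["T"] == 0]
    nn = NearestNeighbors(n_neighbors=1)
    nn.fit(ps.loc[control.index].to_numpy().reshape(-1, 1))
    _, idx = nn.kneighbors(ps.loc[treated.index].to_numpy().reshape(-1, 1))
    matched = control.iloc[idx.ravel()]
    # (The PSS variant would instead bin the scores, e.g., pd.qcut(ps, 5).)

    # Step 3: compare outcomes on the matched samples (difference in means)
    return treated["Y"].mean() - matched["Y"].mean()
```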
The Idea of PSM: Balancing
“The propensity score allows one to design and
analyze an observational (nonrandomized) study so
that it mimics some of the particular characteristics of
a randomized controlled trial. In particular, the
propensity score is a balancing score: conditional on
the propensity score, the distribution of observed
baseline covariates will be similar between treated and
untreated subjects.”
Study 1: Impact of training on financial gains
(LaLonde 1986)
Experiment: US govt program randomly assigns
eligible candidates to training program
• Goal: increase future earnings
• LaLonde (1986) shows:
Groups statistically equal in terms of demographics & pre-training earnings
→ ATE = $1,794 (p < 0.004)
Training effect:
Observational control group
LaLonde also compared with observational
control groups (PSID, CPS)
– experimental training group + obs control group
– shows training effect not estimated correctly with
structural equations
PSID = Panel Study of Income Dynamics
CPS = Westat’s Matched Current Population Survey (Social Security Administration)
From LaLonde's (1986) abstract: "This paper compares the effect on trainee earnings of an employment program that was run as a field experiment, where participants were randomly assigned to treatment and control groups, with the estimates that would have been produced by an econometrician. This comparison shows that many of the econometric procedures do not replicate the experimentally determined results, and it suggests that researchers should be aware of the potential for specification errors in other nonexperimental evaluations."
Table 4. Summary Statistics of Datasets Used by Dehejia and Wahba (1999)
(Averages with standard deviations in parentheses, computed directly from the
datasets at http://sekhon.berkeley.edu/matching/lalonde.html.)

Characteristic (variable)                        NSW Treatment    NSW Control    CPS Control
Age (age)                                        25.82 (7.16)     25.05 (7.06)   33.22 (11.05)
Years of schooling (educ)                        10.35 (2.01)     10.09 (1.61)   12.03 (2.87)
Proportion of Blacks (black)                     0.84 (0.36)      0.83 (0.38)    0.07 (0.26)
Proportion of Hispanics (hisp)                   0.06 (0.24)      0.11 (0.31)    0.07 (0.26)
Proportion married (married)                     0.19 (0.39)      0.15 (0.36)    0.71 (0.45)
Proportion of high school dropouts (nodegr)      0.71 (0.46)      0.83 (0.37)    0.30 (0.46)
Real earnings 24 months before training (re74)   2,096 (4,887)    2,107 (5,688)  14,024 (9,578.99)
Real earnings 12 months before training (re75)   1,532 (3,219)    1,267 (3,103)  13,642 (9,260)
Proportion of nonworkers in 1974 (u74)           0.71 (0.46)      0.75 (0.43)    0.88 (0.32)
Proportion of nonworkers in 1975 (u75)           0.60 (0.49)      0.68 (0.47)    0.89 (0.31)
Outcome: real earnings in 1978 (re78)            6,349 (7,867)    4,555 (5,484)  14,855 (9,658)
Sample size                                      185              260            15,991

Table 5. Training Effect in NSW Experiment: Comparison between Approaches
(Based on the DW99 sample. Tree-based results are split by presence/absence of a
high school degree. The overall tree-approach training effect is computed as a
weighted average of the HS-degree and HS-dropout nodes, computed for comparison
only; due to the …)
PSM:
Observational control group
Dehejia & Wahba (1999,2002) re-analyzed CPS
control group (n=15,991), using PSM
– Effects in the range $1,122 to $1,681, depending on settings
– "Best" setting effect: $1,360
– PSM uses only 119 control group members
How did Dehejia & Wahba use PSM?
D&W obtained training effects in the range $1,122 to $1,681 under
different PSM settings and several matching schemes:
• Subset selection with/without replacement, combined with low-to-
high/high-to-low/random/nearest-neighbor (NN)/caliper matching.
• DW02 show that selection with replacement followed by NN
matching best captures the effect of the training program. However,
other matching schemes often yield poor performance, such as a
negative training effect.
• The overall training effect under their best settings (they can
compare to the actual experimental results!) is reported as $1,360.
• Note: This setting leads to dropping most of the 15,991 control
records, with only 119 records remaining in the control group.
Bottom line:
The researcher must make choices about settings and
parameters/bins; different choices can lead to different results.
PART C:
OUR NEW TREE APPROACH
(AN ALTERNATIVE TO PSM)
MIS Quarterly, Vol. 40, No. 4, pp. 819-848, December 2016
Challenges of PS in Big Data
1. Matching leads to severe data loss
2. PS methods suffer from “data dredging”
3. No variable selection (cannot identify variables that
drive the selection)
4. Assumes constant intervention effect
5. Sequential nature is computationally costly
6. Logistic model requires researcher to specify exact
form of selection model
Proposed Solution: Tree-Based Approach
"Kill the intermediary": instead of going from (Y, T, X) to propensity scores P(T|X) and only then to E(Y|T), go directly from (Y, T, X) to E(Y|T), and even E(Y|T,X).
Classification Tree
Output: T (treat/control)
Inputs: X’s (income, edu, family…)
Records in each terminal node share same
profile (X) and same propensity score P(T=1| X)
Tree Creation
Which algorithm? Conditional-inference trees (Hothorn et al., 2006):
– Stop tree growth using statistical tests of independence (a simplified sketch follows)
– Binary splits
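The flavor of that stopping rule can be sketched as follows. This is a simplified stand-in (chi-square tests with a Bonferroni correction) for the permutation-test framework of Hothorn et al., assuming a binary T and categorical candidate splitters:

```python
# Simplified sketch of a test-based stop/split decision (not the full ctree).
import pandas as pd
from scipy.stats import chi2_contingency

def split_or_stop(df, candidates, alpha=0.05, T_col="T"):
    # p-value of an independence test between T and each candidate splitter
    pvals = {x: chi2_contingency(pd.crosstab(df[x], df[T_col]))[1]
             for x in candidates}
    best = min(pvals, key=pvals.get)
    # Bonferroni adjustment across the candidate variables
    if min(1.0, pvals[best] * len(candidates)) > alpha:
        return None   # stop: no X is significantly associated with T
    return best       # otherwise, split on the most associated X
```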
Tree-Based Approach
Four steps (plus an optional fifth; a minimal code sketch follows):
1. Run selection model: fit tree T = f(X)
2. Present the resulting tree; discover unbalanced X's
3. Treat each terminal node as a sub-sample for measuring Y; conduct terminal-node-level performance analysis
4. Present the terminal-node analyses visually
5. [optional] Combine analyses from nodes with homogeneous effects
Like PS, assumes observable self-selection.
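A minimal end-to-end sketch of these steps in Python, assuming a DataFrame `df` with binary intervention `T`, outcome `Y`, and covariates `X_cols` (hypothetical names). Note that the paper itself uses conditional-inference trees in R (party/partykit); sklearn's CART tree below is only a convenient stand-in.

```python
# Minimal sketch of the tree-based approach (CART stand-in for ctree).
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
from sklearn.tree import DecisionTreeClassifier

def tree_based_effects(df, X_cols, min_leaf=50):
    # Step 1: selection model T = f(X), fit as a classification tree
    tree = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0)
    tree.fit(df[X_cols], df["T"])

    # Step 2: terminal nodes define covariate profiles; records in a leaf
    # share the same propensity score P(T=1|X)
    leaves = df.assign(leaf=tree.apply(df[X_cols]))

    # Step 3: node-level performance analysis (effect of T on Y per leaf)
    rows = []
    for leaf, grp in leaves.groupby("leaf"):
        y1 = grp.loc[grp["T"] == 1, "Y"]
        y0 = grp.loc[grp["T"] == 0, "Y"]
        if len(y1) > 1 and len(y0) > 1:
            _, p = ttest_ind(y1, y0, equal_var=False)
            rows.append({"leaf": leaf, "n": len(grp),
                         "effect": y1.mean() - y0.mean(), "p": p})
    res = pd.DataFrame(rows)

    # Optional step 5: weighted-average effect across homogeneous nodes
    overall = np.average(res["effect"], weights=res["n"])
    return res, overall
```

Step 4 (visual presentation) would plot the per-node results; on the LaLonde data, this is where the high-school-degree split and its heterogeneous effect surface.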
Tree on LaLonde's RCT data
If groups are completely balanced, we expect…
Y = Earnings in 1978
T = Received NSW training (T = 1) or not (T = 0)
X = Demographic information and prior earnings
Tree reveals…
[Tree figure: a single split on High school degree (no/yes)]

                      LaLonde's naïve         Tree approach:       Tree approach:
                      approach (experiment)   HS dropout (n=348)   HS degree (n=97)
Not trained (n=260)   $4,554                  $4,495               $4,855
Trained (n=185)       $6,349                  $5,649               $8,047
Training effect       $1,794 (p=0.004)        $1,154 (p=0.063)     $3,192 (p=0.015)
                                              Overall: $1,598 (p=0.017)

The tree reveals:
1. An unbalanced variable (HS degree)
2. A heterogeneous effect
Tree for obs control group reveals…
[Tree figure annotations: unemployed prior to training in 1974 (u74 = 0) → negative effect; an outlier; an eligibility issue: some profiles are rare in the trained group but common in the control group]

The tree reveals:
1. Unbalanced variables
2. A heterogeneous effect in u74
3. An outlier
4. An eligibility issue
Solves challenges of PS in Big Data
1. Matching leads to severe data loss
2. PS methods suffer from “data dredging”
3. No variable selection (cannot identify variables that
drive the selection)
4. Assumes constant intervention effect
5. Sequential nature is computationally costly
6. Logistic model requires researcher to specify exact
form of selection model
Why Trees in an Explanatory Study?
• Flexible non-parametric selection model (f)
• Automated detection of unbalanced pre-intervention variables (X)
• Easy to interpret, transparent, visual
• Applicable to binary, polytomous, and continuous interventions (T)
• Useful in a Big Data context
• Identifies heterogeneous effects (effect of T on Y)
Study 2: Impact of eGov Initiative (India)
Survey commissioned by the Govt of India in 2006:
• >9,500 individuals who used passport services
• Representative sample of 13 Passport Offices
• "Quasi-experimental, non-equivalent groups design"
• Equal number of offline and online users, matched by geography and demographics
Current Practice: assess impact by comparing online/offline performance stats.
Awareness of electronic services provided by Government of India
[Figure: node-level outcomes by awareness: % bribe RPO, % use agent, % prefer online, % bribe police]
The tree reveals:
1. Demographics properly balanced
2. An unbalanced variable (Aware)
3. Heterogeneous effects on various Y's, and even a Simpson's paradox (a toy numeric illustration follows)
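As a toy numeric illustration of such a paradox (made-up numbers, not the survey's actual figures): when online users are concentrated in the low-awareness stratum, online can look worse in the aggregate even though it is better within every stratum.

```python
# Toy Simpson's-paradox illustration with made-up numbers.
import pandas as pd

df = pd.DataFrame({
    "aware":      ["yes", "yes", "no", "no"],
    "online":     [1, 0, 1, 0],
    "bribe_rate": [0.10, 0.15, 0.30, 0.35],   # online lower in each stratum
    "n":          [100, 900, 900, 100],       # online mostly in the unaware group
})
agg = df.assign(bribes=df.bribe_rate * df.n).groupby("online")[["bribes", "n"]].sum()
print(agg.bribes / agg.n)   # offline 0.17, online 0.28: the aggregate reverses
```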
Would we detect this with PSM?
[Figure: PSM applied to "Awareness of electronic services provided by Government of India"]
[Figure: heterogeneous effect]
Scaling Up to Big Data
• We inflated the eGov dataset by bootstrap
• Up to 9,000,000 records and 360 variables
• 10 runs for each configuration: runtime for the tree ~20 sec
(A sketch of this setup follows.)
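A minimal sketch of that inflation experiment, assuming the survey sits in a DataFrame `egov` with intervention column `T` (hypothetical names); the paper's timings come from conditional-inference trees in R, so this only mirrors the setup.

```python
# Sketch: bootstrap-inflate the dataset and time one tree fit.
import time
from sklearn.tree import DecisionTreeClassifier

def inflate_and_time(egov, X_cols, n_target=9_000_000):
    big = egov.sample(n=n_target, replace=True, random_state=0)  # bootstrap rows
    start = time.perf_counter()
    DecisionTreeClassifier(min_samples_leaf=500).fit(big[X_cols], big["T"])
    return time.perf_counter() - start
```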
Big Data Simulation

Common settings:
  Sample sizes (n): 10K, 100K, 1M
  # Pre-intervention variables (p): 4, 50 (+ interactions)
  Pre-intervention variable types: binary, Likert-scale, continuous
  Outcome variable types: binary, continuous
  Selection models:
    #1: logit(P(T=1)) = b0 + b1·x1 + … + bp·xp
    #2: logit(P(T=1)) = b0 + b1·x1 + … + bp·xp + interactions

Intervention effects:
  Binary intervention, T ∈ {0, 1}:
    1. Homogeneous:   E(Y|T=0) = 0.5; E(Y|T=1) = 0.7
    2. Heterogeneous: E(Y|T=0) = 0.5; E(Y|T=1, X1=0) = 0.7; E(Y|T=1, X1=1) = 0.3
  Continuous intervention, T ~ N:
    1. Homogeneous:   E(Y|T=0) = 0; E(Y|T=1) = 1
    2. Heterogeneous: E(Y|T=0) = 0; E(Y|T=1, X1=0) = 1; E(Y|T=1, X1=1) = −1

(A data-generation sketch for setting #1 follows.)
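A data-generation sketch for setting #1 (binary T, binary Y, homogeneous effect); the coefficients below are illustrative assumptions, not the paper's values.

```python
# Sketch: simulate selection model #1 with a homogeneous binary effect.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100_000, 4
X = rng.binomial(1, 0.5, size=(n, p))                  # binary pre-intervention vars
b0, b = -0.5, rng.normal(0, 1, p)                      # illustrative coefficients
T = rng.binomial(1, 1 / (1 + np.exp(-(b0 + X @ b))))   # logit selection model
Y = rng.binomial(1, np.where(T == 1, 0.7, 0.5))        # E(Y|T=0)=0.5, E(Y|T=1)=0.7
```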
Results for selection model #1, logit(P(T=1|X)) = b0 + b1·X1 + … + bp·Xp
[Figure: comparison of the tree approach against PSS (5 bins)]
Big Data Scalability
Theoretical complexity:
• O(mn/p) for binary X
• O((m/p)·n·log(n)) for continuous X
[Figure: runtime as a function of sample size and dimension]
Scaling Trees Even Further
• “Big Data” in research vs. industry
• Industrial scaling
– Sequential trees: efficient data structure, access
(SPRINT, SLIQ, RainForest)
– Parallel computing (parallel SPRINT, ScalParC,
SPARK, PLANET) “as long as split metric can be
computed on subsets of the training data and
later aggregated, PLANET can be easily extended”
[Figure: heterogeneous effect under a continuous intervention; tree with 16 nodes]
Tree Approach Benefits
1. Data-driven selection model
2. Scales up to Big Data
3. Fewer user choices (less data dredging)
4. Nuanced insights
   • Detect unbalanced variables
   • Detect heterogeneous effects and deviations from anticipated outcomes
5. Simple to communicate
6. Automatic variable selection
7. Missing values do not force records to be dropped
8. Binary, polytomous, and continuous interventions
9. Post-analysis of RCTs, quasi-experiments & observational studies
Tree Approach Limits
1. Assumes selection on observables
2. Need sufficient data
3. Continuous variables can lead to large trees
4. Instability
[possible solution: use variable importance scores (forest)]
Insights from the tree approach in the three applications
• Labor (LaLonde '86): heterogeneous effect; the impact of training depends on having a high school diploma
• eGov: heterogeneous effect; the impact of the online system depends on user awareness
• Price mechanism: heterogeneous effect; fixed price creates long-term market value (not productivity), but only in high-trust contracts
• Contract duration: first attempt to study the effect of duration on contract performance
Yahav, I., Shmueli, G., and Mani, D. (2016). "A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Big Data," MIS Quarterly, Vol. 40, No. 4, pp. 819-848.
Impact of IT Outsourcing Contract Attributes
How does the financial performance of outsourcing contracts vary with two attributes of the contract?
• Pricing mechanism (6 options)
• Contract duration (continuous)
Observational Data
• >1,400 contracts, implemented 1996-2008
• 374 vendors and 710 clients
• Obtained from the IDC database, Lexis-Nexis, COMPUSTAT, etc.
T = Six Pricing Mechanisms (polytomous intervention)
Interventions (T):
1. Fixed Price
2. Transactional Price
3. Time-and-Materials
4. Incentive
5. Combination
6. Joint Venture
[In the slide, these are grouped into Fixed Price vs. Variable Price]
Pre-Intervention Variables (X): Task Type, Bid Type, Contract Value, Uncertainty in business requirements, Outsourcing Experience, Firm Size (market value of equity)
Outcomes (Y): Announcement Returns, Long Term Returns, Median Income Efficiency
Six Pricing Mechanisms (polytomous intervention), defined:
1. Fixed Price: fixed payment per billing cycle
2. Transactional Price: fixed payment per transaction per billing cycle
3. Time-and-Materials: payment based on input time and materials used during the billing cycle
4. Incentive: payment based on output improvements against key performance indicators, or any combination of indicators
5. Combination: a combination of any of the above contract types, largely fixed price and time-and-materials
6. Joint Venture: a separately incorporated entity, jointly owned by the client and the vendor, used to govern the outsourcing relationship
[In the slide, these are grouped into Fixed Price vs. Variable Price]
Six Price Mechanisms
Questions of interest:
1) Do all simple outsourcing engagements, governed by fixed or transactional price contracts, create value?
2) What types of complex outsourcing engagements create value for the client?
3) How do firms mitigate risks inherent to these engagements?
Impact Measures (Y)

Announcement Returns
Firm-specific daily abnormal returns (ε_it, for firm i on day t):
• Computed as ε_it = r_it − r̂_it, where r̂_it is the daily return predicted from the market model r_it = α_i + β_i·r_mt + ε_it, with r_mt the daily return on the value-weighted S&P 500.
• The model is used to predict daily returns for each firm over the announcement period [−5, +5].

Long Term Returns
Monthly abnormal returns:
• Estimated from the Fama-French three-factor model as the excess over returns achieved by passive investments in systematic risk factors.
• Expected to be zero under the null hypothesis of market efficiency.
• Used to estimate the implied three-year abnormal return following the contract.

Median Income Efficiency
Income efficiency is estimated as earnings before interest and taxes divided by the number of employees.
• Median income efficiency over the three-year period following contract implementation.
(A minimal sketch of the announcement-return computation follows.)
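A minimal sketch of the market-model abnormal-return computation, assuming numpy arrays of daily firm and market returns over an estimation window plus the 11-day announcement window [−5, +5]; the array names are hypothetical.

```python
# Sketch: market-model abnormal returns over the announcement window.
import numpy as np

def abnormal_returns(r_i_est, r_m_est, r_i_ann, r_m_ann):
    # OLS fit of the market model r_it = alpha_i + beta_i * r_mt + eps_it
    beta_i, alpha_i = np.polyfit(r_m_est, r_i_est, deg=1)
    # Abnormal return eps_it = r_it - r_hat_it over [-5, +5]
    return r_i_ann - (alpha_i + beta_i * r_m_ann)
```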
Six Pricing Methodologies: Selection Model
[Tree figure: splits separate large custom IT tasks (complex) from BPO + simple tasks, with a high-trust branch]
Node-Level Performance Analysis
[Figure: combination contracts create value for complex engagements (custom IT: complex, high trust)]
Contract Duration (Continuous Intervention)
T = Contract duration (months)
Pre-Intervention Variables (X): Task Type, Bid Type, Contract Value, Uncertainty in business requirements, Outsourcing Experience, Firm Size (market value of equity)
Outcomes (Y): Announcement Returns, Long Term Returns, Median Income Efficiency
Finding: contract duration has no impact on performance gains from outsourcing.
Contract Duration: Selection Model (regression tree)
[Tree figure: duration is proportional to contract value (nodes 1, 6, 8 vs. 4, 5)]
Node-Level Performance Analysis (regression)
[Figure: markets reward long-term contracts for high-value contracts and for low-value, minimal-scope contracts; returns are negative for contracts requiring specific or non-contractible investments (costs outweigh benefits)]
Price Methodologies: Main Insight
Prior research:
• Role of trust only in complex contracts
• Fixed price known to create value; unrelated to trust
Tree finding: fixed price creates long-term market value (not productivity), but only in high-trust contracts!
Editor's Notes
• #5: Experimental causes must be manipulable (Shadish, Cook & Campbell, p. 8).
• #24: http://sekhon.berkeley.edu/matching/lalonde.html
• #54-#55: Joint ventures are equity arrangements, so they usually don't come within the ambit of contractual agreements. Positive long-term returns for Transactional: the market underestimates the value created by transactional pricing contracts. Positive income-efficiency gains for Fixed Price, Incentive, and Combination, but these are not impounded in the market. Questions that arise: Do all simple outsourcing engagements, governed by fixed or transactional price contracts, create value? What types of complex outsourcing engagements create value for the client? How do firms mitigate risks inherent to these engagements?
• #59: Fixed and Transactional: positive median income efficiency but zero announcement returns, i.e., the market underestimates the value created by these contracts when they are announced. Nodes 1 and 4: Incentive; no announcement-return data for node 4.
• #60: Novelty: the literature has not investigated contract duration as a self-selected intervention, and there was no tool for a continuous intervention.
• #62: Announcement returns. Positive: markets reward long-term contracts for high-value contracts (node 5) and low-value custom-IT (minimal-scope) contracts (node 8). Negative: in long-term IT contracts that require specific or non-contractible investments (nodes 3, 10), as the scope of the engagement increases, the costs outweigh the benefits.