A Tree-Based Approach for Addressing Self-selection in Impact Studies with Big Data
Inbal Yahav, Galit Shmueli, Deepa Mani
Bar Ilan University (Israel); Indian School of Business (India)
Talk at HKUST Business School, Dept. of ISBSOM, May 16, 2017
PART A (BACKGROUND): EXPERIMENTS (& FRIENDS), RANDOMIZATION, AND CAUSAL INFERENCE
PART B: DEALING WITH SELF SELECTION (FOR CAUSAL INFERENCE)
PART C: OUR NEW TREE APPROACH (AN ALTERNATIVE TO PSM)
PART A (BACKGROUND):
EXPERIMENTS (& FRIENDS),
RANDOMIZATION, AND CAUSAL INFERENCE
Experimental Studies
• Goal: Causal inference
• Effects of causes (causal description) vs.
causes of effects (causal explanation)
• Manipulable cause
Randomization vs. self-selection
• RCT: manipulation + random assignment
• Quasi-experiment: manipulation + self-selection (or administrator selection)
Alternative explanations
• Random assignment → balanced groups
• Confound (third variable)
• Counterfactual (always?)
Experiments & Variations
• Randomized experiment (RCT), natural experiment,
quasi-experiment
• Lab vs. field experiments
Validity
External validity: generalization
Internal validity: alternative explanations,
heterogeneous treatment effect
PART B:
DEALING WITH SELF SELECTION
(FOR CAUSAL INFERENCE)
Self-selection: the challenge
• Large impact studies of an intervention
• Individuals/firms self-select intervention group/duration
• Even in RCT, some variables might remain unbalanced
How to identify and adjust for self-selection?
Three Applications
1. Impact of training on earnings (RCT): field experiment by the US govt
   • LaLonde (1986) compared to observational control
   • Re-analysis by PSM (Dehejia & Wahba, 1999, 2002)
2. Impact of e-Gov service in India (quasi-experiment): new online passport service
   • Survey of online + offline users
   • Bribes, travel time, etc.
3. Impact of outsourcing contract features on financial performance (observational)
   • Pricing mechanism
   • Contract duration
Common Approaches
• Heckman-type modeling
• Propensity Score Approach (Rubin & Rosenbaum)
Two steps:
1. Selection model: T = f(X)
2. Performance analysis on matched samples
Y = performance measure(s)
T = intervention
X = pre-intervention variables
Propensity Scores Approach
Y = performance measure(s); T = intervention; X = pre-intervention variables
Self-selection: P(T|X) ≠ P(T)
Step 1: Estimate selection model logit(T) = f(X) to compute propensity scores P(T|X)
Step 2: Use the scores to create matched samples
   PSM = use a matching algorithm
   PSS = divide the scores into bins
Step 3: Estimate the effect on Y (compare groups), e.g., a t-test or Y = b0 + b1·T + b2·X + b3·PS + e
(A minimal code sketch of these three steps follows.)
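To make the steps concrete, here is a minimal Python sketch of the generic propensity-score workflow; it is an illustration, not the exact procedure of any study cited here, and the column names (`T`, `Y`, `X_cols`) are hypothetical.

```python
# Minimal sketch of the three propensity-score steps (hypothetical column names).
import pandas as pd
import statsmodels.api as sm
from sklearn.neighbors import NearestNeighbors

def ps_effect(df, X_cols):
    # Step 1: logistic selection model -> propensity scores P(T=1|X)
    X = sm.add_constant(df[X_cols])
    fit = sm.Logit(df["T"], X).fit(disp=0)
    ps = pd.Series(fit.predict(X), index=df.index)

    # Step 2 (PSM): match each treated unit to its nearest control on the score
    treated = df[df["T"] == 1]
    control = df[df["T"] == 0]
    nn = NearestNeighbors(n_neighbors=1)
    nn.fit(ps.loc[control.index].to_numpy().reshape(-1, 1))
    _, idx = nn.kneighbors(ps.loc[treated.index].to_numpy().reshape(-1, 1))
    matched = control.iloc[idx.ravel()]
    # (The PSS variant would instead bin the scores, e.g., pd.qcut(ps, 5).)

    # Step 3: compare outcomes on the matched samples (difference in means)
    return treated["Y"].mean() - matched["Y"].mean()
```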
The Idea of PSM: Balancing
“The propensity score allows one to design and
analyze an observational (nonrandomized) study so
that it mimics some of the particular characteristics of
a randomized controlled trial. In particular, the
propensity score is a balancing score: conditional on
the propensity score, the distribution of observed
baseline covariates will be similar between treated and
untreated subjects.”
Study 1: Impact of training on financial gains
(LaLonde 1986)
Experiment: US govt program randomly assigns
eligible candidates to training program
• Goal: increase future earnings
• LaLonde (1986) shows:
Groups statistically equal in terms of demographics & pre-training earnings
→ ATE = $1,794 (p < 0.004)
Training effect:
Observational control group
LaLonde also compared with observational
control groups (PSID, CPS)
– experimental training group + obs control group
– shows training effect not estimated correctly with
structural equations
PSID = Panel Study of Income Dynamics
CPS = Westat’s Matched Current Population Survey (Social Security Administration)
From LaLonde's (1986) abstract: "This paper compares the effect on trainee earnings of an employment program that was run as a field experiment, where participants were randomly assigned to treatment and control groups, with the estimates that would have been produced by an econometrician. This comparison shows that many of the econometric procedures do not replicate the experimentally determined results, and it suggests that researchers should be aware of the potential for specification errors in other nonexperimental evaluations."
Table 4. Summary Statistics of Datasets Used by Dehejia and Wahba (1999)
(Averages with standard deviations in parentheses, computed directly from the
datasets at http://sekhon.berkeley.edu/matching/lalonde.html.)

Characteristic (variable)                        NSW Treatment    NSW Control    CPS Control
Age (age)                                        25.82 (7.16)     25.05 (7.06)   33.22 (11.05)
Years of schooling (educ)                        10.35 (2.01)     10.09 (1.61)   12.03 (2.87)
Proportion of Blacks (black)                     0.84 (0.36)      0.83 (0.38)    0.07 (0.26)
Proportion of Hispanics (hisp)                   0.06 (0.24)      0.11 (0.31)    0.07 (0.26)
Proportion married (married)                     0.19 (0.39)      0.15 (0.36)    0.71 (0.45)
Proportion of high school dropouts (nodegr)      0.71 (0.46)      0.83 (0.37)    0.30 (0.46)
Real earnings 24 months before training (re74)   2,096 (4,887)    2,107 (5,688)  14,024 (9,578.99)
Real earnings 12 months before training (re75)   1,532 (3,219)    1,267 (3,103)  13,642 (9,260)
Proportion of nonworkers in 1974 (u74)           0.71 (0.46)      0.75 (0.43)    0.88 (0.32)
Proportion of nonworkers in 1975 (u75)           0.60 (0.49)      0.68 (0.47)    0.89 (0.31)
Outcome: real earnings in 1978 (re78)            6,349 (7,867)    4,555 (5,484)  14,855 (9,658)
Sample size                                      185              260            15,991

Table 5. Training Effect in NSW Experiment: Comparison between Approaches
(Based on the DW99 sample. Tree-based results are split by presence/absence of a
high school degree. The overall tree-approach training effect is computed as a
weighted average of the HS-degree and HS-dropout nodes, computed for comparison
only; due to the …)
PSM:
Observational control group
Dehejia & Wahba (1999,2002) re-analyzed CPS
control group (n=15,991), using PSM
– Effects in the range $1,122 to $1,681, depending on settings
– "Best" setting effect: $1,360
– PSM uses only 119 control group members
How did Dehejia & Wahba use PSM?
D&W obtained training effects in the range $1,122 to $1,681 under
different PSM settings and several matching schemes:
• Subset selection with/without replacement, combined with low-to-
high/high-to-low/random/nearest-neighbor (NN)/caliper matching.
• DW02 show that selection with replacement followed by NN
matching best captures the effect of the training program. However,
other matching schemes often yield poor performance, such as a
negative training effect.
• The overall training effect under their best settings (they can
compare to the actual experimental results!) is reported as $1,360.
• Note: This setting leads to dropping most of the 15,991 control
records, with only 119 records remaining in the control group.
Bottom line:
The researcher must make choices about settings and
parameters/bins; different choices can lead to different results.
PART C:
OUR NEW TREE APPROACH
(AN ALTERNATIVE TO PSM)
MIS Quarterly, Vol. 40, No. 4, pp. 819-848, December 2016
Challenges of PS in Big Data
1. Matching leads to severe data loss
2. PS methods suffer from “data dredging”
3. No variable selection (cannot identify variables that
drive the selection)
4. Assumes constant intervention effect
5. Sequential nature is computationally costly
6. Logistic model requires researcher to specify exact
form of selection model
Proposed Solution: Tree-Based Approach
"Kill the intermediary": instead of going from (Y, T, X) to propensity scores P(T|X) and only then to E(Y|T), go directly from (Y, T, X) to E(Y|T), and even E(Y|T,X).
Classification Tree
Output: T (treat/control)
Inputs: X’s (income, edu, family…)
Records in each terminal node share same
profile (X) and same propensity score P(T=1| X)
Tree Creation
Which algorithm? Conditional-inference trees (Hothorn et al., 2006):
– Stop tree growth using statistical tests of independence (a simplified sketch follows)
– Binary splits
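The flavor of that stopping rule can be sketched as follows. This is a simplified stand-in (chi-square tests with a Bonferroni correction) for the permutation-test framework of Hothorn et al., assuming a binary T and categorical candidate splitters:

```python
# Simplified sketch of a test-based stop/split decision (not the full ctree).
import pandas as pd
from scipy.stats import chi2_contingency

def split_or_stop(df, candidates, alpha=0.05, T_col="T"):
    # p-value of an independence test between T and each candidate splitter
    pvals = {x: chi2_contingency(pd.crosstab(df[x], df[T_col]))[1]
             for x in candidates}
    best = min(pvals, key=pvals.get)
    # Bonferroni adjustment across the candidate variables
    if min(1.0, pvals[best] * len(candidates)) > alpha:
        return None   # stop: no X is significantly associated with T
    return best       # otherwise, split on the most associated X
```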
Tree-Based Approach
Four steps (plus an optional fifth; a minimal code sketch follows):
1. Run selection model: fit tree T = f(X)
2. Present the resulting tree; discover unbalanced X's
3. Treat each terminal node as a sub-sample for measuring Y; conduct terminal-node-level performance analysis
4. Present the terminal-node analyses visually
5. [optional] Combine analyses from nodes with homogeneous effects
Like PS, assumes observable self-selection.
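A minimal end-to-end sketch of these steps in Python, assuming a DataFrame `df` with binary intervention `T`, outcome `Y`, and covariates `X_cols` (hypothetical names). Note that the paper itself uses conditional-inference trees in R (party/partykit); sklearn's CART tree below is only a convenient stand-in.

```python
# Minimal sketch of the tree-based approach (CART stand-in for ctree).
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
from sklearn.tree import DecisionTreeClassifier

def tree_based_effects(df, X_cols, min_leaf=50):
    # Step 1: selection model T = f(X), fit as a classification tree
    tree = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0)
    tree.fit(df[X_cols], df["T"])

    # Step 2: terminal nodes define covariate profiles; records in a leaf
    # share the same propensity score P(T=1|X)
    leaves = df.assign(leaf=tree.apply(df[X_cols]))

    # Step 3: node-level performance analysis (effect of T on Y per leaf)
    rows = []
    for leaf, grp in leaves.groupby("leaf"):
        y1 = grp.loc[grp["T"] == 1, "Y"]
        y0 = grp.loc[grp["T"] == 0, "Y"]
        if len(y1) > 1 and len(y0) > 1:
            _, p = ttest_ind(y1, y0, equal_var=False)
            rows.append({"leaf": leaf, "n": len(grp),
                         "effect": y1.mean() - y0.mean(), "p": p})
    res = pd.DataFrame(rows)

    # Optional step 5: weighted-average effect across homogeneous nodes
    overall = np.average(res["effect"], weights=res["n"])
    return res, overall
```

Step 4 (visual presentation) would plot the per-node results; on the LaLonde data, this is where the high-school-degree split and its heterogeneous effect surface.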
Tree on LaLonde's RCT data
If groups are completely balanced, we expect…
Y = Earnings in 1978
T = Received NSW training (T = 1) or not (T = 0)
X = Demographic information and prior earnings
Tree reveals…
[Tree figure: a single split on High school degree (no/yes)]

                      LaLonde's naïve         Tree approach:       Tree approach:
                      approach (experiment)   HS dropout (n=348)   HS degree (n=97)
Not trained (n=260)   $4,554                  $4,495               $4,855
Trained (n=185)       $6,349                  $5,649               $8,047
Training effect       $1,794 (p=0.004)        $1,154 (p=0.063)     $3,192 (p=0.015)
                                              Overall: $1,598 (p=0.017)

The tree reveals:
1. An unbalanced variable (HS degree)
2. A heterogeneous effect
Tree for obs control group reveals…
[Tree figure annotations: unemployed prior to training in 1974 (u74 = 0) → negative effect; an outlier; an eligibility issue: some profiles are rare in the trained group but common in the control group]

The tree reveals:
1. Unbalanced variables
2. A heterogeneous effect in u74
3. An outlier
4. An eligibility issue
Solves challenges of PS in Big Data
1. Matching leads to severe data loss
2. PS methods suffer from “data dredging”
3. No variable selection (cannot identify variables that
drive the selection)
4. Assumes constant intervention effect
5. Sequential nature is computationally costly
6. Logistic model requires researcher to specify exact
form of selection model
Why Trees in an Explanatory Study?
• Flexible non-parametric selection model (f)
• Automated detection of unbalanced pre-intervention variables (X)
• Easy to interpret, transparent, visual
• Applicable to binary, polytomous, and continuous interventions (T)
• Useful in a Big Data context
• Identifies heterogeneous effects (effect of T on Y)
Study 2: Impact of eGov Initiative (India)
Survey commissioned by the Govt of India in 2006:
• >9,500 individuals who used passport services
• Representative sample of 13 Passport Offices
• "Quasi-experimental, non-equivalent groups design"
• Equal number of offline and online users, matched by geography and demographics
Current Practice: assess impact by comparing online/offline performance stats.
Awareness of electronic services provided by Government of India
[Figure: node-level outcomes by awareness: % bribe RPO, % use agent, % prefer online, % bribe police]
The tree reveals:
1. Demographics properly balanced
2. An unbalanced variable (Aware)
3. Heterogeneous effects on various Y's, and even a Simpson's paradox (a toy numeric illustration follows)
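As a toy numeric illustration of such a paradox (made-up numbers, not the survey's actual figures): when online users are concentrated in the low-awareness stratum, online can look worse in the aggregate even though it is better within every stratum.

```python
# Toy Simpson's-paradox illustration with made-up numbers.
import pandas as pd

df = pd.DataFrame({
    "aware":      ["yes", "yes", "no", "no"],
    "online":     [1, 0, 1, 0],
    "bribe_rate": [0.10, 0.15, 0.30, 0.35],   # online lower in each stratum
    "n":          [100, 900, 900, 100],       # online mostly in the unaware group
})
agg = df.assign(bribes=df.bribe_rate * df.n).groupby("online")[["bribes", "n"]].sum()
print(agg.bribes / agg.n)   # offline 0.17, online 0.28: the aggregate reverses
```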
Would we detect this with PSM?
[Figure: PSM applied to "Awareness of electronic services provided by Government of India"]
[Figure: heterogeneous effect]
Scaling Up to Big Data
• We inflated the eGov dataset by bootstrap
• Up to 9,000,000 records and 360 variables
• 10 runs for each configuration: runtime for the tree ~20 sec
(A sketch of this setup follows.)
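A minimal sketch of that inflation experiment, assuming the survey sits in a DataFrame `egov` with intervention column `T` (hypothetical names); the paper's timings come from conditional-inference trees in R, so this only mirrors the setup.

```python
# Sketch: bootstrap-inflate the dataset and time one tree fit.
import time
from sklearn.tree import DecisionTreeClassifier

def inflate_and_time(egov, X_cols, n_target=9_000_000):
    big = egov.sample(n=n_target, replace=True, random_state=0)  # bootstrap rows
    start = time.perf_counter()
    DecisionTreeClassifier(min_samples_leaf=500).fit(big[X_cols], big["T"])
    return time.perf_counter() - start
```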
Big Data Simulation

Common settings:
  Sample sizes (n): 10K, 100K, 1M
  # Pre-intervention variables (p): 4, 50 (+ interactions)
  Pre-intervention variable types: binary, Likert-scale, continuous
  Outcome variable types: binary, continuous
  Selection models:
    #1: logit(P(T=1)) = b0 + b1·x1 + … + bp·xp
    #2: logit(P(T=1)) = b0 + b1·x1 + … + bp·xp + interactions

Intervention effects:
  Binary intervention, T ∈ {0, 1}:
    1. Homogeneous:   E(Y|T=0) = 0.5; E(Y|T=1) = 0.7
    2. Heterogeneous: E(Y|T=0) = 0.5; E(Y|T=1, X1=0) = 0.7; E(Y|T=1, X1=1) = 0.3
  Continuous intervention, T ~ N:
    1. Homogeneous:   E(Y|T=0) = 0; E(Y|T=1) = 1
    2. Heterogeneous: E(Y|T=0) = 0; E(Y|T=1, X1=0) = 1; E(Y|T=1, X1=1) = −1

(A data-generation sketch for setting #1 follows.)
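A data-generation sketch for setting #1 (binary T, binary Y, homogeneous effect); the coefficients below are illustrative assumptions, not the paper's values.

```python
# Sketch: simulate selection model #1 with a homogeneous binary effect.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100_000, 4
X = rng.binomial(1, 0.5, size=(n, p))                  # binary pre-intervention vars
b0, b = -0.5, rng.normal(0, 1, p)                      # illustrative coefficients
T = rng.binomial(1, 1 / (1 + np.exp(-(b0 + X @ b))))   # logit selection model
Y = rng.binomial(1, np.where(T == 1, 0.7, 0.5))        # E(Y|T=0)=0.5, E(Y|T=1)=0.7
```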
Results for selection model #1, logit(P(T=1|X)) = b0 + b1·X1 + … + bp·Xp
[Figure: comparison of the tree approach against PSS (5 bins)]
Big Data Scalability
Theoretical complexity:
• O(mn/p) for binary X
• O((m/p)·n·log(n)) for continuous X
[Figure: runtime as a function of sample size and dimension]
Scaling Trees Even Further
• “Big Data” in research vs. industry
• Industrial scaling
– Sequential trees: efficient data structure, access
(SPRINT, SLIQ, RainForest)
– Parallel computing (parallel SPRINT, ScalParC,
SPARK, PLANET) “as long as split metric can be
computed on subsets of the training data and
later aggregated, PLANET can be easily extended”
[Figure: heterogeneous effect under a continuous intervention; tree with 16 nodes]
Tree Approach Benefits
1. Data-driven selection model
2. Scales up to Big Data
3. Fewer user choices (less data dredging)
4. Nuanced insights
   • Detect unbalanced variables
   • Detect heterogeneous effects and deviations from anticipated outcomes
5. Simple to communicate
6. Automatic variable selection
7. Missing values do not force records to be dropped
8. Binary, polytomous, and continuous interventions
9. Post-analysis of RCTs, quasi-experiments & observational studies
Tree Approach Limits
1. Assumes selection on observables
2. Need sufficient data
3. Continuous variables can lead to large trees
4. Instability
[possible solution: use variable importance scores (forest)]
Insights from the tree approach in the three applications
• Labor (LaLonde '86): heterogeneous effect; the impact of training depends on having a high school diploma
• eGov: heterogeneous effect; the impact of the online system depends on user awareness
• Price mechanism: heterogeneous effect; fixed price creates long-term market value (not productivity), but only in high-trust contracts
• Contract duration: first attempt to study the effect of duration on contract performance
Yahav, I., Shmueli, G., and Mani, D. (2016). "A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Big Data," MIS Quarterly, Vol. 40, No. 4, pp. 819-848.
Impact of IT Outsourcing Contract Attributes
How does the financial performance of outsourcing contracts vary with two attributes of the contract?
• Pricing mechanism (6 options)
• Contract duration (continuous)
Observational Data
• >1,400 contracts, implemented 1996-2008
• 374 vendors and 710 clients
• Obtained from the IDC database, Lexis-Nexis, COMPUSTAT, etc.
T = Six Pricing Mechanisms (polytomous intervention)
Interventions (T):
1. Fixed Price
2. Transactional Price
3. Time-and-Materials
4. Incentive
5. Combination
6. Joint Venture
[In the slide, these are grouped into Fixed Price vs. Variable Price]
Pre-Intervention Variables (X): Task Type, Bid Type, Contract Value, Uncertainty in business requirements, Outsourcing Experience, Firm Size (market value of equity)
Outcomes (Y): Announcement Returns, Long Term Returns, Median Income Efficiency
Six Pricing Mechanisms (polytomous intervention), defined:
1. Fixed Price: fixed payment per billing cycle
2. Transactional Price: fixed payment per transaction per billing cycle
3. Time-and-Materials: payment based on input time and materials used during the billing cycle
4. Incentive: payment based on output improvements against key performance indicators, or any combination of indicators
5. Combination: a combination of any of the above contract types, largely fixed price and time-and-materials
6. Joint Venture: a separately incorporated entity, jointly owned by the client and the vendor, used to govern the outsourcing relationship
[In the slide, these are grouped into Fixed Price vs. Variable Price]
Six Price Mechanisms
Questions of interest:
1) Do all simple outsourcing engagements, governed by fixed or transactional price contracts, create value?
2) What types of complex outsourcing engagements create value for the client?
3) How do firms mitigate risks inherent to these engagements?
Impact Measures (Y)

Announcement Returns
Firm-specific daily abnormal returns (ε_it, for firm i on day t):
• Computed as ε_it = r_it − r̂_it, where r̂_it is the daily return predicted from the market model r_it = α_i + β_i·r_mt + ε_it, with r_mt the daily return on the value-weighted S&P 500.
• The model is used to predict daily returns for each firm over the announcement period [−5, +5].

Long Term Returns
Monthly abnormal returns:
• Estimated from the Fama-French three-factor model as the excess over returns achieved by passive investments in systematic risk factors.
• Expected to be zero under the null hypothesis of market efficiency.
• Used to estimate the implied three-year abnormal return following the contract.

Median Income Efficiency
Income efficiency is estimated as earnings before interest and taxes divided by the number of employees.
• Median income efficiency over the three-year period following contract implementation.
(A minimal sketch of the announcement-return computation follows.)
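A minimal sketch of the market-model abnormal-return computation, assuming numpy arrays of daily firm and market returns over an estimation window plus the 11-day announcement window [−5, +5]; the array names are hypothetical.

```python
# Sketch: market-model abnormal returns over the announcement window.
import numpy as np

def abnormal_returns(r_i_est, r_m_est, r_i_ann, r_m_ann):
    # OLS fit of the market model r_it = alpha_i + beta_i * r_mt + eps_it
    beta_i, alpha_i = np.polyfit(r_m_est, r_i_est, deg=1)
    # Abnormal return eps_it = r_it - r_hat_it over [-5, +5]
    return r_i_ann - (alpha_i + beta_i * r_m_ann)
```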
Six Pricing Methodologies: Selection Model
[Tree figure: splits separate large custom IT tasks (complex) from BPO + simple tasks, with a high-trust branch]
Node-Level Performance Analysis
[Figure: combination contracts create value for complex engagements (custom IT: complex, high trust)]
Contract Duration (Continuous Intervention)
T = Contract duration (months)
Pre-Intervention Variables (X): Task Type, Bid Type, Contract Value, Uncertainty in business requirements, Outsourcing Experience, Firm Size (market value of equity)
Outcomes (Y): Announcement Returns, Long Term Returns, Median Income Efficiency
Finding: contract duration has no impact on performance gains from outsourcing.
Contract Duration: Selection Model (regression tree)
[Tree figure: duration is proportional to contract value (nodes 1, 6, 8 vs. 4, 5)]
Node-Level Performance Analysis (regression)
[Figure: markets reward long-term contracts for high-value contracts and for low-value, minimal-scope contracts; returns are negative for contracts requiring specific or non-contractible investments (costs outweigh benefits)]
Price Methodologies: Main Insight
Prior research:
• Role of trust only in complex contracts
• Fixed price known to create value; unrelated to trust
Tree finding: fixed price creates long-term market value (not productivity), but only in high-trust contracts!
Editor's Notes
• #5: Experimental causes must be manipulable (Shadish, Cook & Campbell, p. 8).
• #24: http://sekhon.berkeley.edu/matching/lalonde.html
• #54-#55: Joint ventures are equity arrangements, so they usually don't come within the ambit of contractual agreements. Positive long-term returns for Transactional: the market underestimates the value created by transactional pricing contracts. Positive income-efficiency gains for Fixed Price, Incentive, and Combination, but these are not impounded in the market. Questions that arise: Do all simple outsourcing engagements, governed by fixed or transactional price contracts, create value? What types of complex outsourcing engagements create value for the client? How do firms mitigate risks inherent to these engagements?
• #59: Fixed and Transactional: positive median income efficiency but zero announcement returns, i.e., the market underestimates the value created by these contracts when they are announced. Nodes 1 and 4: Incentive; no announcement-return data for node 4.
• #60: Novelty: the literature has not investigated contract duration as a self-selected intervention, and there was no tool for a continuous intervention.
• #62: Announcement returns. Positive: markets reward long-term contracts for high-value contracts (node 5) and low-value custom-IT (minimal-scope) contracts (node 8). Negative: in long-term IT contracts that require specific or non-contractible investments (nodes 3, 10), as the scope of the engagement increases, the costs outweigh the benefits.