The Data Errors we Make by Sean Taylor at Big Data Spain 2017

The Data Errors We Make
Sean J Taylor
Core Data Science Team
Facebook

About Me
• 5 years at Facebook as a Research Scientist
• PhD in Information Systems from New York University
• Research Interests:
  • Field Experiments
  • Forecasting
  • Sports and sports fans
https://facebook.github.io/prophet/

Strategic Decisions Micro-decisions at Scale

[Diagram: Data, Algorithm, and Human Choices → Estimate → Decision → Outcome, set against Truth → Optimal Decision → Optimal Outcome. Labeled gaps: statistical error and practical error.]

H0: You are not pregnant.
H1: You are pregnant.

                            H0 is True              H1 is True
                            (Product is Bad)        (Product is Good)
Accept Null Hypothesis      Right decision          Type II Error
(Don’t ship product)                                (wrong decision)
Reject Null Hypothesis      Type I Error            Right decision
(Ship Product)              (wrong decision)

Receiver Operating Characteristic (ROC) Curve
tells us Type I and II error rates
[Plot: x-axis Type I error rate, y-axis (1 - Type II error rate)]
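
As a rough illustration of how the ROC axes relate to the two error types, the sketch below sweeps a decision threshold over scored examples and records one (Type I rate, 1 - Type II rate) point per threshold. The function names, scores, and labels are hypothetical and made up for this example; they are not from the talk.

```python
def error_rates(scores, labels, threshold):
    """Return (type_i_rate, type_ii_rate) when H0 is rejected for scores >= threshold.

    labels: True means H1 is true, False means H0 is true.
    """
    false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    false_neg = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    n_h0 = sum(1 for y in labels if not y)
    n_h1 = sum(1 for y in labels if y)
    return false_pos / n_h0, false_neg / n_h1


def roc_points(scores, labels):
    """One (Type I rate, 1 - Type II rate) point per distinct threshold."""
    points = []
    for t in sorted(set(scores), reverse=True):
        type_i, type_ii = error_rates(scores, labels, t)
        points.append((type_i, 1.0 - type_ii))
    return points


# Hypothetical scores and ground truth, invented for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [True, True, False, True, False, False]
points = roc_points(scores, labels)
```

Lowering the threshold moves along the curve: the Type I error rate can only rise while the Type II error rate can only fall.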

Outline
1. Refinements to the Type I/II error model
2. A simple causal model of how we make errors
3. What we can effectively do about errors

Refinement 1:
Assign Costs to Errors
                            H0 is True              H1 is True
                            (Product is Bad)        (Product is Good)
Accept Null Hypothesis      Right decision          Type II Error
                                                    (wrong decision)
Reject Null Hypothesis      Type I Error            Right decision
(Ship Product)              (wrong decision)

Refinement 1:
Assign Costs to Errors
                            H0 is True              H1 is True
                            (Product is Bad)        (Product is Good)
Accept Null Hypothesis      0                       -100
Reject Null Hypothesis      -200                    +100
(Ship Product)

Example 1:
Expected value of a product launch
P(Type I) is 1% and P(Type II) is 20%, with P(good) = .5
P(good) * (100 * .80 + -100 * .20)
+ (1 - P(good)) * (-200 * .01 + 0 * .99)
= (.5 * 60) + (.5 * -2)
= 30 - 1
= 29

Allowing more Type I errors lowers Type II rate.
Optimal choice depends on payoffs and P(H1).

Example 2:
Expected value of a product launch
P(Type I) is 5% and P(Type II) is 7%
P(good) * (100 * .93 + -100 * .07)
+ (1 - P(good)) * (-200 * .05 + 0 * .95)
= (.5 * 86) + (.5 * -10)
= 43 - 5
= 38 > 29
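
The two worked examples can be reproduced with a short calculation. This is a minimal sketch: the payoffs are the ones from the cost table (0, -100, -200, +100), while the function and constant names are assumptions introduced here for illustration.

```python
# Payoffs from the cost table, indexed by decision and truth.
PAYOFF_SHIP_GOOD = 100    # reject H0, product is good (right decision)
PAYOFF_SHIP_BAD = -200    # reject H0, product is bad (Type I error)
PAYOFF_HOLD_GOOD = -100   # accept H0, product is good (Type II error)
PAYOFF_HOLD_BAD = 0       # accept H0, product is bad (right decision)


def expected_value(p_good, p_type_i, p_type_ii):
    """Expected payoff of a launch decision process with the given error rates."""
    value_if_good = PAYOFF_SHIP_GOOD * (1 - p_type_ii) + PAYOFF_HOLD_GOOD * p_type_ii
    value_if_bad = PAYOFF_SHIP_BAD * p_type_i + PAYOFF_HOLD_BAD * (1 - p_type_i)
    return p_good * value_if_good + (1 - p_good) * value_if_bad


ev_strict = expected_value(p_good=0.5, p_type_i=0.01, p_type_ii=0.20)  # about 29
ev_loose = expected_value(p_good=0.5, p_type_i=0.05, p_type_ii=0.07)   # about 38

# With a much lower prior that the product is good, the ranking flips.
ev_strict_low_prior = expected_value(0.1, 0.01, 0.20)  # about 4.2
ev_loose_low_prior = expected_value(0.1, 0.05, 0.07)   # about -0.4
```

The last two lines illustrate the point that the optimal choice depends on P(H1): at P(good) = .1 the stricter threshold wins, whereas at P(good) = .5 the looser one does.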

Refinement 2:
Opportunity Cost
Key Idea: If we devote resources to minimizing Type I
and II errors for one problem, we will have fewer
resources for other problems.
• Few organizations make a single decision; we
usually make many of them.
• Acquiring more data and investing more time in a
problem both have diminishing marginal returns.

Examples of Constraints
• Sample size for online
experiments
• Gathering more data
• Analyst time

Refinement 3:
Mosteller’s Type III Errors
 
Type III error: “correctly rejecting the null hypothesis
for the wrong reason” -- Frederick Mosteller
More clearly: The process you used worked this time,
but is unlikely to continue working in the future.

Good Process vs. Good Outcome
                Good Outcome        Bad Outcome
Good Process    Deserved Success    Bad Break
Bad Process     Dumb Luck           Poetic Justice

Refinement 4:
Kimball’s Type III Errors
 
Type III error: “the error committed by giving the right
answer to the wrong problem” -- Allyn W. Kimball

[Diagram: Data, Algorithm, and Human Choices → Estimate]

Cause 1: Data
• Inadequate data
• Non-representative data
• Measuring the wrong thing

made data: designed to be adequate
found data: adequate if we are fortunate

2014 World Cup
First Facebook Check-ins in Brazil from non-Brazilian users

Bias?
2014 World Cup Check-ins by Country

Common Pattern
• High volume of a cheap, easy-to-measure
“surrogate” (e.g. steps, clicks)
• Surrogate is correlated with the true measurement of
interest (e.g. overall health, purchase intention)
• Key question: sign and magnitude of
“interpretation bias”

Cause 2: Algorithms
• The model/procedure we choose primarily
concerns which side of the bias-variance tradeoff
we'd like to be on.
• Common mistakes are:
  • Using a model that’s too complex for the data.
  • Focusing too much on algorithms instead of
  gathering the right data or ensuring correctness.

Optimizing models
Reducing bias
• Choose a more flexible model.
Reducing variance
• Choose a less flexible model.
• Get more data.

Tree Induction vs. Logistic Regression: A Learning-Curve Analysis
Perlich et al. (2003)
• Logistic regression is better for smaller training sets and tree
induction for larger data sets
• Logistic regression is usually better when the signal-to-noise
ratio is lower

Cause 3: Human choices
Many analysts, one dataset: Making transparent
how variations in analytical choices affect results
(Silberzahn et al. 2017)
• 29 teams involving 61 analysts used the same
dataset to address the same research question
• Are soccer ⚽ referees more likely to give red
cards to dark-skin-toned players than to light-skin-toned
players?

• effect sizes ranged from 0.89 to 2.93 in odds ratio units
• 20 teams (69%) found a statistically significant positive effect
• 9 teams (31%) observed a nonsignificant relationship

Ways Forward
• prevent errors
  • opinionated analysis development
  • test driven data analysis
• be honest about uncertainty
  • estimate uncertainty using the bootstrap
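
One way to read "test driven data analysis" is to write explicit checks on the data before trusting any estimate built from it. The sketch below is only an illustration of that idea: the function `check_experiment_rows`, the field names, and the sample rows are all hypothetical and do not come from the talk.

```python
def check_experiment_rows(rows):
    """Raise ValueError if basic invariants of a (hypothetical) experiment table fail."""
    if not rows:
        raise ValueError("dataset is empty")
    user_ids = [r["user_id"] for r in rows]
    if len(user_ids) != len(set(user_ids)):
        raise ValueError("duplicate user_id: a user was counted more than once")
    for r in rows:
        if r["group"] not in {"control", "treatment"}:
            raise ValueError(f"unknown group: {r['group']!r}")
        if r["clicks"] < 0:
            raise ValueError("negative click count")
    return True


# Hypothetical rows, invented for illustration only.
rows = [
    {"user_id": 1, "group": "control", "clicks": 3},
    {"user_id": 2, "group": "treatment", "clicks": 5},
    {"user_id": 3, "group": "treatment", "clicks": 0},
]
checks_pass = check_experiment_rows(rows)
```

Running checks like these at the top of an analysis makes a broken input fail loudly instead of silently producing a wrong estimate.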

Opinionated Analysis Development 
(by Hilary Parker)

No algorithm in Scikit Learn  
will estimate uncertainty.

The Bootstrap
1. Generate random sub-samples R1, R2, …, R500 from all your data.
2. Compute statistics or estimate model parameters s1, s2, …, s500, one per sub-sample.
3. Get a distribution over the statistic of interest (usually the prediction):
- take mean
- CIs == 2.5% and 97.5% quantiles (95% interval)
- SEs == standard deviation
[Histogram of the 500 estimates: x-axis Statistic, y-axis Count]
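
The bootstrap steps can be sketched in a few lines of standard-library Python. This assumes the usual form of the bootstrap, in which each sub-sample is drawn with replacement and has the same size as the original data; the function names, the seed, and the synthetic data are assumptions made for this example.

```python
import random
import statistics


def bootstrap(data, statistic, n_resamples=500, seed=0):
    """Recompute `statistic` on n_resamples resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]


def summarize(estimates, level=0.95):
    """Mean, standard error, and percentile interval of the bootstrap estimates."""
    ordered = sorted(estimates)
    alpha = (1 - level) / 2
    lower = ordered[int(alpha * (len(ordered) - 1))]
    upper = ordered[int((1 - alpha) * (len(ordered) - 1))]
    return {
        "mean": statistics.mean(ordered),
        "se": statistics.stdev(ordered),
        "ci": (lower, upper),
    }


# Synthetic data, invented for illustration only.
data_rng = random.Random(42)
data = [data_rng.gauss(10, 2) for _ in range(200)]

estimates = bootstrap(data, statistics.mean)
summary = summarize(estimates)
```

Any function of a sample can be passed as `statistic`, including one that fits a model and returns a prediction, which is what makes the approach usable with estimators that report no uncertainty of their own.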

Summary
Think about errors!
• What kind of errors are we making?
• Where did they come from?
Prevent errors!
• Use a reasonable and reproducible
process.
• Test your analysis as you test your code.
Estimate uncertainty!
• Models that estimate uncertainty are more
useful than those that don’t.
• They facilitate better learning and
experimentation.
