The Data Errors we Make by Sean Taylor at Big Data Spain 2017
The Data Errors We Make
Sean J Taylor
Core Data Science Team
Facebook
About Me
• 5 years at Facebook as a Research Scientist
• PhD in Information Systems from New York University
• Research Interests:
  • Field Experiments
  • Forecasting
  • Sports and sports fans
https://facebook.github.io/prophet/
Strategic Decisions Micro-decisions at Scale
[Diagram]
Data, Algorithm, Human Choices → Estimate → Decision → Outcome
Truth → Optimal Decision → Optimal Outcome
Gaps labeled between the two rows: statistical error, practical error
Simplest Error Model
H0: You are not pregnant.
H1: You are pregnant.
Decision | H0 is True (Product is Bad) | H1 is True (Product is Good)
Accept Null Hypothesis (Don’t ship product) | Right decision | Type II Error (wrong decision)
Reject Null Hypothesis (Ship Product) | Type I Error (wrong decision) | Right decision
Receiver Operating Characteristic (ROC) Curve tells us Type I and II error rates
[Plot: x-axis = Type I error rate; y-axis = 1 - Type II error rate]
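To make the ROC picture concrete, here is a minimal sketch that sweeps a decision threshold and reports both error rates at each point. The scores, labels, and thresholds are hypothetical values invented for illustration, and `error_rates` is a helper name introduced here, not something from the talk.

```python
def error_rates(scores, labels, threshold):
    """Return (type_i_rate, type_ii_rate) when H0 is rejected for scores >= threshold.

    labels: 1 means H1 is true (product is good), 0 means H0 is true (product is bad).
    """
    false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    false_neg = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    n_pos = sum(1 for y in labels if y == 1)
    return false_pos / n_neg, false_neg / n_pos


# Hypothetical model scores and ground truth, made up for illustration.
scores = [0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9, 0.95]
labels = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

# Each threshold gives one point on the ROC curve: (Type I rate, 1 - Type II rate).
roc_points = []
for threshold in [0.0, 0.3, 0.5, 0.75, 1.01]:
    type_i, type_ii = error_rates(scores, labels, threshold)
    roc_points.append((type_i, 1 - type_ii))
    print(f"threshold={threshold:.2f}  Type I={type_i:.2f}  1 - Type II={1 - type_ii:.2f}")
```

Raising the threshold moves along the curve toward fewer Type I errors and more Type II errors; no threshold removes both.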
Outline
1. Refinements to the Type I/II error model
2. A simple causal model of how we make errors
3. What we can effectively do about errors
Refinements
Refinement 1: Assign Costs to Errors
Decision | H0 is True (Product is Bad) | H1 is True (Product is Good)
Accept Null Hypothesis (Don’t ship product) | 0 | -100
Reject Null Hypothesis (Ship Product) | -200 | +100
Example: Expected value of a product launch
P(Type I) is 1% and P(Type II) is 20%, with P(good) = 0.5
  P(good) * (100 * .80 + -100 * .20)
  + (1 - P(good)) * (-200 * .01 + 0 * .99)
= (.5 * 60) + (.5 * -2)
= 30 - 1
= 29
Example 2: Expected value of a product launch
P(Type I) is 5% and P(Type II) is 7%, with P(good) = 0.5
  P(good) * (100 * .93 + -100 * .07)
  + (1 - P(good)) * (-200 * .05 + 0 * .95)
= (.5 * 86) + (.5 * -10)
= 43 - 5
= 38 > 29
Allowing more Type I errors lowers the Type II rate.
The optimal choice depends on the payoffs and P(H1).
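The two expected-value calculations above can be sketched as one small function. This is an illustrative sketch: the function name, the payoff dictionary keys, and the explicit P(good) = 0.5 argument are assumptions introduced here, while the payoffs and error rates are the ones from the slides.

```python
def expected_value(p_good, p_type_i, p_type_ii, payoffs):
    """Expected payoff of a ship / don't-ship rule with the given error rates."""
    # Product is good (H1 true): we ship unless we make a Type II error.
    ev_if_good = (payoffs["ship_good"] * (1 - p_type_ii)
                  + payoffs["hold_good"] * p_type_ii)
    # Product is bad (H0 true): we hold unless we make a Type I error.
    ev_if_bad = (payoffs["ship_bad"] * p_type_i
                 + payoffs["hold_bad"] * (1 - p_type_i))
    return p_good * ev_if_good + (1 - p_good) * ev_if_bad


# Payoff table from Refinement 1 (key names are hypothetical labels).
PAYOFFS = {"ship_good": 100, "hold_good": -100, "ship_bad": -200, "hold_bad": 0}

conservative = expected_value(0.5, p_type_i=0.01, p_type_ii=0.20, payoffs=PAYOFFS)
permissive = expected_value(0.5, p_type_i=0.05, p_type_ii=0.07, payoffs=PAYOFFS)
print(round(conservative, 2), round(permissive, 2))
```

Changing `p_good` or the payoffs shows how the preferred operating point shifts, which is the point of the second example.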
Refinement 2: Opportunity Cost
Key Idea: If we devote resources to minimizing Type I
and II errors for one problem, we will have fewer
resources for other problems.
• Few organizations make a single decision; we usually make many of them.
• Acquiring more data and investing more time in a problem have diminishing marginal returns.
Examples of Constraints
• Sample size for online
experiments
• Gathering more data
• Analyst time
Refinement 3: Mosteller’s Type III Errors
Type III error: “correctly rejecting the null hypothesis
for the wrong reason” -- Frederick Mosteller
More clearly: The process you used worked this time,
but is unlikely to continue working in the future.
Good Process vs. Good Outcome
Process | Good Outcome | Bad Outcome
Good Process | Deserved Success | Bad Break
Bad Process | Dumb Luck | Poetic Justice
Refinement 4: Kimball’s Type III Errors
Type III error: “the error committed by giving the right
answer to the wrong problem” -- Allyn W. Kimball
Why we make errors
Data, Algorithm, Human Choices → Estimate
Cause 1: Data
• Inadequate data
• Non-representative data
• Measuring the wrong thing
made data: designed to be adequate
found data: adequate if we are fortunate
Non-representative data
2014 World Cup
First Facebook Check-ins in Brazil from non-Brazilian users
Bias?
2014 World Cup Check-ins by Country
Measuring the wrong thing
Common Pattern
• High volume of a cheap, easy-to-measure “surrogate” (e.g. steps, clicks)
• Surrogate is correlated with the true measurement of interest (e.g. overall health, purchase intention)
• Key question: sign and magnitude of “interpretation bias”
Cause 2: Algorithms
• The model/procedure we choose primarily concerns which side of the bias-variance tradeoff we'd like to be on.
• Common mistakes are:
  • Using a model that’s too complex for the data.
  • Focusing too much on algorithms instead of on gathering the right data or on correctness.
Optimizing models
Reducing bias
• Choose a more flexible model.
Reducing variance
• Choose a less flexible model.
• Get more data.
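A small simulation can illustrate these levers. It is a sketch under stated assumptions, not the talk's own example: the sine-shaped signal, the noise level, the evaluation point, and the two toy models (a constant mean predictor and a 1-nearest-neighbor predictor) are all hypothetical choices made for illustration.

```python
import math
import random
import statistics


def true_signal(x):
    # Assumed ground truth; any clearly non-constant function would do.
    return math.sin(2 * math.pi * x)


def sample_training_set(n, rng, noise_sd=0.3):
    xs = [rng.random() for _ in range(n)]
    ys = [true_signal(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def rigid_model(xs, ys, x0):
    """Inflexible: ignore x and always predict the average y."""
    return statistics.mean(ys)


def flexible_model(xs, ys, x0):
    """Very flexible: copy the y of the single nearest training point."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]


def bias_and_variance(model, n_train, x0=0.25, n_repeats=300, seed=0):
    """Refit on many fresh training sets and summarize the predictions at x0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_repeats):
        xs, ys = sample_training_set(n_train, rng)
        preds.append(model(xs, ys, x0))
    bias = statistics.mean(preds) - true_signal(x0)
    return bias, statistics.pvariance(preds)


rigid_bias, rigid_var = bias_and_variance(rigid_model, n_train=30)
flex_bias, flex_var = bias_and_variance(flexible_model, n_train=30)
_, rigid_var_more_data = bias_and_variance(rigid_model, n_train=300)

print(f"rigid:    bias={rigid_bias:+.3f}  variance={rigid_var:.4f}")
print(f"flexible: bias={flex_bias:+.3f}  variance={flex_var:.4f}")
print(f"rigid with 10x data: variance={rigid_var_more_data:.4f}")
```

In this setup the flexible model has much smaller bias but larger variance, and giving the rigid model ten times more data shrinks its variance without touching its bias.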
Tree Induction vs. Logistic Regression: A Learning-Curve Analysis (Perlich et al. 2003)
• logistic regression is better for smaller training sets and tree induction for larger data sets
• logistic regression is usually better when the signal-to-noise ratio is lower
Cause 3: Human choices
Many analysts, one dataset: Making transparent how variations in analytical choices affect results (Silberzahn et al. 2017)
• 29 teams involving 61 analysts used the same dataset to address the same research question
• Are soccer ⚽ referees more likely to give red cards to dark-skin-toned players than to light-skin-toned players?
• effect sizes ranged from 0.89 to 2.93 in odds ratio units
• 20 teams (69%) found a statistically significant positive effect
• 9 teams (31%) observed a nonsignificant relationship
Overconfidence
Incentives
Ways Forward
• prevent errors
• opinionated analysis development
• test driven data analysis
• be honest about uncertainty
• estimate uncertainty using the bootstrap
Opinionated Analysis Development (by Hilary Parker)
Test-Driven Data Analysis
Estimating Uncertainty
Most algorithms in scikit-learn do not estimate uncertainty out of the box.
The Bootstrap
All Your Data
1. Generate random resamples (drawn with replacement): R1, R2, …, R500
2. Compute statistics or estimate model parameters on each: s1, s2, …, s500
3. Get a distribution over the statistic of interest (usually the prediction)
[Histogram: x-axis = Statistic, y-axis = Count]
- take mean
- CIs == 2.5% and 97.5% quantiles (a 95% interval)
- SEs == standard deviation
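The bootstrap procedure can be sketched in a few lines of standard-library Python. This is a minimal illustration, assuming resampling with replacement and a simple percentile interval; the data are simulated, and the names `bootstrap` and `summarize` are hypothetical helpers rather than anything from the talk.

```python
import random
import statistics


def bootstrap(data, statistic, n_resamples=500, seed=0):
    """Recompute `statistic` on resamples drawn with replacement from `data`."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]


def summarize(estimates, level=0.95):
    """Mean, standard error, and a crude percentile interval of the estimates."""
    ordered = sorted(estimates)
    lo_idx = int((1 - level) / 2 * (len(ordered) - 1))
    hi_idx = int((1 + level) / 2 * (len(ordered) - 1))
    return {
        "mean": statistics.mean(ordered),
        "se": statistics.stdev(ordered),          # SE = std. dev. of the estimates
        "ci": (ordered[lo_idx], ordered[hi_idx]),  # roughly the 2.5% / 97.5% quantiles
    }


# Simulated stand-in for "All Your Data" (values invented for illustration).
data_rng = random.Random(42)
data = [data_rng.gauss(10, 2) for _ in range(200)]

estimates = bootstrap(data, statistics.mean)  # s1 ... s500
summary = summarize(estimates)
print(summary)
```

Any statistic can be passed in place of `statistics.mean`, including a function that fits a model and returns a prediction, at the cost of refitting once per resample.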
Summary
Think about errors!
• What kind of errors are we making?
• Where did they come from?
Prevent errors!
• Use a reasonable and reproducible
process.
• Test your analysis as you test your code.
Estimate uncertainty!
• Models that estimate uncertainty are more
useful than those that don’t.
• They facilitate better learning and
experimentation.
