Eight Years of Data Science Mistakes
Caitlin Hudon | rstudio::conf
● Lead Data Scientist @ OnlineMedEd
● Co-organizer of R-Ladies Austin
● Traveling, hiking, books, and tacos
caitlinhudon.com | @beeonaposy
#DSLearnings
I’m Caitlin Hudon
Why share mistakes?
Transparency
To help
Growth
Learning from 8 Years of Data Science Mistakes
Mistake Zone #1: Technical / Analysis
Not enough info
Too good to be true
Technical Mistakes I’ve Made While Building Models
Created a non-reproducible data prep step
Evaluated a model based on performance of training set
Didn’t notice large outliers
Dropped missing values when it made sense to flag them
Flagged missing values when it made sense to drop them
Set missing values to zero
Didn’t compare a complex model to a simple baseline
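Two of the mistakes above (scoring a model on its own training set, and skipping the simple baseline) can be illustrated together. The sketch below is not from the talk: the data is synthetic noise, and `fit_memorizer` and `fit_mean_baseline` are hypothetical helpers invented for illustration. It assumes only the Python standard library.

```python
import random
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def fit_memorizer(xs, ys):
    """A deliberately overfit 'model': it memorizes every training row
    and falls back to the training mean for inputs it has never seen."""
    table = dict(zip(xs, ys))
    fallback = mean(ys)
    return lambda x: table.get(x, fallback)


def fit_mean_baseline(ys):
    """The simplest possible baseline: always predict the training mean."""
    fallback = mean(ys)
    return lambda x: fallback


random.seed(0)
xs = list(range(200))
ys = [random.gauss(10, 2) for _ in xs]  # pure noise: x carries no signal

# Hold out 25% of the rows before fitting anything.
shuffled = xs[:]
random.shuffle(shuffled)
train_x, test_x = shuffled[:150], shuffled[150:]
train_y = [ys[i] for i in train_x]
test_y = [ys[i] for i in test_x]

model = fit_memorizer(train_x, train_y)
baseline = fit_mean_baseline(train_y)

train_mae = mean_absolute_error(train_y, [model(x) for x in train_x])
test_mae = mean_absolute_error(test_y, [model(x) for x in test_x])
baseline_test_mae = mean_absolute_error(test_y, [baseline(x) for x in test_x])

print(f"model, training set:  {train_mae:.3f}")
print(f"model, held-out set:  {test_mae:.3f}")
print(f"baseline, held-out:   {baseline_test_mae:.3f}")
```

The training-set score looks perfect because the model has already seen every answer. On held-out rows it does no better than predicting the mean, which is the comparison that would have caught the problem.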
Understanding of data source(s)
Assumptions
Filters
Missing values
Understanding of model goals
#DSLearnings: Avoiding Analysis Mistakes
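As one illustration of why missing values sit on this checklist, the sketch below compares setting missing values to zero with dropping them and with flagging them. The `readings` list is made up for the example, and which option is right depends on why the values are missing.

```python
from statistics import mean

# Hypothetical readings; None marks a value that was never recorded.
readings = [72.0, None, 68.0, 75.0, None, 70.0]

# Mistake: treating "missing" as zero drags the average toward zero.
zero_filled = [r if r is not None else 0.0 for r in readings]

# Option A: drop the missing rows (reasonable when missingness is random).
observed = [r for r in readings if r is not None]

# Option B: keep every row and record missingness explicitly, so a
# downstream model or reader can tell "no value" apart from a real value.
flagged = [{"value": r, "is_missing": r is None} for r in readings]

mean_zero_filled = mean(zero_filled)
mean_observed = mean(observed)
missing_count = sum(1 for row in flagged if row["is_missing"])

print(f"mean with zeros filled in: {mean_zero_filled}")
print(f"mean of observed values:   {mean_observed}")
print(f"rows flagged as missing:   {missing_count} of {len(flagged)}")
```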
Mistake Zone #2:
Communicating with Devs
Everyone has to get bit by time zones once
Lack of data dictionary / resources
Data munging decisions hid errors!
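A minimal sketch of how a time zone bug can hide inside a daily count. It assumes a hypothetical setup in which the application logs local wall-clock time at a fixed UTC-6 offset while reports are grouped by UTC date. The timestamp is invented for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the app server writes local time at a fixed UTC-6 offset.
# A real pipeline would more likely use zoneinfo.ZoneInfo with a named
# zone so that daylight saving changes are handled.
LOCAL = timezone(timedelta(hours=-6))

# What a timestamp column without zone information stores.
naive = datetime(2019, 1, 17, 23, 30)

# Attach the zone the value was actually recorded in, then convert.
aware_local = naive.replace(tzinfo=LOCAL)
as_utc = aware_local.astimezone(timezone.utc)

local_day = aware_local.date().isoformat()
utc_day = as_utc.date().isoformat()

print(f"local timestamp: {aware_local.isoformat()} -> day {local_day}")
print(f"UTC timestamp:   {as_utc.isoformat()} -> day {utc_day}")
```

The same event lands on two different calendar days depending on which clock the query uses, so a daily metric computed by developers and by analysts can disagree without either query being wrong on its own terms.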
#DSLearnings: Working with Devs
Avoid jargon for clarity
Don’t play telephone
Be willing to teach, and to learn (respectfully)
Same team! (Focus on common goals)
Mistake Zone #3:
Communicating with Business Stakeholders
The Rhetorical Triangle
AKA “how to frame your analysis for anyone”
Speaker
Audience
Context
#DSLearnings: Communicating with Business Stakeholders
Get stakeholders involved early
Make sure you understand the business problem
Frame analyses in a way that makes sense
Basics: who, what, how many?
Know where they get their data
Danger Zone #4:
Infrastructure / Team
“The” Algorithm
Pseudocode
Inputs
Outputs
Process
Explanations
… all in plain English
#DSLearnings: Infrastructure / Team
Documentation
-- Data dictionaries
-- SQL query library
Code review
Core team meetings
Pseudocode helps everyone!
Quick Advice from a Mistake-Maker
Advice for aspiring data scientists
Learn SQL, communication is a technical skill,
start a blog, teach others, don't worry about
learning everything at first, find your
community, stay curious, have fun, don't be
afraid to use Google, and remember that
everyone is winging it!
caitlinhudon.com | @beeonaposy | #DSLearnings
