CONFIDENTIAL
Mark Landry
October 26, 2016
Top 10 Data Science
Practitioner Pitfalls
1 of 10
Top 10 Data Science Practitioner Pitfalls
Train vs Test
1. Train vs Test
Training Set vs.
Test Set
• Partition the original data (randomly or stratified) into a
training set and a test set. (e.g. 70/30)
• It can be useful to evaluate the training error, but you
should not look at training error alone.
• Training error is not an estimate of generalization error (on
a test set or cross-validated), which is what you should care
more about.
• Tracking training error vs. test error over time is useful: it can tell you when you
start to overfit your model, which makes it a valuable diagnostic in supervised
machine learning (see the sketch after this slide).
Training Error vs.
Test Error
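A minimal sketch of the 70/30 split described above, using pandas and scikit-learn. The file name and the target column "y" are hypothetical placeholders, not part of the original deck.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")               # hypothetical input file
X, y = df.drop(columns="y"), df["y"]       # "y" is the assumed target column

# 70/30 split; stratify=y keeps class proportions similar in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
```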
Source: Elements of Statistical Learning
1. Train vs Test
2 of 10
Top 10 Data Science Practitioner Pitfalls
Validation Set
2. Train vs Test vs Valid
Training Set vs.
Validation Set vs.
Test Set
• If you have “enough” data and plan to do some model
tuning, you should really partition your data into three parts
— Training, Validation and Test sets.
• There is no general rule for how to partition the data; it depends on how strong
the signal in your data is. A common example is 50% Train, 25% Validation and
25% Test (see the sketch after this slide).
• The validation set is used strictly for model tuning (via
validation of models with different parameters) and the test
set is used to make a final estimate of the generalization
error.
Validation is for
Model Tuning
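One way to get the 50/25/25 partition mentioned above is to chain two splits. This is only a sketch and reuses the hypothetical X and y from the previous example.

```python
from sklearn.model_selection import train_test_split

# First hold out 50% for training, then split the remainder 50/50,
# i.e. 25% validation and 25% test of the original data.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.50, stratify=y, random_state=42
)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42
)
# Tune hyperparameters against (X_valid, y_valid); touch (X_test, y_test) only once.
```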
3 of 10
Top 10 Data Science Practitioner Pitfalls
Model Performance
3. Model Performance
Test Error
• Partition the original data (randomly) into a training set
and a test set. (e.g. 70/30)
• Train a model using the training set and evaluate
performance (a single time) on the test set.
• Train & test K models, one per cross-validation fold.
• Average the model
performance over
the K test sets.
• Report cross-
validated metrics.
• Regression: R^2, MSE, RMSE
• Classification: Accuracy, F1, H-measure, Log-loss
• Ranking (Binary Outcome): AUC, Partial AUC
K-fold
Cross-validation
Performance Metrics
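A sketch of k-fold cross-validation with an averaged metric, using scikit-learn. The estimator, k=5 and AUC scoring are illustrative choices, and X, y are the hypothetical frames from the earlier sketch.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

model = GradientBoostingClassifier(random_state=42)

# Trains and evaluates 5 models, one per held-out fold.
auc_scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("Per-fold AUC:", auc_scores)
print("Cross-validated AUC: %.3f +/- %.3f" % (auc_scores.mean(), auc_scores.std()))
```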
4 of 10
Top 10 Data Science Practitioner Pitfalls
Class Imbalance
4. Class Imbalance
Imbalanced
Response Variable
• A dataset is said to be imbalanced when the binomial or
multinomial response variable has one or more classes that
are underrepresented in the training data, with respect to
the other classes.
• This is incredibly common in real-world datasets.
• In practice, balanced datasets are rare unless they have been artificially created.
• There is no precise definition of an imbalanced vs. balanced dataset; the term is vague.
• My rule of thumb for a binary response: if the minority class makes up less than
10% of the data, it can cause issues.
• Advertising — Probability that someone clicks on ad is
very low… very very low.
• Healthcare & Medicine — Certain diseases or adverse medical
conditions are rare.
• Fraud Detection — Insurance or credit fraud is rare.
Very common
Industries
Artificial Balance • You can balance the training set using sampling.
• Notice that we don’t say to balance the test set. The test
set represents the true data distribution. The only way to
get “honest” model performance on your test set is to use
the original, unbalanced, test set.
• The same goes for the hold-out sets in cross-validation.
• H2O has a “balance_classes” argument that can be used to do
this properly & automatically.
• H2O’s GBM includes a “sample rate per class” feature
• You can manually upsample (or downsample) your minority (or majority) class(es),
either by duplicating (or sub-sampling) rows or by using row weights (sketched
after this slide).
• The SMOTE (Synthetic Minority Oversampling Technique)
algorithm generates simulated training examples from the
minority class instead of upsampling.
Potential Pitfalls
Solutions
4. Remedies
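A sketch of two of the remedies above: upsampling the minority class by duplicating rows, and the row-weight alternative. It assumes a hypothetical training frame with a binary target column "y"; the test set is left untouched, as noted above.

```python
import pandas as pd

train = pd.concat([X_train, y_train], axis=1)     # hypothetical training partition
minority = train[train["y"] == 1]
majority = train[train["y"] == 0]

# Upsample by duplicating minority rows until the classes are roughly even.
balanced_train = pd.concat(
    [majority, minority.sample(n=len(majority), replace=True, random_state=42)],
    ignore_index=True,
)

# Alternative: keep the data as-is and pass per-row weights to the learner.
train["weight"] = train["y"].map({1: len(majority) / len(minority), 0: 1.0})
```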
5 of 10
Top 10 Data Science Practitioner Pitfalls
Categorical Data
5. Categorical Data
Real Data • Most real world datasets contain categorical data.
• Problems can arise if you have too many categories.
• A lot of ML software will place limits on the number of
categories allowed in a single column (e.g. 1024) so you may
be forced to deal with this whether you like it or not.
• When there are high-cardinality categorical columns, often
there will be many categories that only occur a small
number of times (not very useful).
• H2O handles categoricals automatically to best fit the algorithm
• GBM and Random Forest handle categorical columns as a single
feature and can separate multiple categories at each split
• GLM and Deep Learning automatically create binary indicator
variables (one-hot encoding)
• Applying hierarchical knowledge about the data may allow the
number of categories to be reduced.
• Example: ICD-9 codes — thousands of unique diagnosis and
procedure codes. You can map each category to a higher level
super-category to reduce the cardinality.
Too Many
Categories
Solutions
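A sketch of reducing cardinality with a hierarchical mapping before one-hot encoding. The ICD-9-like codes and the super-category mapping are made up purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({"diag_code": ["250.01", "250.40", "410.1", "V70.0"]})  # toy codes

def to_super_category(code: str) -> str:
    # Hypothetical mapping from a fine-grained code to a coarse super-category.
    prefix = code.split(".")[0]
    if prefix == "250":
        return "diabetes"
    if prefix == "410":
        return "cardiac"
    return "other"

df["diag_group"] = df["diag_code"].map(to_super_category)

# One-hot encode the (now low-cardinality) grouped column.
encoded = pd.get_dummies(df["diag_group"], prefix="diag")
```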
6 of 10
Top 10 Data Science Practitioner Pitfalls
Missing Data
6. Missing Data
Types of
Missing Data
• Unavailable: Valid for the observation, but not available in
the data set.
• Removed: an observation quality threshold may not have been
met, and the data was removed.
• Not applicable: measurement does not apply to the
particular observation (e.g. number of tires on a boat
observation)
• It depends! Some options:
• Ignore entire observation.
• Create a binary indicator variable for each predictor to flag
whether the value was missing.
• Segment model based on data availability.
• Use alternative algorithm: decision trees accept missing
values; linear models typically do not.
What to Do
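A sketch of the missing-indicator option with pandas; the columns and the median fill strategy are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34, np.nan, 52], "income": [72000, 58000, np.nan]})

# Add a binary "was missing" flag per predictor, then fill the original column.
for col in ["age", "income"]:
    df[col + "_missing"] = df[col].isna().astype(int)
    df[col] = df[col].fillna(df[col].median())
```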
7 of 10
Top 10 Data Science Practitioner Pitfalls
Outliers
7. Outliers/Extreme Values
Types of Outliers • Outliers can exist in response or predictors
• Valid outliers: rare, extreme events
• Invalid outliers: erroneous measurements
• Remove observations.
• Apply a transformation to reduce impact: e.g. log or bins.
• Choose a loss function that is more robust: e.g. MAE vs
MSE.
• Use H2O’s histogram_type “QuantilesGlobal”
• Impose a constraint on data range (cap values).
• Ask questions: Understand whether the values are valid or
invalid, to make the most appropriate choice.
What Can
Happen
• Outlier values can have a disproportionate weight on the
model.
• MSE will focus more on fitting outlier observations in order to
reduce the squared error.
• Boosting will spend considerable modeling effort fitting
these observations.
What to Do
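A sketch of two of the remedies above, a log transform and capping at a quantile; the column name and the 99th-percentile cutoff are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"claim_amount": [120.0, 80.0, 95.0, 250000.0]})  # one extreme value

# Log transform compresses the influence of extreme values.
df["claim_log"] = np.log1p(df["claim_amount"])

# Capping: clip values above the 99th percentile (computed on training data).
upper = df["claim_amount"].quantile(0.99)
df["claim_capped"] = df["claim_amount"].clip(upper=upper)
```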
8 of 10
Top 10 Data Science Practitioner Pitfalls
Data Leakage
8. Data Leakage
What Is It • Leakage is allowing your model to use information that will
not be available in a production setting.
• Simple example: allowing the model to use future index
values when predicting security pricing
• Subtle example: using the overall average star rating for a
business, calculated on the full training set
• The model is overfit.
• It will make predictions in production that are inconsistent with the
scores you saw when fitting the model (even with a validation set).
• Insights derived from the model will be incorrect.
• Understand the nature of your problem and data.
• Scrutinize model feedback, such as variable importances or
coefficient magnitudes.
• Mimic your production environment as closely as possible
in your entire modeling pipeline
What Happens
What to Do
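A sketch of the star-rating example above: the leaky version computes the per-business average over all rows, while the safer version computes it from the training rows only. The toy frame and column names are hypothetical.

```python
import pandas as pd

reviews = pd.DataFrame({
    "business_id": ["a", "a", "b", "b"],
    "stars":       [5, 3, 4, 2],
    "is_test":     [False, False, False, True],   # toy partition flag
})
train = reviews[~reviews["is_test"]].copy()
test = reviews[reviews["is_test"]].copy()

# Leaky: per-business mean computed on ALL rows, including the test period.
leaky_means = reviews.groupby("business_id")["stars"].mean()

# Safer: compute the aggregate on training rows only, then map it onto both
# partitions; unseen businesses fall back to the training-set global mean.
train_means = train.groupby("business_id")["stars"].mean()
global_mean = train["stars"].mean()
train["biz_avg"] = train["business_id"].map(train_means)
test["biz_avg"] = test["business_id"].map(train_means).fillna(global_mean)
```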
9 of 10
Top 10 Data Science Practitioner Pitfalls
Useless Models
9. Useless Models
What is a
“Useless” Model?
• Solving the Wrong Problem.
• Not collecting appropriate data.
• Not structuring data correctly to solve the problem.
• Choosing a target/loss measure that does not optimize the
end use case; e.g. using accuracy to prioritize resources.
• Having a model that is not actionable.
• Using a complicated model that is less accurate than a
simple model.
• Understand the problem statement.
• Solving the wrong problem is an issue in all problem-solving
domains, but it is arguably easier to do with the black-box
techniques common to ML
• Utilize post-processing measures
• Create simple baseline models to understand the lift of more
complex models (see the sketch after this slide)
• Plan on an iterative approach: start quickly, even if on imperfect
data
• Question your models and attempt to understand them
What To Do
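A sketch of the "simple baseline first" idea with scikit-learn's DummyClassifier, reusing the hypothetical X and y from the earlier sketches; any more complex model should be judged by its lift over this number.

```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Predicts from class priors alone, ignoring every feature.
baseline = DummyClassifier(strategy="prior")
baseline_logloss = -cross_val_score(baseline, X, y, cv=5, scoring="neg_log_loss").mean()
print("Baseline log-loss:", baseline_logloss)
```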
10 of 10
Top 10 Data Science Practitioner Pitfalls
No Free Lunch
10. No Free Lunch
No Such Thing as a
Free Lunch
• No general-purpose algorithm solves all problems.
• No right answer on optimal data preparation.
• General heuristics are not always true:
• Tree models solve problems equivalently with any
order-preserving transformation.
• Decision trees and neural networks will automatically
find interactions.
• A high number of predictors may be handled, but can lead
to a worse result than fewer key predictors.
• Model feedback can be misleading: relative influence,
linear coefficients
• Understand how the underlying algorithms operate
• Try several algorithms and observe relative performance and the
characteristics of your data
• Feature engineering & feature selection
• Interpret and react to model feedback
What To Do
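A sketch of "try several algorithms and observe relative performance", again with the hypothetical X and y and an illustrative set of scikit-learn models.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "gbm": GradientBoostingClassifier(random_state=42),
}
for name, model in candidates.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: cross-validated AUC = {auc:.3f}")
```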
Thank You!
Q&A if time permits
