Creativity and Curiosity 
THE TRIAL AND ERROR OF DATA SCIENCE
I love data, everything comes easy to me…
There are so many things to try and explore on a given problem. Where to start? 
• Language (Julia, Python, R, C++, etc.) 
• Visualization (ggplot, Tableau, D3, etc.) 
• Pre-process (standardize, variance scaling, feature encoding, etc.) 
• Classifier (GLM, SVM, SGD, k-NN, Random Forest, etc.) 
• Post-process (rule truncation, post-pruning, etc.) 
• Ensemble (weighted average, min, max, probabilities, etc.)
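With this many axes of choice, the search space multiplies quickly. As a rough illustration (the option lists below are made up for the example), the number of candidate pipelines is the product of the options on each axis:

```python
from itertools import product

# Hypothetical option lists; each axis multiplies the search space.
languages   = ["Python", "R", "Julia"]
preprocess  = ["standardize", "variance scaling", "feature encoding"]
classifiers = ["GLM", "SVM", "k-NN", "Random Forest"]
ensembles   = ["weighted average", "min", "max"]

combos = list(product(languages, preprocess, classifiers, ensembles))
print(len(combos))  # 3 * 3 * 4 * 3 = 108 candidate pipelines
```

And that is before tuning any hyperparameters within a single pipeline.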
Where Many Individuals Come To Die… 
(Model Tuning Hell)
Structured Process 
Allows you to remove uncertainty and 
ensure outcomes in a methodical way. 
Gives you an idea of what activities to 
do and when. 
The details of each project vary, 
but the structure should stay 
the same. 
The process is almost never linear; 
you should revisit each step again 
and again. 
Knowledge Discovery Process 
1. Define the goal 
2. Explore the data 
3. Prepare the data 
4. Choosing and evaluating 
models 
5. Ensemble
Define the Goal 
• Why do the sponsors want the project in the first place? 
What do they lack, and what do they need? 
• What are they doing to solve the problem now, and why 
isn’t that good enough? 
• What resources will you need: what kind of data? Do 
you have domain experts to collaborate with, and what 
are the computational resources? 
• How do the project sponsors plan to deploy your 
results? What are the constraints that have to be met 
for successful deployment? 
• Is the data quality good enough?
Define the Goal 
Modeling: 
• Classification 
• Scoring 
• Ranking 
• Clustering 
• Finding relations 
• Characterization 
Model Evaluation and critique 
• Is it accurate enough for your needs? Does it 
generalize well? 
• Does it perform better than “the obvious guess”? 
Better than whatever is currently in use? 
• Do the results of the model (coefficients, clusters, 
rules) make sense in the context of the problem 
domain?
Explore the Data 
Use summary statistics to spot problems 
• Missingness 
• Data ranges (too wide/too 
narrow) 
• Invalid values 
• Outliers 
• Units
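A minimal sketch of putting this into practice, on a made-up column: compute a few summary statistics and look for values that do not make sense.

```python
import statistics

# Toy "age" column (assumed data) with a missing value and a suspicious outlier.
ages = [34, 29, None, 41, 388, 27, 35]

present = [a for a in ages if a is not None]
summary = {
    "n_missing": len(ages) - len(present),
    "min": min(present),
    "max": max(present),  # 388 flags an invalid value or a unit problem
    "median": statistics.median(present),
}
print(summary)
```

Here the max alone is enough to flag either an invalid value or a units mismatch worth chasing down.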
Explore the Data 
Use graphics and visualization to spot problems 
Single-Variable First 
• Peak of distribution? 
• How many peaks? 
• How normal (or lognormal) is the data? 
• How much data variation is there? Is it 
concentrated in a certain interval or 
category? 
• Use histograms, density plots, bar 
charts, and scatter plots with a 
smoothing curve.
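Even without a plotting library, crude unit-width bins reveal the single-variable shape. A sketch with assumed values:

```python
from collections import Counter

# Assumed sample; unit-width bins give a crude text histogram.
values = [1.2, 1.9, 2.1, 2.3, 2.4, 2.6, 3.1, 3.3, 9.8]
bins = Counter(int(v) for v in values)

for b in sorted(bins):
    print(f"{b:>2}: {'#' * bins[b]}")
# One clear peak around 2; the lone value near 10 is worth investigating.
```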
Prepare the Data 
Cleaning Data 
• Treating missing 
values (NAs) 
• Data 
Transformations 
Sampling for Modeling and Validation 
• Test and training splits 
• Creating a sample group column 
• Record grouping
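A test/training split can be sketched in a few lines; this version (assuming row indices as stand-ins for records) fixes the random seed so the split is reproducible:

```python
import random

rng = random.Random(42)          # fixed seed so the split is reproducible

rows = list(range(10))           # stand-in for row indices of a dataset
shuffled = rows[:]
rng.shuffle(shuffled)

cut = int(0.8 * len(shuffled))   # 80/20 train/test split
train, test = shuffled[:cut], shuffled[cut:]
print(len(train), len(test))
```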
Choosing and Evaluating Models 
Mapping problems to machine learning tasks 
(use a problem-to-method mapping) 
• Solving classification problems 
• Naïve Bayes 
• Decision Trees 
• Logistic Regression 
• Solving scoring problems 
• Linear Regression 
• Logistic Regression 
• Working without known targets 
• K-means clustering 
• Apriori algorithm to find association rules 
• Nearest neighbor
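The problem-to-method mapping above can be kept as a simple lookup table; a minimal sketch:

```python
# A minimal problem-to-method lookup mirroring the mapping above.
TASK_METHODS = {
    "classification": ["Naive Bayes", "Decision Trees", "Logistic Regression"],
    "scoring": ["Linear Regression", "Logistic Regression"],
    "no known targets": ["K-means clustering", "Apriori", "Nearest neighbor"],
}

def candidate_methods(task):
    """Return candidate methods for a task, or an empty list if unknown."""
    return TASK_METHODS.get(task, [])

print(candidate_methods("scoring"))
```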
Choosing and Evaluating Models 
Evaluating models 
• Evaluating classification models 
• Confusion matrix 
• Precision 
• Recall 
• Sensitivity 
• Specificity 
• Evaluating scoring models 
• Root Mean Square Error 
• R-squared 
• Correlation 
• Absolute Error
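The classification and scoring metrics above fall straight out of their definitions; a sketch on toy labels (all values assumed):

```python
import math

# Toy labels (assumed); 1 is the positive class.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Confusion-matrix cells.
tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall    = tp / (tp + fn)   # of the true positives, how many were found

# RMSE for a scoring model on toy numeric predictions.
y_true = [3.0, 5.0, 2.5]
y_pred = [2.5, 5.0, 3.5]
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

print(precision, recall, round(rmse, 3))
```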
Choosing and Evaluating Models 
Evaluating models 
• Evaluating probability models 
• Area Under the Curve 
• Log Likelihood 
• Deviance 
• Akaike Information Criterion (AIC) 
• Entropy 
• Evaluating clustering models 
• Intra-cluster distances 
• Cross-cluster distances
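Area Under the Curve has a handy pairwise reading: the probability that a randomly chosen positive is scored above a randomly chosen negative (ties count half). A sketch with assumed scores:

```python
# Toy (score, label) pairs, assumed; label 1 = positive.
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.4, 0), (0.2, 0)]

pos = [s for s, y in scored if y == 1]
neg = [s for s, y in scored if y == 0]

# AUC = P(random positive scored above random negative); ties count half.
pairs = [(p, n) for p in pos for n in neg]
auc = sum((p > n) + 0.5 * (p == n) for p, n in pairs) / len(pairs)
print(auc)  # 8 of 9 pairs are correctly ordered
```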
Choosing and Evaluating Models 
Validating models 
• Identify common model problems 
• Bias – systematic error 
• Variance – oversensitivity of the model 
• Overfit – doesn’t generalize well 
• Nonsignificance – relation may not hold 
• Ensuring model quality 
• Testing on Held-Out Data 
• K-Fold Cross Validation 
• Significance Testing 
• Confidence Intervals
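K-fold cross-validation is, at its core, an index-splitting routine. A minimal version (no shuffling or stratification) that partitions n records into k disjoint test folds:

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    # Spread any remainder across the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
print(len(folds), folds[0][1])  # 5 folds; first test fold is [0, 1]
```

Each record lands in exactly one test fold, so every observation is held out exactly once.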
Ensemble 
How do I bring all my work together? 
• Weighted average 
• Min 
• Max 
• Voting 
• Stacking 
• Neural network
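The simplest of these combiners take only a few lines; a sketch with hypothetical model outputs and assumed weights:

```python
from collections import Counter

# Class-1 probabilities from three hypothetical models, with assumed weights.
model_probs = [0.60, 0.70, 0.20]
weights     = [0.5, 0.3, 0.2]

# Weighted-average ensemble of the probabilities.
weighted = sum(w * p for w, p in zip(weights, model_probs))

# Majority voting over hard class predictions from the same three models.
votes = ["spam", "spam", "ham"]
majority = Counter(votes).most_common(1)[0][0]

print(round(weighted, 2), majority)
```

Stacking goes one step further: the base models' outputs become features for a second-level learner.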
More Ideas 
Learn about ensemble methods, regularization, and principled dimension 
reduction 
• Hastie, Tibshirani, and Friedman’s The Elements of Statistical Learning, 
Second Edition 
• A good choice if you want to understand the consequences of a method; it has a 
math bent 
Keep your saw sharp. Plug in.
Using your creativity and curiosity you can slay mighty data science problems.
@DamianMingle 
http://www.WPC-Services.com 
http://www.DamianMingle.com

Editor's Notes

  • #5: Number of iterations, interaction depth, number of trees, learning rate