Revolution Confidential

Data Science
Not just for big data!
David Smith
Revolution Analytics
@revodavid
October 16, 2013

Big Data: the new oil?

Photo: Sarah&Boston (flickr: pocheco) Creative Commons BY-SA 2.0

Big Data is just raw material

• Data Distillation
  • Extract quantities of interest
  • Find complete cases
  • Derive missing information

• Big Data Pitfalls:
  • Data cleanliness & accuracy
  • Observational bias
    • Do the data I have represent the population I’m interested in?
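A minimal sketch of two of the distillation steps above, finding complete cases and deriving missing information. The records, field names, and the choice of median imputation are illustrative assumptions, not something the slides specify.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 61000},
    {"id": 3, "age": 45, "income": None},
    {"id": 4, "age": 29, "income": 48000},
]


def complete_cases(rows):
    """Keep only the rows that have no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_median(rows, field):
    """Fill missing values of `field` with the median of the observed values."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = median(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


complete = complete_cases(records)
filled = impute_median(impute_median(records, "age"), "income")

print(len(records), "raw rows,", len(complete), "complete cases")
print(filled)
```

Dropping incomplete rows shrinks the data and can itself introduce bias if values are not missing at random, while imputation keeps every row at the cost of inventing values. Which trade-off is acceptable depends on the question being asked.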
Surveys & Experiments

• Even with Big Data, the data you need isn’t always in the building!
• … so ask (survey)!
  • Survey design
  • Stratified sampling

• … or experiment!
  • A/B Testing
  • Experimental Design
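Stratified sampling and A/B testing can both be sketched in a few lines. The user population, the `channel` field, and the conversion counts below are made up for illustration, and the two-proportion z-test is just one common way to analyse an A/B test, assumed here rather than taken from the slides.

```python
import random
from collections import defaultdict
from math import sqrt
from statistics import NormalDist


def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction from every stratum so small groups stay represented."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


def ab_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical population: 900 "web" users and 100 "mobile" users.
users = [{"id": i, "channel": "web" if i < 900 else "mobile"} for i in range(1000)]
sample = stratified_sample(users, "channel", fraction=0.1)

# Hypothetical experiment: 120 of 1000 convert on variant A, 150 of 1000 on B.
p_value = ab_test_p_value(120, 1000, 150, 1000)
print(len(sample), "sampled users; A/B p-value:", round(p_value, 3))
```

A simple random sample of 100 users could easily contain very few mobile users; sampling within each stratum guarantees both groups appear in proportion.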
Data Exploration & Visualization

• Limited by pixels
  • Big data = a big black blob

• Extract signal from noise
  • Aggregations
  • Heat maps
  • Smoothing
  • Small multiples
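To illustrate the aggregation idea behind a heat map: rather than drawing every point (the "big black blob"), count points per grid cell and display the counts. This is a rough sketch on simulated data; the grid size, plotting window, and text rendering are arbitrary choices.

```python
import random
from collections import Counter


def bin_counts(points, x_range, y_range, nx, ny):
    """Count points per grid cell; a heat map would colour cells by these counts."""
    (x0, x1), (y0, y1) = x_range, y_range
    counts = Counter()
    for x, y in points:
        if not (x0 <= x < x1 and y0 <= y < y1):
            continue  # skip points outside the plotting window
        # Clamp to the last cell to guard against floating-point rounding.
        col = min(nx - 1, int((x - x0) / (x1 - x0) * nx))
        row = min(ny - 1, int((y - y0) / (y1 - y0) * ny))
        counts[(row, col)] += 1
    return counts


# Simulated "big" data: 100,000 points from a 2-D standard normal.
rng = random.Random(42)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100_000)]

grid = bin_counts(points, (-3, 3), (-3, 3), nx=6, ny=6)

# Crude text rendering, top row first: 36 numbers summarise 100,000 points.
for row in reversed(range(6)):
    print(" ".join(f"{grid[(row, col)]:6d}" for col in range(6)))
```

The number of cells is bounded by the display, not by the data, so the same summary works whether there are a thousand points or a billion.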
Statistical Modeling & Forecasting

• You don’t always need big data
  • Sampling can help with observational bias

• Model selection
  • Feature extraction
  • Confounding?
  • Interactions?

• Model validation
  • Overfitting

• Prediction
  • Extrapolation
  • Confidence
http://xkcd.com/605/
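One way to see why model validation matters is to compare error on the training data with error on held-out data. The sketch below uses simulated data (a linear trend plus noise, an assumption made purely for illustration) and pits a least-squares line against a deliberately overfit nearest-neighbour "memorizer".

```python
import random
from statistics import mean

rng = random.Random(1)

# Simulated data: y = 2x + 1 plus Gaussian noise (illustrative assumption).
xs = [rng.uniform(0, 10) for _ in range(1000)]
data = [(x, 2.0 * x + 1.0 + rng.gauss(0, 2.0)) for x in xs]
train, holdout = data[:500], data[500:]


def fit_line(points):
    """Ordinary least squares for y = a + b*x."""
    mx = mean(x for x, _ in points)
    my = mean(y for _, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x


def fit_nearest(points):
    """Deliberately overfit: predict the y of the closest training x."""
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


def mse(model, points):
    """Mean squared prediction error of `model` on `points`."""
    return mean((model(x) - y) ** 2 for x, y in points)


line, memorizer = fit_line(train), fit_nearest(train)
for name, model in [("line", line), ("nearest", memorizer)]:
    print(name, "train MSE:", round(mse(model, train), 2),
          "holdout MSE:", round(mse(model, holdout), 2))
```

The memorizer reproduces its training data perfectly, which says nothing about how it predicts new cases; only the holdout error exposes that it has fitted the noise.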

Summary

• Big Data is great, but think of it as the “raw materials” for data science
  • After refining, “big” isn’t always so “Big”

• Use statistical insight to avoid pitfalls:
  • Inferences: Observational bias / Sampling bias
  • Predictions: Confounding / Overfitting
  • Think about variances and means (risk!)

• Some data scientists may miss these issues
  • Look for statistical expertise

• Further reading:
  • ComputerWorld: 12 predictive analytics screw-ups
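As a small illustration of the "variances and means" point in the summary: two options can share the same average and still carry very different risk. The delivery times below are invented for the example.

```python
from statistics import mean, stdev

# Hypothetical delivery times in minutes for two couriers.
steady = [10, 12, 8, 11, 9, 10]
erratic = [2, 25, 3, 20, 4, 6]

for name, times in [("steady", steady), ("erratic", erratic)]:
    print(name, "mean:", mean(times), "stdev:", round(stdev(times), 1))
```

Comparing only the means would rate the two couriers as equal; the spread is what tells you how often a delivery will be badly late.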