The Wild West of Data Wrangling
Sarah Guido
PyCon 2017
@sarah_guido
This talk:
•  A day in the life
•  Three examples of dealing with uncooperative data
•  Not ground truth!
Who am I?
•  Senior data scientist at Mashable
•  Mashable == internet culture media!
•  Data sciencing in Python
•  Twitter: @sarah_guido
Iris Dataset
Example 1: Predicting building sales
•  The problem: can we predict if a building will sell the following year?
•  The data: floors, location, square footage, price per sqft, etc.
•  The goal: provide valuable insight to platform users
Example 1: Predicting building sales
•  First thought: logistic regression using scikit-learn
•  Binary classification: sale/no sale
Problem…
Data: 95% no sale, 5% sale
Logistic regression: 95% accurate
DONE!
Problem: Class imbalance
Class imbalance
When the classes you are trying to predict are not equally represented, this can bias classification models toward the majority class.
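To see why "95% accurate" is not a win here, a minimal sketch using the 95% / 5% split from Example 1. The labels are synthetic, and the "model" is a hypothetical stand-in that always predicts "no sale"; it is not the logistic regression from the talk.

```python
# Synthetic labels mirroring the 95% / 5% split: 0 = no sale, 1 = sale.
y_true = [0] * 950 + [1] * 50

# Hypothetical stand-in model: ignores its input and always predicts "no sale".
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives (sales) that the model found."""
    actual = sum(t == positive for t in y_true)
    found = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return found / actual


print(accuracy(y_true, y_pred))  # 0.95
print(recall(y_true, y_pred))    # 0.0
```

Accuracy looks great, but the model never finds a single sale. Checking a per-class metric such as recall exposes the problem.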
Solution: Gradient boosting
Gradient boosting
Produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees, each one fit to correct the errors of those before it.
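The idea can be sketched in plain Python. This is a toy illustration, not the model from the talk: it assumes a single numeric feature, one-split "stumps" as the weak learners, squared-error loss, and made-up data. All function names are hypothetical. In practice you would more likely reach for a library implementation such as scikit-learn's GradientBoostingClassifier.

```python
def fit_stump(xs, residuals):
    """Find the one-threshold split on x that best fits the residuals (least squares)."""
    best = None
    # Excluding the largest x guarantees both sides of every split are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosting(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for x, p in zip(xs, predictions)]
    return base, stumps, learning_rate


def boosted_predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((y - boosted_predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up toy data: one feature, 0/1 target.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

baseline = fit_boosting(xs, ys, n_rounds=0)  # predicts the mean everywhere
model = fit_boosting(xs, ys)

print(mean_squared_error(baseline, xs, ys))
print(mean_squared_error(model, xs, ys))
```

Each stump on its own is a poor model; the ensemble improves because every round targets what the previous rounds still get wrong.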
Example 2: Clustering user interactions
The problem: how can we identify similar patterns based on click data?
The data: time, geolocation, cookie, browser user-agent string, referrer
The goal: understand how people interact with content over time
Why Scala?
Problem: Clustering user interactions
K-means clustering
An unsupervised learning method that groups data points into k clusters by assigning each point to the nearest cluster center under a distance metric.
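A minimal sketch of the standard k-means loop (assign points to the nearest center, then move each center to the mean of its points). It assumes squared Euclidean distance, a naive "first k points" initialization, and made-up 2-D points; the talk's actual pipeline is not shown in the slides.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, n_iter=10):
    centers = list(points[:k])  # naive initialization, chosen to keep the sketch deterministic
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [nearest(p, centers) for p in points]
        # Update step: each center moves to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centers[i] = mean(members)
    labels = [nearest(p, centers) for p in points]
    return labels, centers


# Made-up data: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = kmeans(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Note that every point must be a vector of the same length, which is exactly what makes raw per-user interaction histories awkward to cluster.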
Problem: Clustering the data
•  Only look at users with 5 or more interactions
•  Each user has a different number of interactions
•  Each data point ends up in a different cluster
Solution: Transform the data
date: 2017-04-09, 2017-04-13, 2017-04-30, 2017-05-01, 2017-05-12
Number of interactions: 5
Average time between interactions: ~8 days
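A small sketch of this transformation, using the five dates from the slide. The function name `summarize` is hypothetical; the point is that a variable-length list of timestamps becomes a fixed pair of numbers per user.

```python
from datetime import date

# The five interaction dates from the slide.
interactions = [
    date(2017, 4, 9),
    date(2017, 4, 13),
    date(2017, 4, 30),
    date(2017, 5, 1),
    date(2017, 5, 12),
]


def summarize(dates):
    """Collapse a user's interaction dates into (count, average gap in days)."""
    ordered = sorted(dates)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    average_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return len(ordered), average_gap


count, average_gap = summarize(interactions)
print(count, average_gap)  # 5 8.25
```

The gaps are 4, 17, 1, and 11 days, which average to 8.25, matching the "~8 days" on the slide.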
Solution: Transform the data
referrer: facebook, twitter
One-hot encode and transform to matrix
•  Facebook: [1, 0]
•  Twitter: [0, 1]
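One-hot encoding can be sketched in a few lines. The helper `one_hot` and the sample referrer list are assumptions for illustration; libraries such as scikit-learn (OneHotEncoder) and pandas (get_dummies) provide production versions.

```python
def one_hot(values, categories):
    """Encode each value as a row with a 1 in its category's column and 0 elsewhere."""
    column = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column[value]] = 1
        rows.append(row)
    return rows


categories = ["facebook", "twitter"]
referrers = ["facebook", "twitter", "facebook"]  # made-up sample of referrers

matrix = one_hot(referrers, categories)
print(matrix)  # [[1, 0], [0, 1], [1, 0]]
```

Each row now has the same length regardless of which referrer it came from, so it can sit in a matrix alongside the numeric features.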
Example 3: Understand audience composition
The problem: how can we effectively describe our audience?
The data: anonymized demographic and psychographic data
The goal: audience segmentation and channel analysis
Problem: insufficient data
•  Google Analytics data – 1/3 of URLs
•  Finicky API
•  Semi-useless psychographic data
Solution: accept defeat
Solution: don't accept defeat, make it work!
Solution: make it work!
•  Theory of highly-performant links
•  Segmentation through archetypal analysis
•  Go get more data!
General strategy
•  What problem are you trying to solve?
•  What’s wrong with your data?
•  What do you need that you don’t have?
Keep in mind…
•  Data your company collects is complicated
•  What you do to your data will affect the model
•  Creativity is your friend
•  Lots of ways to solve the problem
Thank you!
@sarah_guido