The Wild West of Data Wrangling

Sarah Guido
PyCon 2017
@sarah_guido

This talk:
•  A day in the life
•  Three examples of dealing with uncooperative data
•  Not ground truth!

Who am I?
•  Senior data scientist at Mashable
•  Mashable == internet culture media!
•  Data sciencing in Python
•  Twitter: @sarah_guido

Example 1: Predicting building sales
•  The problem: can we predict if a building will sell the following year?
•  The data: floors, location, square footage, price per sqft, etc.
•  The goal: provide valuable insight to platform users

Example 1: Predicting building sales
•  First thought: logistic regression using scikit-learn
•  Binary classification: sale/no sale

Problem…
Data: 95% no sale, 5% sale
Logistic regression: 95% accurate
DONE!

Problem: Class imbalance
Class imbalance
When the classes you are trying to predict are not equally
represented, classification models can become biased toward the
majority class.
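
To make the 95% figure concrete, here is a minimal sketch using synthetic labels with the same 95/5 split. The "model" is a hypothetical baseline that always predicts "no sale"; it is not the logistic regression from the talk, only an illustration of why accuracy alone is misleading here.

```python
# Synthetic labels matching the slide's split: 95% "no sale" (0), 5% "sale" (1).
labels = [0] * 950 + [1] * 50

# Hypothetical baseline: ignore the input and always predict the majority class.
predictions = [0] * len(labels)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the predictions recovered."""
    found = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    total = sum(1 for t in y_true if t == positive)
    return found / total


baseline_accuracy = accuracy(labels, predictions)
baseline_recall = recall(labels, predictions)

print(f"accuracy: {baseline_accuracy:.2f}")
print(f"recall on sales: {baseline_recall:.2f}")
```

The baseline scores 95% accuracy while finding none of the sales, which is why a per-class metric such as recall is worth checking on imbalanced data.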

Solution: Gradient boosting
Gradient boosting
Produces a prediction model in the form of an ensemble of
weak prediction models, typically decision trees.
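
A minimal sketch of what this could look like in scikit-learn. The slides do not show the actual code, features, or hyperparameters, so everything here is an assumption: the data is synthetic (generated with `make_classification` at a 95/5 split), and the model uses default settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the building data: roughly 95% class 0, 5% class 1.
X, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=5,
    weights=[0.95, 0.05],
    random_state=0,
)

# Stratify so the train and test sets keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# An ensemble of shallow decision trees, each fit to the errors of the previous ones.
model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)

predictions = model.predict(X_test)

# Recall on the minority class says more than overall accuracy here.
sale_recall = recall_score(y_test, predictions)
print(f"recall on the minority class: {sale_recall:.2f}")
```

How well this works depends heavily on the real features; the point of the sketch is only the shape of the workflow and the choice of a minority-class metric.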

Example 2: Clustering user interactions
The problem: how can we identify similar patterns based on click data?
The data: time, geolocation, cookie, browser user-agent string, referrer
The goal: understand how people interact with content over time

Problem: Clustering user interactions
K-means clustering
An unsupervised learning method of grouping data together
based on a distance metric.
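
As a rough illustration, k-means in scikit-learn can be run as below. The six 2-D points are made up for the example (they are not data from the talk), and the choice of two clusters is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points forming two well-separated groups.
points = np.array([
    [1.0, 1.0],
    [1.2, 0.8],
    [0.9, 1.1],
    [8.0, 8.0],
    [8.2, 7.9],
    [7.8, 8.1],
])

# n_clusters must be chosen up front; k-means assigns each point to the
# nearest cluster center by Euclidean distance.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(points)

print(cluster_labels)
print(kmeans.cluster_centers_)
```

Because the distance metric is Euclidean, features on very different scales usually need to be standardized first, otherwise the largest-valued feature dominates the clustering.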

Problem: Clustering the data
•  Only look at users with 5 or more interactions
•  Each user has a different number of interactions
•  Each data point (a single interaction) can end up in a different cluster, even for the same user

Solution: Transform the data
date: 2017-04-09, 2017-04-13, 2017-04-30, 2017-05-01, 2017-05-12
Number of interactions: 5
Average time between interactions: ~8 days
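
The two summary features on this slide can be computed from the raw dates as follows. This is a sketch using only the standard library; the variable names are illustrative, not taken from the talk.

```python
from datetime import date

# The interaction dates shown on the slide, for a single user.
interaction_dates = [
    date(2017, 4, 9),
    date(2017, 4, 13),
    date(2017, 4, 30),
    date(2017, 5, 1),
    date(2017, 5, 12),
]

# Feature 1: how many interactions the user had.
n_interactions = len(interaction_dates)

# Feature 2: average gap, in days, between consecutive interactions.
ordered = sorted(interaction_dates)
gaps_in_days = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
average_gap_days = sum(gaps_in_days) / len(gaps_in_days)

print(n_interactions)      # 5
print(gaps_in_days)        # [4, 17, 1, 11]
print(average_gap_days)    # 8.25
```

Collapsing each user's variable-length history into a fixed number of features like these gives every user a row of the same width, which is what a clustering algorithm needs.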

Solution: Transform the data
referrer: facebook, twitter
One-hot encode and transform to matrix
•  Facebook: [1, 0]
•  Twitter: [0, 1]
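
A small sketch of the encoding step in plain Python, with a made-up list of referrers. In practice a library encoder (for example scikit-learn's `OneHotEncoder`) would likely be used; the hand-rolled version here is only meant to show what the transformation does.

```python
# Made-up referrer values, one per interaction.
referrers = ["facebook", "twitter", "twitter", "facebook"]

# Fix a column order: one column per distinct category, sorted alphabetically.
categories = sorted(set(referrers))
column_index = {category: i for i, category in enumerate(categories)}


def one_hot(value):
    """Return a vector with a 1 in the column for `value` and 0 elsewhere."""
    vector = [0] * len(categories)
    vector[column_index[value]] = 1
    return vector


# One row per interaction, one column per referrer.
referrer_matrix = [one_hot(r) for r in referrers]

print(categories)        # ['facebook', 'twitter']
print(referrer_matrix)   # [[1, 0], [0, 1], [0, 1], [1, 0]]
```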

Example 3: Understand audience composition
The problem: how can we effectively describe our audience?
The data: anonymized demographic and psychographic data
The goal: audience segmentation and channel analysis

Problem: insufficient data
•  Google Analytics data – only 1/3 of URLs
•  Finicky API
•  Semi-useless psychographic data

Solution: accept defeat? Make it work!

Solution: make it work!
•  Theory of highly performant links
•  Segmentation through archetypal analysis
•  Go get more data!

General strategy
•  What problem are you trying to solve?
•  What’s wrong with your data?
•  What do you need that you don’t have?

Keep in mind…
•  Data your company collects is complicated
•  What you do to your data will affect the model
•  Creativity is your friend
•  Lots of ways to solve the problem
