MLSD18. Basic Transformations - BigML

1st edition
November 4-5, 2018
Machine Learning School in Doha

BigML, Inc
Basic Transformations
Making Data Machine Learning Ready
Poul Petersen
CIO, BigML

BigML, Inc · @bigmlcom · @QatarComputing · #MLSD18
In a Perfect World…
Q: How does a physicist milk a cow?
A: Well, first let us consider a spherical cow...
Q: How does a data scientist build a model?
A: Well, first let us consider perfectly formatted data…

The Dream
Source → Dataset → Model → Profit!

The Reality
CRM
Web Accounts
Transactions
ML Ready?

Obstacles
• Data Structure
  • Scattered across systems
  • Wrong "shape"
  • Unlabelled data
• Data Value
  • Format: spelling, units
  • Missing values
  • Non-optimal correlation
  • Non-existent correlation
• Data Significance
  • Unwanted: PII, Non-Preferred
  • Expensive to collect
  • Insidious: Leakage, obviously correlated
Data Transformation
Feature Engineering
Feature Selection

Data Structure

Remember ML Tasks?
CLASSIFICATION: Will this component fail?
REGRESSION: How many days until this component fails?
TIME SERIES FORECASTING: How many components will fail in a week from now?
CLUSTER ANALYSIS: Which machines behave similarly?
ANOMALY DETECTION: Is this behavior normal?
ASSOCIATION DISCOVERY: What alerts are triggered together before a failure?
What “shape” is the data for each ML task?

Classification
[Figure labels: Categorical; Training; Testing; Predicting]

Regression
[Figure labels: Numeric; Training; Testing; Predicting]

Time Series
[Figure labels: Numeric; Training; Testing; Forecasting; Time]

Anomaly Detection

Cluster Analysis

Association Discovery

Data Labels
Unsupervised Learning
• Anomaly Detection
• Clustering
• Association Discovery
Supervised Learning
• Classification
• Regression
• Time Series
The only difference, in terms of ML-Ready structure, is the presence of a "label".

ML Ready Data
Tabular Data (rows and columns):
• Each row
  • is one instance
  • contains all the information about that one instance.
  • For Time Series, the rows are not independent
• Each column
  • is a field that describes a property of the instance.
[Figure labels: rows are Instances; columns are Fields (Features)]

Data Transformations

SF Restaurants Example
https://data.sfgov.org/Health-and-Social-Services/Restaurant-Scores/stya-26eb
https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/
-- Create the database and load each CSV export into its own table.
create database sf_restaurants;
use sf_restaurants;

create table businesses (business_id int, name varchar(1000), address varchar(1000),
  city varchar(1000), state varchar(100), postal_code varchar(100),
  latitude varchar(100), longitude varchar(100), phone_number varchar(100));
load data local infile './businesses.csv' into table businesses
  fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;

create table inspections (business_id int, score varchar(10), idate varchar(8), itype varchar(100));
load data local infile './inspections.csv' into table inspections
  fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;

create table violations (business_id int, vdate varchar(8), description varchar(1000));
load data local infile './violations.csv' into table violations
  fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;

-- The legend maps score ranges to a textual description.
create table legend (Minimum_Score int, Maximum_Score int, Description varchar(100));
load data local infile './legend.csv' into table legend
  fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;

SF Restaurants Data

Building an ML Application
State the problem as an ML task
Data wrangling
Feature engineering
Modeling and Evaluations
Predictions
Measure Results
Data transformations
Task

State the Problem
• Predict rating: Score from 0 to 100
  • This is a regression problem
• Based on business profile:
  • Description: kitchen, cafe, etc.
  • Location: zip, latitude, longitude

Denormalizing with Joins
[Figure: the business, inspections and violations tables are joined into a single scores table of Instances × Features (millions)]
Data is usually normalized in relational databases; ML-Ready datasets need the information denormalized into a single dataset.
Denormalize:
create table scores
  select * from businesses left join inspections using (business_id);
create table scores_last
  select a.* from scores as a
  join (select business_id, max(idate) as idate from scores group by business_id) as b
  on a.business_id = b.business_id and a.idate = b.idate;

ML-Ready: Each row contains all the information about that one instance.

Add Label:
create table scores_last_label
  select scores_last.*, Description as score_label
  from scores_last join legend on score <= Maximum_Score and score >= Minimum_Score;
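
The same denormalize-and-label steps can be sketched outside the database. The Python below is only an illustration under stated assumptions: the miniature tables, the business names and the legend ranges are hypothetical stand-ins for the real CSV files, and the helper names are not part of the original workflow.

```python
# Hypothetical miniature versions of the businesses, inspections and legend tables.
businesses = [
    {"business_id": 10, "name": "Cafe A", "postal_code": "94104"},
    {"business_id": 19, "name": "Kitchen B", "postal_code": "94109"},
    {"business_id": 24, "name": "Deli C", "postal_code": "94133"},
]
inspections = [
    {"business_id": 10, "score": 92, "idate": "20140114"},
    {"business_id": 10, "score": 94, "idate": "20140729"},
    {"business_id": 19, "score": 74, "idate": "20141110"},
    {"business_id": 19, "score": 88, "idate": "20160513"},
]
# (Minimum_Score, Maximum_Score, Description); these ranges are assumed.
legend = [
    (0, 70, "Poor"),
    (71, 85, "Needs Improvement"),
    (86, 90, "Adequate"),
    (91, 100, "Good"),
]


def latest_inspection(rows):
    """Return {business_id: inspection}, keeping the row with the greatest idate."""
    latest = {}
    for row in rows:
        current = latest.get(row["business_id"])
        # YYYYMMDD strings sort chronologically, so string comparison is enough.
        if current is None or row["idate"] > current["idate"]:
            latest[row["business_id"]] = row
    return latest


def score_label(score, ranges):
    """Look up the legend description whose range contains the score."""
    for low, high, description in ranges:
        if low <= score <= high:
            return description
    return None


def denormalize(business_rows, inspection_rows, ranges):
    """Build one row per business with its latest score and score label."""
    latest = latest_inspection(inspection_rows)
    result = []
    for business in business_rows:
        inspection = latest.get(business["business_id"])
        if inspection is None:
            continue  # businesses without inspections drop out, as in the SQL
        row = dict(business)
        row["score"] = inspection["score"]
        row["idate"] = inspection["idate"]
        row["score_label"] = score_label(inspection["score"], ranges)
        result.append(row)
    return result


scores_last_label = denormalize(businesses, inspections, legend)
```

Each resulting row carries the business profile, its most recent score and the label, which is the "one row per instance" shape the slide asks for.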

Joins
• Datasets to join need to have a field in common
  • joining sales and demographics on customer_id
  • joining employee and budget details on department_id
• Datasets to join do not need to have the same dimensions
• Joins can be performed in several ways
  • Left, Right, Inner, Outer…

Left Join
• In a left join of dataset A to B:
  • Returns all records from the left (A), and the matched records from B
  • The result is NULL from B if there is no match.

A
_id  field1
1    34
2    56
3    123
4    56
5    79

B
_id  field2
1    red
2    green
4    blue
6    black

A left join B =
_id  field1  field2
1    34      red
2    56      green
3    123     null
4    56      blue
5    79      null

B has no “3” or “5”.

Right Join
• In a right join of dataset A to B:
  • Returns all records from the right (B), and the matched records from A
  • The result is NULL from A if there is no match.

A
_id  field1
1    34
2    56
3    123
4    56
5    79

B
_id  field2
1    red
2    green
4    blue
6    black

A right join B =
_id  field2  field1
1    red     34
2    green   56
4    blue    56
6    black   null

A has no “6”; “3” and “5” from A are unused.

Inner Join
• In an inner join of dataset A to B:
  • Returns only the records from A that match records in B
  • If there is no match between A and B, the record is ignored

A
_id  field1
1    34
2    56
3    123
4    56
5    79

B
_id  field2
1    red
2    green
4    blue
6    black

A inner join B =
_id  field1  field2
1    34      red
2    56      green
4    56      blue

“3” and “5” from A are unused; “6” from B is unused.

Full Outer Join
• In a full join of dataset A to B:
  • Returns all records from A and all records from B
  • If a record has no match in the other dataset, the fields from that dataset are null

A
_id  field1
1    34
2    56
3    123
4    56
5    79

B
_id  field2
1    red
2    green
4    blue
6    black

A full join B =
_id  field1  field2
1    34      red
2    56      green
3    123     null
4    56      blue
5    79      null
6    null    black

A has no “6”; B has no “3” or “5”.
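
The four join types can be compared side by side in a short sketch. This is a minimal illustration in plain Python, assuming unique _id values in both inputs; the function name `join` and the `how` argument are hypothetical, and the output always lists field1 before field2 (the right-join slide shows them in the other order).

```python
def join(a, b, how="left"):
    """Join two lists of (_id, value) pairs on _id.

    Returns (_id, field1, field2) tuples; None marks a missing match.
    Assumes _id values are unique within each input.
    """
    a_by_id = dict(a)
    b_by_id = dict(b)
    if how == "left":
        ids = list(a_by_id)                       # every _id from A
    elif how == "right":
        ids = list(b_by_id)                       # every _id from B
    elif how == "inner":
        ids = [i for i in a_by_id if i in b_by_id]  # only _ids in both
    elif how == "full":
        ids = sorted(set(a_by_id) | set(b_by_id))   # every _id from either
    else:
        raise ValueError(f"unknown join type: {how}")
    return [(i, a_by_id.get(i), b_by_id.get(i)) for i in ids]


# The A and B tables from the slides.
A = [(1, 34), (2, 56), (3, 123), (4, 56), (5, 79)]
B = [(1, "red"), (2, "green"), (4, "blue"), (6, "black")]

left = join(A, B, "left")
right = join(A, B, "right")
inner = join(A, B, "inner")
full = join(A, B, "full")
```

Only the set of _id values that survives differs between the four variants; the lookup of field1 and field2 is identical.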

Joins with non-unique IDs
Consider a full join of A with B where B has non-unique _id entries: each matching row in B produces its own row in the result.

A
_id  field1
1    34
2    56
3    123
4    56
5    79

B
_id  field2
1    red
2    green
4    blue
6    black
4    green

A full join B =
_id  field1  field2
1    34      red
2    56      green
3    123     null
4    56      blue
4    56      green
5    79      null
6    null    black
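
The row duplication caused by repeated _id values can be sketched as follows. For brevity this hypothetical helper performs a left join rather than a full join, so the unmatched “6” from B does not appear; the duplication of _id 4 is the point of interest.

```python
from collections import defaultdict


def left_join_multi(a, b):
    """Left join where b may repeat an _id: each match yields its own row."""
    b_by_id = defaultdict(list)
    for _id, value in b:
        b_by_id[_id].append(value)
    rows = []
    for _id, field1 in a:
        # Unmatched rows from a are kept once, with None for field2.
        for field2 in b_by_id.get(_id, [None]):
            rows.append((_id, field1, field2))
    return rows


A = [(1, 34), (2, 56), (3, 123), (4, 56), (5, 79)]
B = [(1, "red"), (2, "green"), (4, "blue"), (6, "black"), (4, "green")]

joined = left_join_multi(A, B)
```

A has five rows but the result has six, because _id 4 matches twice. This is worth checking after any join, since duplicated instances break the "one row per instance" rule.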

Joins

Re-Define the Goal
• Predict rating: Score from 0 to 100
• This is a regression problem
• Description: kitchen, restaurant, etc.
• Location: zip code, latitude, longitude
• Number of violations, text of violations
• We need to clean up the violations field:
  • remove the “[ violation fixed…]” strings

Data Cleaning
Homogenize missing values and different types in the same
feature, fix input errors, correct semantic issues, types, etc.
Original data
Name           Date        Duration (s)  Genre    Plays
Highway star   1984-05-24  -             Rock     139
Blues alive    1990/03/01  281           Blues    239
Lonely planet  2002-11-19  5:32s         Techno   42
Dance, dance   02/23/1983  312           Disco    N/A
The wall       1943-01-20  218           Reagge   83
Offside down   1965-02-19  4 minutes     Techno   895
The alchemist  2001-11-21  418           Bluesss  178
Bring me down  18-10-98    328           Classic  21
The scarecrow  1994-10-12  269           Rock     734

Cleaned data
Name           Date        Duration (s)  Genre    Plays
Highway star   1984-05-24                Rock     139
Blues alive    1990-03-01  281           Blues    239
Lonely planet  2002-11-19  332           Techno   42
Dance, dance   1983-02-23  312           Disco
The wall       1943-01-20  218           Reagge   83
Offside down   1965-02-19  240           Techno   895
The alchemist  2001-11-21  418           Blues    178
Bring me down  1998-10-18  328           Classic  21
The scarecrow  1994-10-12  269           Rock     734

update violations
  set description = substr(description, 1, instr(description, ' [ date violation corrected:') - 1)
  where instr(description, ' [ date violation corrected:') > 0;
(replace (field "description") "[.*" "")
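
The cleaning rules shown in the two tables can be written as small functions. This is a rough Python sketch, not the method used in the deck: the list of accepted date formats, the function names and the sample violation string are assumptions based only on the values visible above.

```python
import re
from datetime import datetime

# Date layouts seen in the original data; assumed to be the only ones present.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y", "%d-%m-%y")


def clean_date(text):
    """Return the date as YYYY-MM-DD, trying each known format in turn."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def clean_duration(text):
    """Return the duration in whole seconds, or None when it is missing."""
    text = text.strip().lower()
    if re.fullmatch(r"\d+", text):
        return int(text)
    match = re.fullmatch(r"(\d+):(\d{2})s?", text)      # e.g. "5:32s"
    if match:
        return int(match.group(1)) * 60 + int(match.group(2))
    match = re.fullmatch(r"(\d+) minutes?", text)       # e.g. "4 minutes"
    if match:
        return int(match.group(1)) * 60
    return None


def clean_plays(text):
    """Return the play count as an int, or None for markers such as 'N/A'."""
    return int(text) if text.isdigit() else None


def clean_violation(description):
    """Drop everything from the first '[' onwards, including the space before it."""
    return re.sub(r"\s*\[.*", "", description)
```

Returning None for values that cannot be parsed keeps missing data explicit instead of guessing a replacement.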

Data Cleaning

Re-Define the Goal
• Predict rating: Poor / Needs Improvement / Adequate / Good
• This is a classification problem
• Need to aggregate the violations:

Aggregations
violations
business_id  date      description
10           20140114  Unclean or degraded floors walls or ceilings
10           20140114  Inadequate and inaccessible handwashing facilities
19           20160513  Food safety certificate or food handler card not available
19           20160513  Unapproved or unmaintained equipment or utensils
19           20141110  Inadequate food safety knowledge or lack of certified food safety manager
19           20141110  Improper storage of equipment utensils or linens
19           20140214  Permit license or inspection report not posted
19           20140214  Inadequately cleaned or sanitized food contact surfaces

Instances
Tabular Data (rows and columns):
• Each row
  • is one instance
  • contains all the information about that one instance.

Aggregation: Count
Content        Genre    Duration  Play Time            User     Device
Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
The wall       Reagge   218       2015-05-14 09:02:55  User002
Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet

Count on User
User     Count
User001  3
User005  2
User003  2
User002  1
Number of playbacks per user

Aggregation: Count Distinct
Count distinct Genre on User
User     Distinct Genre
User001  3
User005  2
User003  2
User002  1
Number of distinct Genre played per user

Aggregation: Count Missing
Count missing Device on User
User     Missing Device
User001  0
User005  0
User003  0
User002  1
Number of missing Device per user

Aggregation: Sum
Sum Duration on User
User     Sum Duration
User001  830
User005  521
User003  750
User002  218
Total Duration per User

Aggregation: Average
Average Duration on User
User     Average Duration
User001  276.67
User005  260.50
User003  375.00
User002  218.00
Average Duration per User

Aggregation: Maximum
Maximum Duration on User
User     Max Duration
User001  328
User005  281
User003  418
User002  218
Maximum Duration per User

Aggregation: Minimum
Minimum Duration on User
User     Min Duration
User001  190
User005  240
User003  332
User002  218
Minimum Duration per User
• Similar for standard deviation and variance
• Possible to combine multiple aggregations on the same field
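
All of the aggregations above can be computed in one pass per user. The following Python is a minimal sketch using the eight playbacks from the Count slide; the Play Time column is left out because none of these aggregations use it, and the function and key names are hypothetical.

```python
from collections import defaultdict

# (content, genre, duration_s, user, device); None marks a missing device.
playbacks = [
    ("Highway star", "Rock", 190, "User001", "TV"),
    ("Blues alive", "Blues", 281, "User005", "Tablet"),
    ("Lonely planet", "Techno", 332, "User003", "TV"),
    ("Dance, dance", "Disco", 312, "User001", "Tablet"),
    ("The wall", "Reagge", 218, "User002", None),
    ("Offside down", "Techno", 240, "User005", "Tablet"),
    ("The alchemist", "Blues", 418, "User003", "TV"),
    ("Bring me down", "Classic", 328, "User001", "Tablet"),
]


def aggregate_by_user(rows):
    """Group the playbacks by user and compute several aggregations per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[3]].append(row)
    summary = {}
    for user, items in groups.items():
        durations = [item[2] for item in items]
        summary[user] = {
            "count": len(items),
            "distinct_genre": len({item[1] for item in items}),
            "missing_device": sum(1 for item in items if item[4] is None),
            "sum_duration": sum(durations),
            "avg_duration": round(sum(durations) / len(durations), 2),
            "max_duration": max(durations),
            "min_duration": min(durations),
        }
    return summary


per_user = aggregate_by_user(playbacks)
```

The result has one row per user, so a table with one row per playback becomes a table with one row per instance of interest.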

Pivoting
Different values of a feature are pivoted to new columns in the result dataset.

Original data (last rows)
Content        Genre    Duration  Play Time            User     Device
The wall       Reagge   218       2015-05-14 09:02:55  User002  Smartphone
Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet
The scarecrow  Rock     269       2015-05-15 12:37:05  User003  Smartphone

Aggregated data with pivoted columns
User     Num.Playbacks  Total Time  Pref.Device  NP_TV  NP_Tablet  NP_Smartphone  TT_TV  TT_Tablet  TT_Smartphone
User001  3              830         Tablet       1      2          0              190    640        0
User002  1              218         Smartphone   0      0          1              0      0          218
User003  3              1019        TV           2      0          1              750    0          269
User005  2              521         Tablet       0      2          0              0      521        0
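
Pivoting can be sketched as an aggregation keyed by both the user and the pivoted value. The Python below is an illustration only: the nine playbacks are reassembled from the rows shown on the slides, and the function name, the lower-case key names and the tie-breaking of the preferred device are assumptions.

```python
from collections import Counter, defaultdict

# (user, duration_s, device) for the nine playbacks behind the pivoted table.
playbacks = [
    ("User001", 190, "TV"), ("User005", 281, "Tablet"), ("User003", 332, "TV"),
    ("User001", 312, "Tablet"), ("User002", 218, "Smartphone"),
    ("User005", 240, "Tablet"), ("User003", 418, "TV"),
    ("User001", 328, "Tablet"), ("User003", 269, "Smartphone"),
]
DEVICES = ("TV", "Tablet", "Smartphone")


def pivot_by_device(rows):
    """Aggregate per user and spread each device value into its own columns."""
    counts = defaultdict(Counter)   # user -> device -> number of playbacks
    totals = defaultdict(Counter)   # user -> device -> total seconds
    for user, duration, device in rows:
        counts[user][device] += 1
        totals[user][device] += duration
    table = {}
    for user in sorted(counts):
        row = {
            "num_playbacks": sum(counts[user].values()),
            "total_time": sum(totals[user].values()),
            # Device with the most playbacks; ties go to the first one seen.
            "pref_device": counts[user].most_common(1)[0][0],
        }
        for device in DEVICES:
            row[f"NP_{device}"] = counts[user][device]   # 0 when never used
            row[f"TT_{device}"] = totals[user][device]
        table[user] = row
    return table


pivoted = pivot_by_device(playbacks)
```

Fixing the list of devices up front keeps the column set stable even when a user never used one of them.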

Re-Define the Goal
• Predict rating: Poor / Needs Improvement / Adequate / Good
• This is a classification problem
• Need to aggregate the violations:
  • number of violations
  • concatenate violation descriptions
Aggregations

Time Windows
Create new features using values over different periods of time
[Figure: Instances × Features snapshots over Time become per-year feature columns: score_2013, score_2014, score_2015]

-- Latest 2013 score for each business.
create table scores_2013
  select a.business_id, a.score as score_2013, a.idate as idate_2013
  from inspections as a
  join (select business_id, max(idate) as idate
        from inspections
        where substr(idate,1,4) = "2013"
        group by business_id) as b
  on a.business_id = b.business_id and a.idate = b.idate;

-- scores_2014 is not shown in the slides; it is assumed to be built the same way.
create table scores_over_time
  select * from businesses
  left join scores_2013 using (business_id)
  left join scores_2014 using (business_id);
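
The time-window idea can also be sketched without one table per year. The Python below is a minimal illustration with hypothetical inspection rows: it keeps the latest score per business and year, then spreads the years into score_<year> columns.

```python
def scores_over_time(inspections, years):
    """Return {business_id: {"score_<year>": latest score in that year}}.

    inspections holds (business_id, score, idate) with idate as 'YYYYMMDD'.
    A year without inspections yields None for that column.
    """
    latest = {}  # (business_id, year) -> (idate, score)
    for business_id, score, idate in inspections:
        key = (business_id, idate[:4])
        if key not in latest or idate > latest[key][0]:
            latest[key] = (idate, score)
    business_ids = sorted({business_id for business_id, _ in latest})
    return {
        business_id: {
            f"score_{year}": latest.get((business_id, year), (None, None))[1]
            for year in years
        }
        for business_id in business_ids
    }


# Hypothetical inspections, not taken from the real dataset.
inspections = [
    (10, 92, "20130305"), (10, 96, "20131120"), (10, 94, "20140729"),
    (19, 74, "20141110"), (19, 88, "20150513"),
]
windows = scores_over_time(inspections, ("2013", "2014", "2015"))
```

Each business ends up with one row and one column per time window, with None where no inspection took place in that window.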

Time Windows

Updates
Need a current view of the data, but new data only comes in
batches of changes
[Figure labels: day 1, day 2, day 3; Instances; Features]
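
One way to picture keeping a current view from batches of changes is an "upsert" into a table keyed by instance id. This Python sketch is purely illustrative: the batch format (id, fields), the use of None to signal a deletion and the function name are all assumptions, not something the slides specify.

```python
def apply_batch(current, batch):
    """Apply one batch of changes to the current view (a dict keyed by id).

    Each change is (instance_id, fields). Hypothetical convention:
    fields of None means the instance was deleted.
    """
    for instance_id, fields in batch:
        if fields is None:
            current.pop(instance_id, None)
        else:
            # Insert a new instance or overwrite only the changed fields.
            current.setdefault(instance_id, {}).update(fields)
    return current


view = {}
apply_batch(view, [(1, {"score": 90}), (2, {"score": 75})])   # day 1
apply_batch(view, [(2, {"score": 80}), (3, {"score": 60})])   # day 2
apply_batch(view, [(1, None)])                                # day 3
```

After the three batches the view holds only the latest state of each surviving instance, which is what a model would be trained on.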

Streaming
Data only comes in single changes
[Figure labels: data stream; Instances; Features; Stream; Batch; (Kafka, etc.)]

Time Series Transformations

Independent Data
Color   Mass  PPAP
red     11    pen
green   45    apple
red     53    apple
yellow  0     pen
blue    2     pen
green   422   pineapple
yellow  555   pineapple
blue    7     pen

Discovering patterns:
• Color = “red” → Mass < 100
• PPAP = “pineapple” → Color ≠ “blue”
• Color = “blue” → PPAP = “pen”

Independent Data
Color   Mass  PPAP
green   45    apple
blue    2     pen
green   422   pineapple
blue    7     pen
yellow  0     pen
yellow  555   pineapple
red     53    apple
red     11    pen

Patterns still hold when rows are re-arranged:
• Color = “red” → Mass < 100
• PPAP = “pineapple” → Color ≠ “blue”
• Color = “blue” → PPAP = “pen”

Dependent Data
Year  Pineapple Harvest (tons)
1986  50.74
1987  22.03
1988  50.69
1989  40.38
1990  29.80
1991  9.90
1992  73.93
1993  22.95
1994  139.09
1995  115.17
1996  193.88
1997  175.31
1998  223.41
1999  295.03
2000  450.53

[Chart: Pineapple Harvest in Tons (0 to 500) by Year (1986 to 2000), annotated with Trend and Error]

Dependent Data
[Chart: Pineapple Harvest in Tons (0 to 500) by Year (1986 to 2000), with the values re-arranged]

Year  Pineapple Harvest (tons)
1986  139.09
1987  175.31
1988  9.90
1989  22.95
1990  450.53
1991  73.93
1992  40.38
1993  22.03
1994  295.03
1995  50.74
1996  29.80
1997  223.41
1998  115.17
1999  193.88
2000  50.69

Rearranging Disrupts Patterns
Transformations need to preserve sorting
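
A small example of an order-dependent transformation: adding the previous year's value as a feature. This Python sketch is an illustration chosen here, not a step from the deck; it restores chronological order before computing the feature, since the result is only meaningful on sorted rows.

```python
def add_previous_value(rows):
    """rows holds (year, harvest) pairs in any order.

    Returns (year, harvest, previous_harvest) tuples in year order;
    the first year has no previous value, so it gets None.
    """
    ordered = sorted(rows)          # restore chronological order first
    result = []
    previous = None
    for year, harvest in ordered:
        result.append((year, harvest, previous))
        previous = harvest
    return result


# The first three harvest values from the slide, deliberately shuffled.
shuffled = [(1988, 50.69), (1986, 50.74), (1987, 22.03)]
with_lag = add_previous_value(shuffled)
```

Without the sort, "previous" would simply be whichever row happened to come before, and the feature would carry no information about the trend.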

Calendar Correction
• Time Series data can show variations due to aggregation
• For example: “pounds/month” produced
• Transform: pounds/month ÷ days/month = pounds/day

Calendar Correction

Summary
• Data is usually awful
• Requires clean-up
• Transformations
• Enormous part of the effort in applying ML
• Techniques:
• Denormalizing
• Aggregating / Pivoting
• Time windows / Streaming
• Calendar corrections
