1st edition
November 4-5, 2018
Machine Learning School in Doha
BigML, Inc X
Basic Transformations
Making Data Machine Learning Ready
Poul Petersen
CIO, BigML
BigML, Inc X· @bigmlcom · @QatarComputing · #MLSD18 ·
In a Perfect World…
Q: How does a physicist milk a cow?
A: Well, first let us consider a spherical cow...
Q: How does a data scientist build a model?
A: Well, first let us consider perfectly formatted data…
The Dream
Source → Dataset → Model → Profit!
The Reality
CRM
Web Accounts
Transactions
ML Ready?
Obstacles
• Data Structure
  • Scattered across systems
  • Wrong "shape"
  • Unlabelled data
• Data Value
  • Format: spelling, units
  • Missing values
  • Non-optimal correlation
  • Non-existent correlation
• Data Significance
  • Unwanted: PII, Non-Preferred
  • Expensive to collect
  • Insidious: Leakage, obviously correlated
Data Transformation · Feature Engineering · Feature Selection
Data Structure
Remember ML Tasks?
CLASSIFICATION Will this component fail?
REGRESSION How many days until this component fails?
TIME SERIES FORECASTING How many components will fail in a week from now?
CLUSTER ANALYSIS Which machines behave similarly?
ANOMALY DETECTION Is this behavior normal?
ASSOCIATION DISCOVERY What alerts are triggered together before a failure?
What “shape” is the data for each ML task?
Classification
Categorical / Training / Testing
Predicting
Regression
Numeric / Training / Testing
Predicting
Time Series
Numeric / Training / Testing
Forecasting
Time
Anomaly Detection
Cluster Analysis
Association Discovery
Data Labels
Unsupervised Learning
• Anomaly Detection
• Clustering
• Association Discovery
Supervised Learning
• Classification
• Regression
• Time Series
The only difference, in terms of ML-Ready structure, is the presence of a "label".
ML Ready Data
Diagram labels: Instances (rows), Fields / Features (columns)
Tabular Data (rows and columns):
• Each row
  • is one instance
  • contains all the information about that one instance.
  • For Time Series, the rows are not independent
• Each column
  • is a field that describes a property of the instance.
Data Transformations
SF Restaurants Example
https://data.sfgov.org/Health-and-Social-Services/Restaurant-Scores/stya-26eb
https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/
create database sf_restaurants;
use sf_restaurants;
-- One table per CSV file; every load skips the header row.
create table businesses (business_id int, name varchar(1000), address varchar(1000), city varchar(1000), state varchar(100), postal_code varchar(100), latitude varchar(100), longitude varchar(100), phone_number varchar(100));
load data local infile './businesses.csv' into table businesses fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
create table inspections (business_id int, score varchar(10), idate varchar(8), itype varchar(100));
load data local infile './inspections.csv' into table inspections fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
create table violations (business_id int, vdate varchar(8), description varchar(1000));
load data local infile './violations.csv' into table violations fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
create table legend (Minimum_Score int, Maximum_Score int, Description varchar(100));
load data local infile './legend.csv' into table legend fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
SF Restaurants Data
Building a ML Application
State the problem as an ML task
Data wrangling
Feature engineering
Modeling and Evaluations
Predictions
Measure Results
Data transformations
Task
State the Problem
• Predict rating: Score from 0 to 100
  • This is a regression problem
• Based on business profile:
  • Description: kitchen, cafe, etc.
  • Location: zip, latitude, longitude
Denormalizing with Joins
Figure: the business, inspections and violations tables are joined into one scores table of Instances × Features (millions).
Data is usually normalized in relational databases; ML-Ready datasets need the information de-normalized in a single dataset.
ML-Ready: Each row contains all the information about that one instance.
Denormalize:
create table scores select * from businesses left join inspections using (business_id);
create table scores_last select a.* from scores as a join (select business_id, max(idate) as idate from scores group by business_id) as b where a.business_id = b.business_id and a.idate = b.idate;
Add Label:
create table scores_last_label select scores_last.*, Description as score_label from scores_last join legend on score <= Maximum_Score and score >= Minimum_Score;
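The denormalize and add-label steps can be tried end to end on a toy database. The sketch below is only an illustration: it uses Python's built-in sqlite3 module instead of MySQL (so the statements read `create table ... as select`), and the business rows, inspection rows and legend ranges are made-up assumptions, not the real SF data.

```python
import sqlite3

# Hypothetical miniature versions of the businesses, inspections and legend tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table businesses (business_id int, name text);
create table inspections (business_id int, score int, idate text);
create table legend (Minimum_Score int, Maximum_Score int, Description text);
insert into businesses values (10, 'Cafe A'), (19, 'Kitchen B'), (24, 'Diner C');
insert into inspections values
  (10, 82, '20140114'), (10, 94, '20160301'),
  (19, 74, '20141110'), (19, 68, '20160513');
insert into legend values
  (0, 70, 'Poor'), (71, 85, 'Needs Improvement'),
  (86, 90, 'Adequate'), (91, 100, 'Good');

-- Denormalize: one row per (business, inspection); the left join keeps
-- businesses that have no inspection, with NULL score and idate.
create table scores as
  select * from businesses left join inspections using (business_id);

-- Keep only the most recent inspection of each business.
create table scores_last as
  select a.* from scores as a
  join (select business_id, max(idate) as idate from scores group by business_id) as b
    on a.business_id = b.business_id and a.idate = b.idate;

-- Add the label by matching the score against the legend ranges.
create table scores_last_label as
  select scores_last.*, Description as score_label
  from scores_last join legend
    on score <= Maximum_Score and score >= Minimum_Score;
""")

rows = conn.execute(
    "select business_id, score, score_label from scores_last_label order by business_id"
).fetchall()
print(rows)
```

Business 24 has no inspection, so its NULL date never matches in the second step and it drops out of scores_last; whether that is desirable depends on the problem.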
Joins
• Datasets to join need to have a field in common
  • joining sales and demographics on customer_id
  • joining employee and budget details on department_id
• Datasets to join do not need to have the same dimensions
• Joins can be performed in several ways
  • Left, Right, Inner, Outer…
Left Join
• In a Left join of dataset A to B:
  • Returns all records from the left A, and the matched records from B
  • The result is NULL from B, if there is no match.
A:
_id | field1
1   | 34
2   | 56
3   | 123
4   | 56
5   | 79
B:
_id | field2
1   | red
2   | green
4   | blue
6   | black
A left join B =
_id | field1 | field2
1   | 34     | red
2   | 56     | green
3   | 123    | null
4   | 56     | blue
5   | 79     | null
B has no “3” or “5”, so field2 is null in those rows.
Right Join
• In a Right join of dataset A to B:
  • Returns all records from the right B, and the matched records from A
  • The result is NULL from A, if there is no match.
A:
_id | field1
1   | 34
2   | 56
3   | 123
4   | 56
5   | 79
B:
_id | field2
1   | red
2   | green
4   | blue
6   | black
A right join B =
_id | field2 | field1
1   | red    | 34
2   | green  | 56
4   | blue   | 56
6   | black  | null
A has no “6”, so field1 is null in that row; “3” and “5” from A are unused.
Inner Join
• In an Inner join of dataset A to B:
  • Returns only records from the left A that match records from B
  • If there is no match between A and B, the record is ignored
A:
_id | field1
1   | 34
2   | 56
3   | 123
4   | 56
5   | 79
B:
_id | field2
1   | red
2   | green
4   | blue
6   | black
A inner join B =
_id | field1 | field2
1   | 34     | red
2   | 56     | green
4   | 56     | blue
“3” and “5” from A are unused; “6” from B is unused.
Full Outer Join
• In a Full join of dataset A to B:
  • Returns all records from the left A, and all records from B
  • If there is no match in either A or B, the field is null
A:
_id | field1
1   | 34
2   | 56
3   | 123
4   | 56
5   | 79
B:
_id | field2
1   | red
2   | green
4   | blue
6   | black
A full join B =
_id | field1 | field2
1   | 34     | red
2   | 56     | green
3   | 123    | null
4   | 56     | blue
5   | 79     | null
6   | null   | black
A has no “6”; B has no “3” or “5”.
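The four join types differ only in which unmatched rows they keep. The following is a minimal sketch in plain Python, assuming rows are dictionaries that share an `_id` key; the `join` helper and its `how` parameter are names invented for this illustration, not part of any library.

```python
def join(a, b, how="left"):
    """Join two lists of dict rows on the shared "_id" field.

    how is one of "left", "right", "inner" or "full".
    """
    a_fields = [k for k in a[0] if k != "_id"]
    b_fields = [k for k in b[0] if k != "_id"]
    b_by_id = {}
    for row in b:
        b_by_id.setdefault(row["_id"], []).append(row)
    a_ids = {row["_id"] for row in a}

    out = []
    for row in a:
        matches = b_by_id.get(row["_id"], [])
        if matches:
            # Matched rows appear in every join type.
            out.extend({**row, **m} for m in matches)
        elif how in ("left", "full"):
            # Unmatched A rows are kept with nulls for B's fields.
            out.append({**row, **{f: None for f in b_fields}})
    if how in ("right", "full"):
        # Unmatched B rows are kept with nulls for A's fields.
        for row in b:
            if row["_id"] not in a_ids:
                out.append({"_id": row["_id"], **{f: None for f in a_fields}, **row})
    return sorted(out, key=lambda r: r["_id"])


A = [{"_id": i, "field1": v} for i, v in [(1, 34), (2, 56), (3, 123), (4, 56), (5, 79)]]
B = [{"_id": i, "field2": v} for i, v in [(1, "red"), (2, "green"), (4, "blue"), (6, "black")]]

for how in ("left", "right", "inner", "full"):
    print(how, [r["_id"] for r in join(A, B, how)])
```

On datasets of real size a database or dataframe library would do this work; the point here is only to make the row-keeping rules explicit.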
Joins with non-unique IDs
Consider a full join of A with B where B has non-unique _id entries: the matching row of A is repeated once for every match in B.
A:
_id | field1
1   | 34
2   | 56
3   | 123
4   | 56
5   | 79
B:
_id | field2
1   | red
2   | green
4   | blue
6   | black
4   | green
A full join B =
_id | field1 | field2
1   | 34     | red
2   | 56     | green
3   | 123    | null
4   | 56     | blue
4   | 56     | green
5   | 79     | null
6   | null   | black
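Row duplication from non-unique keys happens with any join type, so it is worth checking the keys before joining. The sketch below is an illustration only: it uses the A and B tables from this slide as tuples and performs a left join, so `_id` 6 from B does not appear in the output.

```python
from collections import Counter

A = [(1, 34), (2, 56), (3, 123), (4, 56), (5, 79)]
B = [(1, "red"), (2, "green"), (4, "blue"), (6, "black"), (4, "green")]

# Detect non-unique join keys in B before joining.
duplicate_ids = sorted(k for k, n in Counter(_id for _id, _ in B).items() if n > 1)

# Left join: each A row is emitted once per matching B row,
# or once with None when nothing matches.
joined = []
for _id, field1 in A:
    matches = [field2 for b_id, field2 in B if b_id == _id]
    for field2 in matches or [None]:
        joined.append((_id, field1, field2))

print(duplicate_ids)          # keys that will fan out
print(len(A), len(joined))    # the join has more rows than A
```

If one row per instance is required, B would first have to be aggregated to one row per `_id`.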
Joins
Re-Define the Goal
• Predict rating: Score from 0 to 100
  • This is a regression problem
• Based on business profile:
  • Description: kitchen, restaurant, etc.
  • Location: zip code, latitude, longitude
  • Number of violations, text of violations
• We need to clean-up the violations field:
  • remove the “[ violation fixed…]” strings
Data Cleaning
Homogenize missing values and different types in the same feature, fix input errors, correct semantic issues, types, etc.
Original data:
Name          | Date       | Duration (s) | Genre   | Plays
Highway star  | 1984-05-24 | -            | Rock    | 139
Blues alive   | 1990/03/01 | 281          | Blues   | 239
Lonely planet | 2002-11-19 | 5:32s        | Techno  | 42
Dance, dance  | 02/23/1983 | 312          | Disco   | N/A
The wall      | 1943-01-20 | 218          | Reagge  | 83
Offside down  | 1965-02-19 | 4 minutes    | Techno  | 895
The alchemist | 2001-11-21 | 418          | Bluesss | 178
Bring me down | 18-10-98   | 328          | Classic | 21
The scarecrow | 1994-10-12 | 269          | Rock    | 734
Cleaned data:
Name          | Date       | Duration (s) | Genre   | Plays
Highway star  | 1984-05-24 |              | Rock    | 139
Blues alive   | 1990-03-01 | 281          | Blues   | 239
Lonely planet | 2002-11-19 | 332          | Techno  | 42
Dance, dance  | 1983-02-23 | 312          | Disco   |
The wall      | 1943-01-20 | 218          | Reggae  | 83
Offside down  | 1965-02-19 | 240          | Techno  | 895
The alchemist | 2001-11-21 | 418          | Blues   | 178
Bring me down | 1998-10-18 | 328          | Classic | 21
The scarecrow | 1994-10-12 | 269          | Rock    | 734
Removing the “[ date violation corrected: …” suffix from the violation descriptions in MySQL:
update violations set description = substr(description, 1, instr(description, ' [ date violation corrected:') - 1) where instr(description, ' [ date violation corrected:') > 0;
The same clean-up as a Flatline expression:
(replace (field "description") "\\[.*" "")
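The clean-up rules behind the table above can be written down as small normalizing functions. This is a sketch under assumptions: the list of accepted date formats, the set of missing-value markers and the function names are choices made for this example, and the sample violation string is hypothetical.

```python
import re
from datetime import datetime

# Date layouts seen in the "Original data" table (assumed to be exhaustive here).
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y", "%d-%m-%y")
MISSING = {"", "-", "N/A"}


def clean_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def clean_duration(text):
    """Return the duration in seconds, or None for a missing value."""
    text = text.strip()
    if text in MISSING:
        return None
    m = re.fullmatch(r"(\d+):(\d+)s?", text)      # e.g. "5:32s"
    if m:
        return int(m.group(1)) * 60 + int(m.group(2))
    m = re.fullmatch(r"(\d+) minutes?", text)     # e.g. "4 minutes"
    if m:
        return int(m.group(1)) * 60
    return int(text)                              # already plain seconds


def clean_violation(description):
    """Drop everything from the first "[" onwards, plus the space before it."""
    return re.sub(r"\s*\[.*", "", description)


print(clean_date("02/23/1983"), clean_duration("5:32s"))
print(clean_violation("Unclean floors [ date violation corrected: 1/1/2014 ]"))
```

Ambiguous dates such as 03/04/2015 cannot be resolved by format matching alone; the order of `DATE_FORMATS` decides, which is itself an assumption about the data.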
Data Cleaning
Re-Define the Goal
• Predict rating: Poor / Needs Improvement / Adequate / Good
  • This is a classification problem
• Based on business profile:
  • Description: kitchen, restaurant, etc.
  • Location: zip code, latitude, longitude
  • Number of violations, text of violations
• We need to clean-up the violations field:
  • remove the “[ violation fixed…]” strings
• Need to aggregate the violations:
Aggregations
violations:
business_id | date     | description
10          | 20140114 | Unclean or degraded floors walls or ceilings
10          | 20140114 | Inadequate and inaccessible handwashing facilities
19          | 20160513 | Unclean or degraded floors walls or ceilings
19          | 20160513 | Food safety certificate or food handler card not available
19          | 20160513 | Unapproved or unmaintained equipment or utensils
19          | 20141110 | Inadequate food safety knowledge or lack of certified food safety manager
19          | 20141110 | Improper storage of equipment utensils or linens
19          | 20140214 | Permit license or inspection report not posted
19          | 20140214 | Inadequately cleaned or sanitized food contact surfaces
24          | 20161005 | Unclean or degraded floors walls or ceilings
24          | 20160311 | Unclean or degraded floors walls or ceilings
Tabular Data (rows and columns), where rows are instances:
• Each row
  • is one instance
  • contains all the information about that one instance.
Aggregation: Count
Content       | Genre   | Duration | Play Time           | User    | Device
Highway star  | Rock    | 190      | 2015-05-12 16:29:33 | User001 | TV
Blues alive   | Blues   | 281      | 2015-05-13 12:31:21 | User005 | Tablet
Lonely planet | Techno  | 332      | 2015-05-13 14:26:04 | User003 | TV
Dance, dance  | Disco   | 312      | 2015-05-13 18:12:45 | User001 | Tablet
The wall      | Reggae  | 218      | 2015-05-14 09:02:55 | User002 |
Offside down  | Techno  | 240      | 2015-05-14 11:26:32 | User005 | Tablet
The alchemist | Blues   | 418      | 2015-05-14 21:44:15 | User003 | TV
Bring me down | Classic | 328      | 2015-05-15 06:59:56 | User001 | Tablet
Count on User: number of playbacks per user
User    | Count
User001 | 3
User005 | 2
User003 | 2
User002 | 1
Aggregation: Count Distinct
Content       | Genre   | Duration | Play Time           | User    | Device
Highway star  | Rock    | 190      | 2015-05-12 16:29:33 | User001 | TV
Blues alive   | Blues   | 281      | 2015-05-13 12:31:21 | User005 | Tablet
Lonely planet | Techno  | 332      | 2015-05-13 14:26:04 | User003 | TV
Dance, dance  | Disco   | 312      | 2015-05-13 18:12:45 | User001 | Tablet
The wall      | Reggae  | 218      | 2015-05-14 09:02:55 | User002 |
Offside down  | Techno  | 240      | 2015-05-14 11:26:32 | User005 | Tablet
The alchemist | Blues   | 418      | 2015-05-14 21:44:15 | User003 | TV
Bring me down | Classic | 328      | 2015-05-15 06:59:56 | User001 | Tablet
Count distinct Genre on User: number of distinct Genre played per user
User    | Distinct Genre
User001 | 3
User005 | 2
User003 | 2
User002 | 1
Aggregation: Count Missing
Content       | Genre   | Duration | Play Time           | User    | Device
Highway star  | Rock    | 190      | 2015-05-12 16:29:33 | User001 | TV
Blues alive   | Blues   | 281      | 2015-05-13 12:31:21 | User005 | Tablet
Lonely planet | Techno  | 332      | 2015-05-13 14:26:04 | User003 | TV
Dance, dance  | Disco   | 312      | 2015-05-13 18:12:45 | User001 | Tablet
The wall      | Reggae  | 218      | 2015-05-14 09:02:55 | User002 |
Offside down  | Techno  | 240      | 2015-05-14 11:26:32 | User005 | Tablet
The alchemist | Blues   | 418      | 2015-05-14 21:44:15 | User003 | TV
Bring me down | Classic | 328      | 2015-05-15 06:59:56 | User001 | Tablet
Count missing Device on User: number of missing Device per user
User    | Missing Device
User001 | 0
User005 | 0
User003 | 0
User002 | 1
Aggregation: Sum
Content       | Genre   | Duration | Play Time           | User    | Device
Highway star  | Rock    | 190      | 2015-05-12 16:29:33 | User001 | TV
Blues alive   | Blues   | 281      | 2015-05-13 12:31:21 | User005 | Tablet
Lonely planet | Techno  | 332      | 2015-05-13 14:26:04 | User003 | TV
Dance, dance  | Disco   | 312      | 2015-05-13 18:12:45 | User001 | Tablet
The wall      | Reggae  | 218      | 2015-05-14 09:02:55 | User002 |
Offside down  | Techno  | 240      | 2015-05-14 11:26:32 | User005 | Tablet
The alchemist | Blues   | 418      | 2015-05-14 21:44:15 | User003 | TV
Bring me down | Classic | 328      | 2015-05-15 06:59:56 | User001 | Tablet
Sum Duration on User: total Duration per User
User    | Sum Duration
User001 | 830
User005 | 521
User003 | 750
User002 | 218
Aggregation: Average
Content       | Genre   | Duration | Play Time           | User    | Device
Highway star  | Rock    | 190      | 2015-05-12 16:29:33 | User001 | TV
Blues alive   | Blues   | 281      | 2015-05-13 12:31:21 | User005 | Tablet
Lonely planet | Techno  | 332      | 2015-05-13 14:26:04 | User003 | TV
Dance, dance  | Disco   | 312      | 2015-05-13 18:12:45 | User001 | Tablet
The wall      | Reggae  | 218      | 2015-05-14 09:02:55 | User002 |
Offside down  | Techno  | 240      | 2015-05-14 11:26:32 | User005 | Tablet
The alchemist | Blues   | 418      | 2015-05-14 21:44:15 | User003 | TV
Bring me down | Classic | 328      | 2015-05-15 06:59:56 | User001 | Tablet
Average Duration on User: average Duration per User
User    | Average Duration
User001 | 276.67
User005 | 260.50
User003 | 375.00
User002 | 218.00
Aggregation: Maximum
Content       | Genre   | Duration | Play Time           | User    | Device
Highway star  | Rock    | 190      | 2015-05-12 16:29:33 | User001 | TV
Blues alive   | Blues   | 281      | 2015-05-13 12:31:21 | User005 | Tablet
Lonely planet | Techno  | 332      | 2015-05-13 14:26:04 | User003 | TV
Dance, dance  | Disco   | 312      | 2015-05-13 18:12:45 | User001 | Tablet
The wall      | Reggae  | 218      | 2015-05-14 09:02:55 | User002 |
Offside down  | Techno  | 240      | 2015-05-14 11:26:32 | User005 | Tablet
The alchemist | Blues   | 418      | 2015-05-14 21:44:15 | User003 | TV
Bring me down | Classic | 328      | 2015-05-15 06:59:56 | User001 | Tablet
Maximum Duration on User: maximum Duration per User
User    | Max Duration
User001 | 328
User005 | 281
User003 | 418
User002 | 218
Aggregation: Minimum
Content       | Genre   | Duration | Play Time           | User    | Device
Highway star  | Rock    | 190      | 2015-05-12 16:29:33 | User001 | TV
Blues alive   | Blues   | 281      | 2015-05-13 12:31:21 | User005 | Tablet
Lonely planet | Techno  | 332      | 2015-05-13 14:26:04 | User003 | TV
Dance, dance  | Disco   | 312      | 2015-05-13 18:12:45 | User001 | Tablet
The wall      | Reggae  | 218      | 2015-05-14 09:02:55 | User002 |
Offside down  | Techno  | 240      | 2015-05-14 11:26:32 | User005 | Tablet
The alchemist | Blues   | 418      | 2015-05-14 21:44:15 | User003 | TV
Bring me down | Classic | 328      | 2015-05-15 06:59:56 | User001 | Tablet
Minimum Duration on User: minimum Duration per User
User    | Min Duration
User001 | 190
User005 | 240
User003 | 332
User002 | 218
• Similar for standard deviation and variance
• Possible to combine multiple aggregations on the same field
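All of the aggregations above follow the same pattern: group the rows by User, then reduce each group. The sketch below computes them in one pass over the playback table; it is an illustration in plain Python, with the Play Time column left out and `None` standing in for the missing Device, both choices made for this example.

```python
from collections import defaultdict
from statistics import mean

# (content, genre, duration, user, device); None marks a missing device.
plays = [
    ("Highway star", "Rock", 190, "User001", "TV"),
    ("Blues alive", "Blues", 281, "User005", "Tablet"),
    ("Lonely planet", "Techno", 332, "User003", "TV"),
    ("Dance, dance", "Disco", 312, "User001", "Tablet"),
    ("The wall", "Reggae", 218, "User002", None),
    ("Offside down", "Techno", 240, "User005", "Tablet"),
    ("The alchemist", "Blues", 418, "User003", "TV"),
    ("Bring me down", "Classic", 328, "User001", "Tablet"),
]

# Group the rows by user.
groups = defaultdict(list)
for row in plays:
    groups[row[3]].append(row)

# Reduce each group with several aggregations at once.
summary = {}
for user, rows in groups.items():
    durations = [r[2] for r in rows]
    summary[user] = {
        "count": len(rows),
        "distinct_genre": len({r[1] for r in rows}),
        "missing_device": sum(r[4] is None for r in rows),
        "sum": sum(durations),
        "avg": round(mean(durations), 2),
        "max": max(durations),
        "min": min(durations),
    }

for user in sorted(summary):
    print(user, summary[user])
```

Each key of `summary[user]` becomes one new feature column, which is how several aggregations on the same field are combined into a single row per user.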
Pivoting
Different values of a feature are pivoted to new columns in the result dataset.
Original data:
Content       | Genre   | Duration | Play Time           | User    | Device
Highway star  | Rock    | 190      | 2015-05-12 16:29:33 | User001 | TV
Blues alive   | Blues   | 281      | 2015-05-13 12:31:21 | User005 | Tablet
Lonely planet | Techno  | 332      | 2015-05-13 14:26:04 | User003 | TV
Dance, dance  | Disco   | 312      | 2015-05-13 18:12:45 | User001 | Tablet
The wall      | Reagge  | 218      | 2015-05-14 09:02:55 | User002 | Smartphone
Offside down  | Techno  | 240      | 2015-05-14 11:26:32 | User005 | Tablet
The alchemist | Blues   | 418      | 2015-05-14 21:44:15 | User003 | TV
Bring me down | Classic | 328      | 2015-05-15 06:59:56 | User001 | Tablet
The scarecrow | Rock    | 269      | 2015-05-15 12:37:05 | User003 | Smartphone
Aggregated data with pivoted columns (NP = number of playbacks, TT = total time):
User    | Num.Playbacks | Total Time | Pref.Device | NP_TV | NP_Tablet | NP_Smartphone | TT_TV | TT_Tablet | TT_Smartphone
User001 | 3             | 830        | Tablet      | 1     | 2         | 0             | 190   | 640       | 0
User002 | 1             | 218        | Smartphone  | 0     | 0         | 1             | 0     | 0         | 218
User003 | 3             | 1019       | TV          | 2     | 0         | 1             | 750   | 0         | 269
User005 | 2             | 521        | Tablet      | 0     | 2         | 0             | 0     | 521       | 0
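Pivoting can be seen as an aggregation keyed by two fields (User and Device), whose second key is then spread into columns. The sketch below rebuilds the pivoted table from the playback rows. It assumes that the preferred device is the one with the most playbacks; that definition and the lower-case column names are assumptions of this example.

```python
from collections import Counter, defaultdict

plays = [  # (user, device, duration) taken from the playback table
    ("User001", "TV", 190), ("User005", "Tablet", 281), ("User003", "TV", 332),
    ("User001", "Tablet", 312), ("User002", "Smartphone", 218),
    ("User005", "Tablet", 240), ("User003", "TV", 418),
    ("User001", "Tablet", 328), ("User003", "Smartphone", 269),
]
devices = ["TV", "Tablet", "Smartphone"]

counts = defaultdict(Counter)   # user -> device -> number of playbacks
totals = defaultdict(Counter)   # user -> device -> total duration
for user, device, duration in plays:
    counts[user][device] += 1
    totals[user][device] += duration

pivoted = {}
for user in sorted(counts):
    row = {
        "num_playbacks": sum(counts[user].values()),
        "total_time": sum(totals[user].values()),
        "pref_device": counts[user].most_common(1)[0][0],
    }
    # One NP_ and one TT_ column per device value; absent devices count as 0.
    for d in devices:
        row[f"NP_{d}"] = counts[user][d]
        row[f"TT_{d}"] = totals[user][d]
    pivoted[user] = row

for user, row in pivoted.items():
    print(user, row)
```

The number of new columns grows with the number of distinct values of the pivoted field, so pivoting is most practical for low-cardinality categorical fields.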
Re-Define the Goal
• Predict rating: Poor / Needs Improvement / Adequate / Good
  • This is a classification problem
• Based on business profile:
  • Description: kitchen, restaurant, etc.
  • Location: zip code, latitude, longitude
  • Number of violations, text of violations
• We need to clean-up the violations field:
  • remove the “[ violation fixed…]” strings
• Need to aggregate the violations:
  • number of violations
  • concatenate violation descriptions
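The two aggregations named above (a count and a concatenation per business) can be sketched as follows. The rows are a subset of the violations table shown earlier; the output field names and the ". " separator are assumptions made for this example.

```python
from collections import defaultdict

violations = [  # (business_id, date, description)
    (10, "20140114", "Unclean or degraded floors walls or ceilings"),
    (10, "20140114", "Inadequate and inaccessible handwashing facilities"),
    (24, "20161005", "Unclean or degraded floors walls or ceilings"),
    (24, "20160311", "Unclean or degraded floors walls or ceilings"),
]

# Collect the descriptions of each business.
by_business = defaultdict(list)
for business_id, _date, description in violations:
    by_business[business_id].append(description)

# One row per business: number of violations and their concatenated text.
aggregated = {
    business_id: {
        "violation_count": len(texts),
        "violation_text": ". ".join(texts),
    }
    for business_id, texts in by_business.items()
}

for business_id, row in aggregated.items():
    print(business_id, row["violation_count"], row["violation_text"])
```

After this step the violations are one row per business and can be joined to the denormalized scores table on business_id.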
Aggregations
Time Windows
Create new features using values over different periods of time
Figure: the Instances × Features (millions) table is sliced along a Time axis, giving one new feature per period: score_2013, score_2014, score_2015.
create table scores_2013 select a.business_id, a.score as score_2013, a.idate as idate_2013 from inspections as a join (select business_id, max(idate) as idate from inspections where substr(idate,1,4) = "2013" group by business_id) as b where a.business_id = b.business_id and a.idate = b.idate;
create table scores_over_time select * from businesses left join scores_2013 using (business_id) left join scores_2014 using (business_id);
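The per-year tables can also be built in one pass: keep the latest inspection of each business within each year, then spread the years into columns. The sketch below is an illustration in plain Python with hypothetical inspection rows; dates are `YYYYMMDD` strings as in the inspections table, so string comparison orders them correctly.

```python
inspections = [  # (business_id, score, idate) -- hypothetical rows
    (10, 92, "20130310"), (10, 88, "20131105"), (10, 94, "20140114"),
    (19, 74, "20141110"), (19, 68, "20150513"),
]
years = ["2013", "2014", "2015"]

# (business_id, year) -> (idate, score) of the latest inspection in that year.
latest = {}
for business_id, score, idate in inspections:
    key = (business_id, idate[:4])
    if key not in latest or idate > latest[key][0]:
        latest[key] = (idate, score)

# One row per business with one score column per time window.
scores_over_time = {}
for business_id in sorted({b for b, _, _ in inspections}):
    scores_over_time[business_id] = {
        f"score_{y}": latest.get((business_id, y), (None, None))[1] for y in years
    }

for business_id, row in scores_over_time.items():
    print(business_id, row)
```

A business with no inspection in a given year gets a missing value for that window, mirroring what the left joins produce in the SQL version.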
Time Windows
Updates
Need a current view of the data, but new data only comes in batches of changes
Figure: batches of changes for day 1, day 2 and day 3 are applied to the Instances × Features table.
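One way to picture this is a snapshot that is rebuilt by replaying each day's batch in order. The sketch below is only an illustration: the change format (`"upsert"` or `"delete"`, a key, and the changed fields), the `apply_batch` helper and the sample batches are all assumptions, since the slide does not specify how changes are encoded.

```python
def apply_batch(current, batch):
    """Return a new snapshot after applying one batch of changes.

    Each change is (op, key, fields) with op either "upsert" or "delete".
    This change format is an assumption made for the example.
    """
    snapshot = {key: dict(fields) for key, fields in current.items()}
    for op, key, fields in batch:
        if op == "upsert":
            # Insert a new instance or overwrite the changed fields.
            snapshot.setdefault(key, {}).update(fields)
        elif op == "delete":
            snapshot.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return snapshot


# Hypothetical daily batches keyed by business_id.
day1 = [("upsert", 10, {"score": 82}), ("upsert", 19, {"score": 74})]
day2 = [("upsert", 10, {"score": 94}), ("upsert", 24, {"score": 90})]
day3 = [("delete", 19, None)]

view = {}
for batch in (day1, day2, day3):
    view = apply_batch(view, batch)

print(view)
```

Batches must be applied in arrival order; replaying them out of order would leave stale values in the current view.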
Streaming
Data only comes in single changes
Figure labels: data stream (kafka, etc); Stream; Batch; Instances; Features
Time Series Transformations
Independent Data
Color  | Mass | PPAP
red    | 11   | pen
green  | 45   | apple
red    | 53   | apple
yellow | 0    | pen
blue   | 2    | pen
green  | 422  | pineapple
yellow | 555  | pineapple
blue   | 7    | pen
Discovering patterns:
• Color = “red” ⇒ Mass < 100
• PPAP = “pineapple” ⇒ Color ≠ “blue”
• Color = “blue” ⇒ PPAP = “pen”
Independent Data
Color  | Mass | PPAP
green  | 45   | apple
blue   | 2    | pen
green  | 422  | pineapple
blue   | 7    | pen
yellow | 0    | pen
yellow | 555  | pineapple
red    | 53   | apple
red    | 11   | pen
Patterns still hold when rows are re-arranged:
• Color = “red” ⇒ Mass < 100
• PPAP = “pineapple” ⇒ Color ≠ “blue”
• Color = “blue” ⇒ PPAP = “pen”
Dependent Data
Year | Pineapple Harvest (tons)
1986 | 50.74
1987 | 22.03
1988 | 50.69
1989 | 40.38
1990 | 29.80
1991 | 9.90
1992 | 73.93
1993 | 22.95
1994 | 139.09
1995 | 115.17
1996 | 193.88
1997 | 175.31
1998 | 223.41
1999 | 295.03
2000 | 450.53
Figure: Pineapple Harvest in tons (0 to 500) by year (1986 to 2000), annotated with Trend and Error.
Dependent Data
Figure: Pineapple Harvest in tons (0 to 500) by year (1986 to 2000), plotted from the re-arranged values below.
Year | Pineapple Harvest (tons)
1986 | 139.09
1987 | 175.31
1988 | 9.91
1989 | 22.95
1990 | 450.53
1991 | 73.93
1992 | 40.38
1993 | 22.03
1994 | 295.03
1995 | 50.74
1996 | 29.80
1997 | 223.41
1998 | 115.17
1999 | 193.88
2000 | 50.69
Rearranging Disrupts Patterns
Transformations need to preserve sorting
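Any transformation that compares a value with its predecessor depends on the rows being in time order. The sketch below computes a year-over-year difference on the pineapple harvest figures and restores chronological order first. It assumes every row still carries its Year, which is what makes the order recoverable; the `year_over_year` helper is a name invented for this example.

```python
import random

harvest = [  # (year, tons) from the original, correctly ordered table
    (1986, 50.74), (1987, 22.03), (1988, 50.69), (1989, 40.38), (1990, 29.80),
    (1991, 9.90), (1992, 73.93), (1993, 22.95), (1994, 139.09), (1995, 115.17),
    (1996, 193.88), (1997, 175.31), (1998, 223.41), (1999, 295.03), (2000, 450.53),
]


def year_over_year(rows):
    """Difference between each value and the previous year's value."""
    ordered = sorted(rows)  # restore chronological order by year
    return [
        (year, round(value - prev_value, 2))
        for (_, prev_value), (year, value) in zip(ordered, ordered[1:])
    ]


# Shuffle the row order (seeded so the run is repeatable).
shuffled = harvest[:]
random.Random(0).shuffle(shuffled)

# Sorting by year inside the transformation makes it robust to row order.
print(year_over_year(shuffled) == year_over_year(harvest))
print(year_over_year(harvest)[:3])
```

If the values were detached from their years, as in the re-arranged table, no sort could recover the trend.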
Calendar Correction
• Time Series data can show variations due to aggregation
• For example: “pounds/month” produced
• Transform: pounds/month ÷ days/month = pounds/day
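The transform above can be sketched with the standard calendar module. The production figure of 620 pounds per month is hypothetical, as is the `per_day` helper name; the point is that equal monthly totals turn into different daily rates once month length is taken into account.

```python
import calendar


def per_day(monthly_total, year, month):
    """Convert a monthly total into a daily rate using the month's length."""
    days_in_month = calendar.monthrange(year, month)[1]
    return monthly_total / days_in_month


# Hypothetical: the same 620 pounds produced in February and in March 2018.
feb = per_day(620, 2018, 2)   # February 2018 has 28 days
mar = per_day(620, 2018, 3)   # March has 31 days

print(round(feb, 2), round(mar, 2))
```

`calendar.monthrange` also accounts for leap years, so February 2020 would be divided by 29.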
Calendar Correction
Summary
• Data is usually awful
  • Requires clean-up
  • Transformations
• Enormous part of the effort in applying ML
• Techniques:
  • Denormalizing
  • Aggregating / Pivoting
  • Time windows / Streaming
  • Calendar corrections
MLSD18. Basic Transformations - BigML

More Related Content

PDF
MLSD18. Unsupervised Workshop
PDF
MLSD18. End-to-End Machine Learning
PDF
MLSD18. OptiML and Fusions
PDF
MLSD18. Data Cleaning
PDF
MLSD18. Real World Use Case II
PDF
MLSD18. Real-World Use Case I
PDF
MLSD18. Ensembles, Logistic Regression, Deepnets
PDF
MLSD18. Supervised Summary
MLSD18. Unsupervised Workshop
MLSD18. End-to-End Machine Learning
MLSD18. OptiML and Fusions
MLSD18. Data Cleaning
MLSD18. Real World Use Case II
MLSD18. Real-World Use Case I
MLSD18. Ensembles, Logistic Regression, Deepnets
MLSD18. Supervised Summary

What's hot (20)

PDF
MLSD18. Machine Learning Research at QCRI
PDF
MLSD18. Summary of Morning Sessions
PDF
MLSD18 Evaluations
PDF
MLSD18. Basic Transformations - QCRI
PDF
MLSD18. Supervised Workshop
PDF
MLSD18. Feature Engineering
PDF
MLSD18. Automating Machine Learning Workflows
PDF
BSSML17 - API and WhizzML
PDF
VSSML18. Feature Engineering
PDF
Connected datalondon metadata-driven apps
PDF
Graph Gurus Episode 37: Modeling for Kaggle COVID-19 Dataset
PDF
Building, and communicating, a knowledge graph in Zalando
PDF
Web UI, Algorithms, and Feature Engineering
PDF
FROM DATAFRAMES TO GRAPH Data Science with pyTigerGraph
PDF
BigML Summer 2017 Release
PDF
BSSML16 L8. REST API, Bindings, and Basic Workflows
PDF
BigML Release: PCA
PDF
schema.org, Linked Data's Gateway Drug
PDF
BSSML16 L10. Summary Day 2 Sessions
PDF
TigerGraph.js
MLSD18. Machine Learning Research at QCRI
MLSD18. Summary of Morning Sessions
MLSD18 Evaluations
MLSD18. Basic Transformations - QCRI
MLSD18. Supervised Workshop
MLSD18. Feature Engineering
MLSD18. Automating Machine Learning Workflows
BSSML17 - API and WhizzML
VSSML18. Feature Engineering
Connected datalondon metadata-driven apps
Graph Gurus Episode 37: Modeling for Kaggle COVID-19 Dataset
Building, and communicating, a knowledge graph in Zalando
Web UI, Algorithms, and Feature Engineering
FROM DATAFRAMES TO GRAPH Data Science with pyTigerGraph
BigML Summer 2017 Release
BSSML16 L8. REST API, Bindings, and Basic Workflows
BigML Release: PCA
schema.org, Linked Data's Gateway Drug
BSSML16 L10. Summary Day 2 Sessions
TigerGraph.js
Ad

Similar to MLSD18. Basic Transformations - BigML (20)

PDF
BSSML17 - Basic Data Transformations
PDF
VSSML17 L5. Basic Data Transformations and Feature Engineering
PDF
VSSML18. Data Transformations
PDF
BSSML16 L6. Basic Data Transformations
PDF
VSSML16 L5. Basic Data Transformations
PDF
BSSML16 L7. Feature Engineering
PDF
BigML Release: Data Transformations
PDF
VSSML17 Review. Summary Day 2 Sessions
PDF
VSSML16 LR2. Summary Day 2
PDF
Big Data LDN 2018: TIPS AND TRICKS TO WRANGLE BIG, DIRTY DATA
PDF
BSSML17 - Feature Engineering
PDF
DutchMLSchool. Automating Decision Making
PDF
BigML Fall 2015 Release
PDF
DutchMLSchool. Introduction to Machine Learning with the BigML Platform
PDF
DutchMLSchool 2022 - End-to-End ML
PDF
GDG Cloud Community Day 2022 - Managing data quality in Machine Learning
PDF
BSSML17 - Introduction, Models, Evaluations
PDF
CM NCCU Class1
PDF
DutchMLSchool 2022 - Anomaly Detection
PDF
BSSML16 L3. Clusters and Anomaly Detection
BSSML17 - Basic Data Transformations
VSSML17 L5. Basic Data Transformations and Feature Engineering
VSSML18. Data Transformations
BSSML16 L6. Basic Data Transformations
VSSML16 L5. Basic Data Transformations
BSSML16 L7. Feature Engineering
BigML Release: Data Transformations
VSSML17 Review. Summary Day 2 Sessions
VSSML16 LR2. Summary Day 2
Big Data LDN 2018: TIPS AND TRICKS TO WRANGLE BIG, DIRTY DATA
BSSML17 - Feature Engineering
DutchMLSchool. Automating Decision Making
BigML Fall 2015 Release
DutchMLSchool. Introduction to Machine Learning with the BigML Platform
DutchMLSchool 2022 - End-to-End ML
GDG Cloud Community Day 2022 - Managing data quality in Machine Learning
BSSML17 - Introduction, Models, Evaluations
CM NCCU Class1
DutchMLSchool 2022 - Anomaly Detection
BSSML16 L3. Clusters and Anomaly Detection
Ad

More from BigML, Inc (20)

PDF
Digital Transformation and Process Optimization in Manufacturing
PDF
DutchMLSchool 2022 - Automation
PDF
DutchMLSchool 2022 - ML for AML Compliance
PDF
DutchMLSchool 2022 - Multi Perspective Anomalies
PDF
DutchMLSchool 2022 - My First Anomaly Detector
PDF
DutchMLSchool 2022 - History and Developments in ML
PDF
DutchMLSchool 2022 - A Data-Driven Company
PDF
DutchMLSchool 2022 - ML in the Legal Sector
PDF
DutchMLSchool 2022 - Smart Safe Stadiums
PDF
DutchMLSchool 2022 - Process Optimization in Manufacturing Plants
PDF
DutchMLSchool 2022 - Anomaly Detection at Scale
PDF
DutchMLSchool 2022 - Citizen Development in AI
PDF
Democratizing Object Detection
PDF
BigML Release: Image Processing
PDF
Machine Learning in Retail: Know Your Customers' Customer. See Your Future
PDF
Machine Learning in Retail: ML in the Retail Sector
PDF
ML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot
PDF
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...
PDF
ML in GRC: Cybersecurity versus Governance, Risk Management, and Compliance
PDF
Intelligent Mobility: Machine Learning in the Mobility Industry
Digital Transformation and Process Optimization in Manufacturing
DutchMLSchool 2022 - Automation
DutchMLSchool 2022 - ML for AML Compliance
DutchMLSchool 2022 - Multi Perspective Anomalies
DutchMLSchool 2022 - My First Anomaly Detector
DutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - A Data-Driven Company
DutchMLSchool 2022 - ML in the Legal Sector
DutchMLSchool 2022 - Smart Safe Stadiums
DutchMLSchool 2022 - Process Optimization in Manufacturing Plants
DutchMLSchool 2022 - Anomaly Detection at Scale
DutchMLSchool 2022 - Citizen Development in AI
Democratizing Object Detection
BigML Release: Image Processing
Machine Learning in Retail: Know Your Customers' Customer. See Your Future
Machine Learning in Retail: ML in the Retail Sector
ML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...
ML in GRC: Cybersecurity versus Governance, Risk Management, and Compliance
Intelligent Mobility: Machine Learning in the Mobility Industry

Recently uploaded (20)

PDF
annual-report-2024-2025 original latest.
PDF
Galatica Smart Energy Infrastructure Startup Pitch Deck
PPTX
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
PDF
Clinical guidelines as a resource for EBP(1).pdf
PPT
Miokarditis (Inflamasi pada Otot Jantung)
PPTX
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
PPTX
Supervised vs unsupervised machine learning algorithms
PDF
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
PDF
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
PDF
Foundation of Data Science unit number two notes
PPTX
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPTX
Qualitative Qantitative and Mixed Methods.pptx
PPTX
STUDY DESIGN details- Lt Col Maksud (21).pptx
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PDF
.pdf is not working space design for the following data for the following dat...
PDF
Lecture1 pattern recognition............
PPTX
1_Introduction to advance data techniques.pptx
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
annual-report-2024-2025 original latest.
Galatica Smart Energy Infrastructure Startup Pitch Deck
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
Clinical guidelines as a resource for EBP(1).pdf
Miokarditis (Inflamasi pada Otot Jantung)
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
Supervised vs unsupervised machine learning algorithms
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
Foundation of Data Science unit number two notes
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
Data_Analytics_and_PowerBI_Presentation.pptx
Qualitative Qantitative and Mixed Methods.pptx
STUDY DESIGN details- Lt Col Maksud (21).pptx
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
.pdf is not working space design for the following data for the following dat...
Lecture1 pattern recognition............
1_Introduction to advance data techniques.pptx
168300704-gasification-ppt.pdfhghhhsjsjhsuxush

MLSD18. Basic Transformations - BigML

  • 1. 1st edition November 4-5, 2018 Machine Learning School in Doha
  • 2. BigML, Inc X Basic Transformations Making Data Machine Learning Ready Poul Petersen CIO, BigML
  • 3. BigML, Inc X· @bigmlcom · @QatarComputing · #MLSD18 · In a Perfect World… Q: How does a physicist milk a cow? A: Well, first let us consider a spherical cow... Q: How does a data scientist build a model? A: Well, first let us consider perfectly formatted data…
  • 4. BigML, Inc X· @bigmlcom · @QatarComputing · #MLSD18 · The Dream Source Dataset Model Profit!
  • 5. BigML, Inc X· @bigmlcom · @QatarComputing · #MLSD18 · The Reality CRM Web Accounts Transactions ML Ready?
  • 6. BigML, Inc X· @bigmlcom · @QatarComputing · #MLSD18 · Obstacles • Data Structure • Scattered across systems • Wrong "shape" • Unlabelled data • Data Value • Format: spelling, units • Missing values • Non-optimal correlation • Non-existant correlation • Data Significance • Unwanted: PII, Non-Preferred • Expensive to collect • Insidious: Leakage, obviously correlated Data Transformation Feature Engineering Feature Selection
  • 7. BigML, Inc X· @bigmlcom · @QatarComputing · #MLSD18 · Data Structure
  • 8. BigML, Inc X· @bigmlcom · @QatarComputing · #MLSD18 · Remember ML Tasks? CLASSIFICATION Will this component fail? REGRESSION How many days until this component fails? TIME SERIES FORECASTING How many components will fail in a week from now? CLUSTER ANALYSIS Which machines behave similarly? ANOMALY DETECTION Is this behavior normal? ASSOCIATION DISCOVERY What alerts are triggered together before a failure? What “shape” is the data for each ML task?
  • 9. BigML, Inc X· @bigmlcom · @QatarComputing · #MLSD18 · Classification CategoricalTrainingTesting Predicting
  • 10. BigML, Inc X· @bigmlcom · @QatarComputing · #MLSD18 · Regression NumericTrainingTesting Predicting
  • 11. BigML, Inc X· @bigmlcom · @QatarComputing · #MLSD18 · Time Series NumericTrainingTesting Forecasting Time
  • 12. BigML, Inc X· @bigmlcom · @QatarComputing · #MLSD18 · Anomaly Detection
  • 13. BigML, Inc X· @bigmlcom · @QatarComputing · #MLSD18 · Cluster Analysis
  • 14. BigML, Inc X· @bigmlcom · @QatarComputing · #MLSD18 · Association Discovery
  • 15. BigML, Inc X· @bigmlcom · @QatarComputing · #MLSD18 · Data Labels Unsupervised Learning Supervised Learning • Anomaly Detection • Clustering • Association Discovery • Classification • Regression • Time Series The only difference, in terms of ML-Ready structure is the presence of a "label"
  • 16. BigML, Inc X· @bigmlcom · @QatarComputing · #MLSD18 · ML Ready DataInstances Fields (Features) Tabular Data (rows and columns): • Each row • is one instance • contains all the information about that one instance. • For Time Series, the rows are not independent • Each column • is a field that describes a property of the instance.
  • 17. Data Transformations
  • 18. SF Restaurants Example
      https://data.sfgov.org/Health-and-Social-Services/Restaurant-Scores/stya-26eb
      https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/

      create database sf_restaurants;
      use sf_restaurants;

      create table businesses (business_id int, name varchar(1000), address varchar(1000),
        city varchar(1000), state varchar(100), postal_code varchar(100),
        latitude varchar(100), longitude varchar(100), phone_number varchar(100));
      load data local infile './businesses.csv' into table businesses
        fields terminated by ',' enclosed by '"'
        lines terminated by '\r\n' ignore 1 lines;

      create table inspections (business_id int, score varchar(10), idate varchar(8), itype varchar(100));
      load data local infile './inspections.csv' into table inspections
        fields terminated by ',' enclosed by '"'
        lines terminated by '\r\n' ignore 1 lines;

      create table violations (business_id int, vdate varchar(8), description varchar(1000));
      load data local infile './violations.csv' into table violations
        fields terminated by ',' enclosed by '"'
        lines terminated by '\r\n' ignore 1 lines;

      create table legend (Minimum_Score int, Maximum_Score int, Description varchar(100));
      load data local infile './legend.csv' into table legend
        fields terminated by ',' enclosed by '"'
        lines terminated by '\r\n' ignore 1 lines;
  • 19. SF Restaurants Data
  • 20. Building a ML Application
      • State the problem as an ML task
      • Data wrangling
      • Feature engineering
      • Modeling and Evaluations
      • Predictions
      • Measure Results
      (diagram labels: Data transformations; Task)
  • 21. State the Problem
      • Predict rating: Score from 0 to 100
          • This is a regression problem
      • Based on business profile:
          • Description: kitchen, cafe, etc.
          • Location: zip, latitude, longitude
  • 22. Denormalizing with Joins
      Data is usually normalized in relational databases; ML-Ready datasets need the information denormalized into a single dataset.
      (diagram labels: business, inspections, violations joined into scores; Instances; Features (millions))

      Denormalize:
      create table scores
        select * from businesses left join inspections using (business_id);
      create table scores_last
        select a.* from scores as a
        JOIN (select business_id, max(idate) as idate from scores group by business_id) as b
        where a.business_id = b.business_id and a.idate = b.idate;

      ML-Ready: each row contains all the information about that one instance.

      Add Label:
      create table scores_last_label
        select scores_last.*, Description as score_label
        from scores_last join legend
        on score <= Maximum_Score and score >= Minimum_Score;
  • 23. Joins
      • Datasets to join need to have a field in common
          • joining sales and demographics on customer_id
          • joining employee and budget details on department_id
      • Datasets to join do not need to have the same dimensions
      • Joins can be performed in several ways
          • Left, Right, Inner, Outer…
  • 24. Left Join
      • In a Left join of dataset A to B:
          • Returns all records from the left A, and the matched records from B
          • The result is NULL from B, if there is no match.

      A                 B
      _id  field1       _id  field2
      1    34           1    red
      2    56           2    green
      3    123          4    blue
      4    56           6    black
      5    79

      A left join B
      _id  field1  field2
      1    34      red
      2    56      green
      3    123     null
      4    56      blue
      5    79      null

      B has no “3” or “5”, so field2 is null for those rows.
  • 25. Right Join
      • In a Right join of dataset A to B:
          • Returns all records from the right B, and the matched records from A
          • The result is NULL from A, if there is no match.

      A                 B
      _id  field1       _id  field2
      1    34           1    red
      2    56           2    green
      3    123          4    blue
      4    56           6    black
      5    79

      A right join B
      _id  field2  field1
      1    red     34
      2    green   56
      4    blue    56
      6    black   null

      A has no “6”, so field1 is null for that row; “3” and “5” from A are unused.
  • 26. Inner Join
      • In an Inner join of dataset A to B:
          • Returns only the records from A that match records from B
          • If there is no match between A and B, the record is ignored

      A                 B
      _id  field1       _id  field2
      1    34           1    red
      2    56           2    green
      3    123          4    blue
      4    56           6    black
      5    79

      A inner join B
      _id  field1  field2
      1    34      red
      2    56      green
      4    56      blue

      “3” and “5” from A are unused; “6” from B is unused.
  • 27. Full Outer Join
      • In a Full join of dataset A to B:
          • Returns all records from A and all records from B
          • If a record has no match in either A or B, the missing field is null

      A                 B
      _id  field1       _id  field2
      1    34           1    red
      2    56           2    green
      3    123          4    blue
      4    56           6    black
      5    79

      A full join B
      _id  field1  field2
      1    34      red
      2    56      green
      3    123     null
      4    56      blue
      5    79      null
      6    null    black

      A has no “6”; B has no “3” or “5”.
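The four join types on the preceding slides can be compared side by side with a minimal Python sketch. It assumes each table has unique `_id` values and stores a table as a plain dict; the `join` helper and its `how` argument are illustrative names, not part of any library used in the talk.

```python
def join(a, b, how="inner"):
    """Join two {_id: value} tables; return rows of (_id, field1, field2)."""
    if how == "inner":
        ids = [i for i in a if i in b]
    elif how == "left":
        ids = list(a)
    elif how == "right":
        ids = list(b)
    elif how == "full":
        ids = sorted(set(a) | set(b))
    else:
        raise ValueError(f"unknown join type: {how}")
    # dict.get returns None (the "null" on the slides) when an id is missing.
    return [(i, a.get(i), b.get(i)) for i in ids]


# Tables A and B from the slides.
A = {1: 34, 2: 56, 3: 123, 4: 56, 5: 79}
B = {1: "red", 2: "green", 4: "blue", 6: "black"}

for how in ("left", "right", "inner", "full"):
    print(how, join(A, B, how))
```

Only the set of ids that survives changes between the join types; the per-row lookup is the same in all four cases.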
  • 28. Joins with non-unique IDs
      Consider a full join of A with B where B has non-unique _id entries: every matching pair produces its own row, so _id 4 appears twice in the result.

      A                 B
      _id  field1       _id  field2
      1    34           1    red
      2    56           2    green
      3    123          4    blue
      4    56           6    black
      5    79           4    green

      A full join B
      _id  field1  field2
      1    34      red
      2    56      green
      3    123     null
      4    56      blue
      4    56      green
      5    79      null
      6    null    black
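The row duplication in the result table can be reproduced with a short Python sketch of a full join that allows repeated ids in B. The function name `full_join_rows` is hypothetical, and the sketch assumes ids in A are unique, as they are on the slide.

```python
def full_join_rows(a_rows, b_rows):
    """Full join of (_id, value) lists; ids in b_rows may repeat."""
    b_by_id = {}
    for _id, value in b_rows:
        b_by_id.setdefault(_id, []).append(value)
    a_ids = {_id for _id, _ in a_rows}

    out = []
    for _id, field1 in a_rows:
        # One output row per match in B; a single null row when nothing matches.
        for field2 in b_by_id.get(_id, [None]):
            out.append((_id, field1, field2))
    # Rows of B whose id never appears in A.
    for _id, field2 in b_rows:
        if _id not in a_ids:
            out.append((_id, None, field2))
    # sorted() is stable, so duplicate ids keep their order from B.
    return sorted(out, key=lambda row: row[0])


A = [(1, 34), (2, 56), (3, 123), (4, 56), (5, 79)]
B = [(1, "red"), (2, "green"), (4, "blue"), (6, "black"), (4, "green")]
joined = full_join_rows(A, B)
```

A has five rows but the result has seven, which is why row counts are worth checking after any join on a key that is not guaranteed to be unique.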
  • 29. Joins
  • 30. Re-Define the Goal
      • Predict rating: Score from 0 to 100
          • This is a regression problem
      • Based on business profile:
          • Description: kitchen, restaurant, etc.
          • Location: zip code, latitude, longitude
          • Number of violations, text of violations
      • We need to clean up the violations field:
          • remove the “[ violation fixed…]” strings
  • 31. Data Cleaning
      Homogenize missing values and different types in the same feature, fix input errors, correct semantic issues, types, etc.

      Original data
      Name           Date        Duration (s)  Genre    Plays
      Highway star   1984-05-24  -             Rock     139
      Blues alive    1990/03/01  281           Blues    239
      Lonely planet  2002-11-19  5:32s         Techno   42
      Dance, dance   02/23/1983  312           Disco    N/A
      The wall       1943-01-20  218           Reagge   83
      Offside down   1965-02-19  4 minutes     Techno   895
      The alchemist  2001-11-21  418           Bluesss  178
      Bring me down  18-10-98    328           Classic  21
      The scarecrow  1994-10-12  269           Rock     734

      Cleaned data
      Name           Date        Duration (s)  Genre    Plays
      Highway star   1984-05-24                Rock     139
      Blues alive    1990-03-01  281           Blues    239
      Lonely planet  2002-11-19  332           Techno   42
      Dance, dance   1983-02-23  312           Disco
      The wall       1943-01-20  218           Reagge   83
      Offside down   1965-02-19  240           Techno   895
      The alchemist  2001-11-21  418           Blues    178
      Bring me down  1998-10-18  328           Classic  21
      The scarecrow  1994-10-12  269           Rock     734

      Cleaning the violations field in MySQL:
      update violations
        set description = substr(description, 1, instr(description, ' [ date violation corrected:') - 1)
        where instr(description, ' [ date violation corrected:') > 0;

      The same clean-up as a Flatline expression (the bracket is escaped so it is matched literally):
      (replace (field "description") "\\[.*" "")
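The date and duration fixes in the song table can be sketched in Python as follows. The list of accepted date formats and the helper names `clean_date` and `clean_duration` are assumptions made for this example; in particular, slash-separated dates that start with the month are assumed to be US-style (month/day/year), which real data may not guarantee.

```python
import re
from datetime import datetime

# Formats seen in the example table, tried in order (an assumption of this sketch).
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y", "%d-%m-%y")


def clean_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def clean_duration(text):
    """Return a duration in whole seconds, or None for a missing value."""
    text = text.strip()
    if re.fullmatch(r"\d+", text):              # already in seconds: "281"
        return int(text)
    m = re.fullmatch(r"(\d+):(\d+)s?", text)    # minutes:seconds: "5:32s"
    if m:
        return int(m.group(1)) * 60 + int(m.group(2))
    m = re.fullmatch(r"(\d+)\s*minutes?", text)  # whole minutes: "4 minutes"
    if m:
        return int(m.group(1)) * 60
    return None                                  # "-" and anything unrecognized


print(clean_date("18-10-98"), clean_duration("5:32s"))
```

Returning `None` for unrecognized input keeps missing values explicit instead of guessing, which matches the empty cells in the cleaned table.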
  • 32. Data Cleaning
  • 33. Re-Define the Goal
      • Predict rating: Poor / Needs Improvement / Adequate / Good
          • This is a classification problem
      • Based on business profile:
          • Description: kitchen, restaurant, etc.
          • Location: zip code, latitude, longitude
          • Number of violations, text of violations
      • We need to clean up the violations field:
          • remove the “[ violation fixed…]” strings
      • Need to aggregate the violations:
  • 34. Aggregations
      violations (one row per violation, so several rows per business)
      business_id  date      description
      10           20140114  Unclean or degraded floors walls or ceilings
      10           20140114  Inadequate and inaccessible handwashing facilities
      19           20160513  Unclean or degraded floors walls or ceilings
      19           20160513  Food safety certificate or food handler card not available
      19           20160513  Unapproved or unmaintained equipment or utensils
      19           20141110  Inadequate food safety knowledge or lack of certified food safety manager
      19           20141110  Improper storage of equipment utensils or linens
      19           20140214  Permit license or inspection report not posted
      19           20140214  Inadequately cleaned or sanitized food contact surfaces
      24           20161005  Unclean or degraded floors walls or ceilings
      24           20160311  Unclean or degraded floors walls or ceilings

      Tabular Data (rows and columns):
      • Each row
          • is one instance
          • contains all the information about that one instance.
  • 35. Aggregation: Count
      Content        Genre    Duration  Play Time            User     Device
      Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
      Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
      Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
      Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
      The wall       Reagge   218       2015-05-14 09:02:55  User002
      Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
      The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
      Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet

      Count on User: number of playbacks per user
      User     Count
      User001  3
      User005  2
      User003  2
      User002  1
  • 36. Aggregation: Count Distinct
      Source: the playback table from slide 35.

      Count distinct Genre on User: number of distinct Genre played per user
      User     Distinct Genre
      User001  3
      User005  2
      User003  2
      User002  1
  • 37. Aggregation: Count Missing
      Source: the playback table from slide 35.

      Count missing Device on User: number of missing Device per user
      User     Missing Device
      User001  0
      User005  0
      User003  0
      User002  1
  • 38. Aggregation: Sum
      Source: the playback table from slide 35.

      Sum Duration on User: total Duration per user
      User     Sum Duration
      User001  830
      User005  521
      User003  750
      User002  218
  • 39. Aggregation: Average
      Source: the playback table from slide 35.

      Average Duration on User: average Duration per user
      User     Average Duration
      User001  276.67
      User005  260.50
      User003  375.00
      User002  218.00
  • 40. Aggregation: Maximum
      Source: the playback table from slide 35.

      Maximum Duration on User: maximum Duration per user
      User     Max Duration
      User001  328
      User005  281
      User003  418
      User002  218
  • 41. Aggregation: Minimum
      Source: the playback table from slide 35.

      Minimum Duration on User: minimum Duration per user
      User     Min Duration
      User001  190
      User005  240
      User003  332
      User002  218

      • Similar for standard deviation and variance
      • Possible to combine multiple aggregations on the same field
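The aggregations on the last seven slides can all be computed in one pass per user. The following Python sketch uses the playback table from the slides; the tuple layout, the function name `aggregate_by_user`, and the use of `None` for the missing device are choices made for this example only.

```python
from collections import defaultdict

# (content, genre, duration_s, user, device); None marks a missing device.
PLAYS = [
    ("Highway star", "Rock", 190, "User001", "TV"),
    ("Blues alive", "Blues", 281, "User005", "Tablet"),
    ("Lonely planet", "Techno", 332, "User003", "TV"),
    ("Dance, dance", "Disco", 312, "User001", "Tablet"),
    ("The wall", "Reagge", 218, "User002", None),
    ("Offside down", "Techno", 240, "User005", "Tablet"),
    ("The alchemist", "Blues", 418, "User003", "TV"),
    ("Bring me down", "Classic", 328, "User001", "Tablet"),
]


def aggregate_by_user(rows):
    """Group playbacks by user and compute the aggregations from the slides."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[3]].append(row)

    summary = {}
    for user, plays in groups.items():
        durations = [p[2] for p in plays]
        summary[user] = {
            "count": len(plays),
            "distinct_genre": len({p[1] for p in plays}),
            "missing_device": sum(p[4] is None for p in plays),
            "sum": sum(durations),
            "avg": round(sum(durations) / len(durations), 2),
            "max": max(durations),
            "min": min(durations),
        }
    return summary


summary = aggregate_by_user(PLAYS)
print(summary["User001"])
```

Each user ends up as a single row of features, which is the ML-Ready shape the earlier slides ask for.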
  • 42. Pivoting
      Different values of a feature are pivoted to new columns in the result dataset.

      Original data
      Content        Genre    Duration  Play Time            User     Device
      Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
      Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
      Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
      Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
      The wall       Reagge   218       2015-05-14 09:02:55  User002  Smartphone
      Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
      The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
      Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet
      The scarecrow  Rock     269       2015-05-15 12:37:05  User003  Smartphone

      Aggregated data with pivoted columns (NP = number of playbacks, TT = total time)
      User     Num.Playbacks  Total Time  Pref.Device  NP_TV  NP_Tablet  NP_Smartphone  TT_TV  TT_Tablet  TT_Smartphone
      User001  3              830         Tablet       1      2          0              190    640        0
      User002  1              218         Smartphone   0      0          1              0      0          218
      User003  3              1019        TV           2      0          1              750    0          269
      User005  2              521         Tablet       0      2          0              0      521        0
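The pivoted columns can be produced by aggregating per user and per device value at the same time. The sketch below is a plain-Python illustration using the slide's data; `pivot_by_device` and the fixed `DEVICES` tuple are assumptions of this example rather than anything prescribed by the talk.

```python
from collections import defaultdict

# (user, device, duration_s) taken from the slide's playback table.
PLAYS = [
    ("User001", "TV", 190), ("User005", "Tablet", 281),
    ("User003", "TV", 332), ("User001", "Tablet", 312),
    ("User002", "Smartphone", 218), ("User005", "Tablet", 240),
    ("User003", "TV", 418), ("User001", "Tablet", 328),
    ("User003", "Smartphone", 269),
]
DEVICES = ("TV", "Tablet", "Smartphone")


def pivot_by_device(rows):
    """Return {user: {NP_<device>: playbacks, TT_<device>: total seconds}}."""
    # Every user starts with a zero in each pivoted column.
    table = defaultdict(
        lambda: {f"{prefix}_{d}": 0 for prefix in ("NP", "TT") for d in DEVICES}
    )
    for user, device, duration in rows:
        table[user][f"NP_{device}"] += 1
        table[user][f"TT_{device}"] += duration
    return dict(table)


pivoted = pivot_by_device(PLAYS)
print(pivoted["User003"])
```

Fixing the list of device values up front keeps the column set identical for every user, including users who never used a given device.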
  • 43. Re-Define the Goal
      • Predict rating: Poor / Needs Improvement / Adequate / Good
          • This is a classification problem
      • Based on business profile:
          • Description: kitchen, restaurant, etc.
          • Location: zip code, latitude, longitude
          • Number of violations, text of violations
      • We need to clean up the violations field:
          • remove the “[ violation fixed…]” strings
      • Need to aggregate the violations:
          • number of violations
          • concatenate violation descriptions
  • 44. Aggregations
  • 45. Time Windows
      Create new features using values over different periods of time.
      (diagram labels: Instances; Features; Time; Features (millions); score_2013, score_2014, score_2015)

      create table scores_2013
        select a.business_id, a.score as score_2013, a.idate as idate_2013
        from inspections as a
        JOIN (select business_id, max(idate) as idate from inspections
              where substr(idate,1,4) = "2013" group by business_id) as b
        where a.business_id = b.business_id and a.idate = b.idate;

      create table scores_over_time
        select * from businesses
        left join scores_2013 USING (business_id)
        left join scores_2014 USING (business_id);
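The same time-window idea, keeping each business's latest score within each year and turning the years into columns, can be sketched in Python. The inspection rows below are made up for illustration and `scores_over_time` is a hypothetical helper; only the column naming (`score_2013`, `score_2014`) follows the slide.

```python
# Hypothetical inspections: (business_id, score, idate as YYYYMMDD text).
INSPECTIONS = [
    (10, 92, "20130305"),
    (10, 96, "20131120"),
    (10, 88, "20140610"),
    (19, 74, "20130801"),
    (24, 100, "20140214"),
]


def scores_over_time(inspections, years):
    """Return {business_id: {score_<year>: latest score in that year or None}}."""
    latest = {}  # (business_id, year) -> (idate, score)
    for business_id, score, idate in inspections:
        key = (business_id, idate[:4])
        # YYYYMMDD strings sort chronologically, so string comparison is enough.
        if key not in latest or idate > latest[key][0]:
            latest[key] = (idate, score)

    businesses = sorted({b for b, _, _ in inspections})
    return {
        b: {f"score_{y}": latest.get((b, y), (None, None))[1] for y in years}
        for b in businesses
    }


windows = scores_over_time(INSPECTIONS, ("2013", "2014"))
print(windows)
```

A business with no inspection in a given year gets `None` in that column, mirroring the nulls a left join would produce in the SQL version.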
  • 46. Time Windows
  • 47. Updates
      Need a current view of the data, but new data only comes in batches of changes.
      (diagram labels: day 1, day 2, day 3; Instances; Features)
  • 48. Streaming
      Data only comes in single changes.
      (diagram labels: data stream; Stream; Batch (kafka, etc); Instances; Features)
  • 49. Time Series Transformations
  • 50. Independent Data
      Color   Mass  PPAP
      red     11    pen
      green   45    apple
      red     53    apple
      yellow  0     pen
      blue    2     pen
      green   422   pineapple
      yellow  555   pineapple
      blue    7     pen

      Discovering patterns:
      • Color = “red” ⇒ Mass < 100
      • PPAP = “pineapple” ⇒ Color ≠ “blue”
      • Color = “blue” ⇒ PPAP = “pen”
  • 51. Independent Data
      Color   Mass  PPAP
      green   45    apple
      blue    2     pen
      green   422   pineapple
      blue    7     pen
      yellow  0     pen
      yellow  555   pineapple
      red     53    apple
      red     11    pen

      Patterns still hold when rows are re-arranged:
      • Color = “red” ⇒ Mass < 100
      • PPAP = “pineapple” ⇒ Color ≠ “blue”
      • Color = “blue” ⇒ PPAP = “pen”
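That row order does not matter for independent data can be checked directly. The Python sketch below reads each pattern as an implication (an interpretation assumed here), uses the eight rows from slide 50, and shuffles them with a fixed seed; `patterns_hold` is an illustrative name.

```python
import random

# (color, mass, ppap) rows from the "Independent Data" slide.
ROWS = [
    ("red", 11, "pen"), ("green", 45, "apple"), ("red", 53, "apple"),
    ("yellow", 0, "pen"), ("blue", 2, "pen"), ("green", 422, "pineapple"),
    ("yellow", 555, "pineapple"), ("blue", 7, "pen"),
]


def patterns_hold(rows):
    """Check the three patterns, each read as 'if A then B', on every row."""
    return all(
        (color != "red" or mass < 100)                 # red => mass < 100
        and (ppap != "pineapple" or color != "blue")   # pineapple => not blue
        and (color != "blue" or ppap == "pen")         # blue => pen
        for color, mass, ppap in rows
    )


shuffled = ROWS[:]
random.Random(0).shuffle(shuffled)  # fixed seed keeps the example repeatable
print(patterns_hold(ROWS), patterns_hold(shuffled))
```

Each pattern is evaluated one row at a time, so any permutation of the rows gives the same answer. That is exactly what stops being true for time series data.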
  • 52. Dependent Data
      Year  Pineapple Harvest (tons)
      1986  50.74
      1987  22.03
      1988  50.69
      1989  40.38
      1990  29.80
      1991  9.90
      1992  73.93
      1993  22.95
      1994  139.09
      1995  115.17
      1996  193.88
      1997  175.31
      1998  223.41
      1999  295.03
      2000  450.53

      [Figure: Pineapple Harvest in tons (0 to 500) by Year (1986 to 2000), annotated with Trend and Error]
  • 53. Dependent Data
      Year  Pineapple Harvest (tons)
      1986  139.09
      1987  175.31
      1988  9.90
      1989  22.95
      1990  450.53
      1991  73.93
      1992  40.38
      1993  22.03
      1994  295.03
      1995  50.74
      1996  29.80
      1997  223.41
      1998  115.17
      1999  193.88
      2000  50.69

      [Figure: Pineapple Harvest in tons (0 to 500) by Year (1986 to 2000), with the values re-arranged]

      Rearranging disrupts patterns: transformations need to preserve sorting.
  • 54. Calendar Correction
      • Time Series data can show variations due to aggregation
      • For example: “pounds/month” produced
      • Transform: pounds/month ÷ days/month = pounds/day
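The pounds/month to pounds/day transform is a one-line division once the number of days in each month is known. This Python sketch uses the standard library's `calendar.monthrange`; the production figures and the helper name `per_day` are hypothetical.

```python
import calendar


def per_day(monthly_total, year, month):
    """Divide a monthly total by the number of days in that month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return monthly_total / days_in_month


# Hypothetical production figures, in pounds per month.
feb = per_day(2800, 2018, 2)  # February 2018 has 28 days
mar = per_day(3100, 2018, 3)  # March 2018 has 31 days
print(feb, mar)
```

The raw monthly totals differ by 300 pounds, but the daily rate is identical, so the apparent jump from February to March is purely a calendar effect.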
  • 55. Calendar Correction
  • 56. Summary
      • Data is usually awful
      • Requires clean-up
      • Transformations
      • Enormous part of the effort in applying ML
      • Techniques:
          • Denormalizing
          • Aggregating / Pivoting
          • Time windows / Streaming
          • Calendar corrections