BSSML16 L1. Introduction, Models, and Evaluations

December 8-9, 2016

BigML, Inc
Poul Petersen
CIO, BigML, Inc.
Intro, Models & Evaluation
Getting Started with Machine Learning

Audience Diversity
Expert: Published papers at KDD, ICML, NIPS, etc. or developed own ML algorithms used at large scale.
Aficionado: Understands pros/cons of different techniques and/or can tweak algorithms as needed.
Practitioner: Very familiar with ML packages (Weka, Scikit, R, etc.).
Newbie: Just taking Coursera ML class or reading an introductory book to ML.
Absolute beginner: ML sounds like science fiction.

Building BigML’s Platform
2011: Prototyping and Beta, API-first Approach
2012: Core ML workflow: source, dataset, model, prediction
2013: Evaluations, Batch Predictions, Ensembles, Sunburst
2014: Anomaly Detection, Clusters, Flatline
2015: Association Discovery, Correlations, Samples, Statistical Tests
2016: Scripts, Libraries, Executions, WhizzML, Logistic Regression

BigML Vision
Paving the Path to Automatic Machine Learning
(Diagram: automation increasing over time, 2011 to 2016)
A. Sauron (REST API, Programmable Infrastructure)
• Automatic deployment and auto-scaling
B. Wintermute
• Distributed Machine Learning Framework
C. Flatline (Data Generation and Filtering)
• DSL for transformation and new field generation
D. WhizzML (Workflow Automation)
• DSL for programmable workflows
E. SMACdown (Automatic Model Selection)
• Automatic parameter optimization

BigML Architecture
Layers: Tools, REST API, Web-based Frontend, Visualizations, Distributed Machine Learning Backend, Smart Infrastructure (auto-deployable, auto-scalable)
- https://bigml.com/tools
- https://bigml.com/api
Servers: SOURCE, DATASET, MODEL, PREDICTION, EVALUATION, SAMPLE, WHIZZML
Scaling diagram labels: EVENTS, GEARMAN QUEUE, DESIRED TOPOLOGY, AWS COSTS, RUNQUEUE SCALER, BUSY SCALER, AUTO TOPOLOGY, ACTUAL TOPOLOGY

BigML’s Platform
Data Exploration: SOURCE, DATASET, CORRELATION, STATISTICAL TEST
Supervised Learning: MODEL, ENSEMBLE, LOGISTIC REGRESSION, EVALUATION
Unsupervised Learning: ANOMALY DETECTOR, ASSOCIATION DISCOVERY, CLUSTER
Scoring: PREDICTION, BATCH PREDICTION
Automation: SCRIPT, LIBRARY, EXECUTION

What is ML?
Imagine:
• You are looking to buy a house
• Recently found a house you like
• Is the asking price fair?
What Next?

What is ML?
Why not ask an expert?
• Experts can be rare / expensive
• Hard to validate experience:
  • Experience with similar properties?
  • Do they consider all relevant variables?
  • Knowledge of market up to date?
• Hard to validate answer:
  • How many times expert right / wrong?
  • Probably can’t explain decision in detail
• Humans are not good at intuitive statistics

Human Intuition
Consider the following two cities:
Cloud City: 350 grey and rainy days, 15 sunny days
Sun City: 15 grey and rainy days, 350 sunny days
Question:
Where is the number of sunglasses sold (per-capita) bigger?
Common Intuition:
People in Cloud City never need sunglasses since it’s so cloudy
Did it occur to you:
Sun City sells more sunglasses per-capita than LA

Human Intuition
Imagine Mr. Fernández is selected at random.
Is Mr. Fernández more likely to be a librarian or a farmer?
Did it occur to you that worldwide there are an estimated 1 billion people officially employed in agriculture?
http://www.globalagriculture.org/report-topics/industrial-agriculture-and-small-scale-farming.html

Intuitive Statistics
John and Frank are both practicing litigation law in Madrid and Barcelona.
           John                    Frank
           Wins  Total  Success    Wins  Total  Success
Madrid       81     87     93 %     234    270     87 %
Barcelona   192    263     73 %      55     80     69 %
Trials      273    350     78 %     289    350     83 %
Which attorney will you choose?
Simpson’s Paradox
A trend that appears in different groups of data disappears when these groups are combined, and the reverse trend appears for the aggregate data.
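The reversal in the table can be checked with a few lines of Python. This is a minimal sketch: the win and total counts are copied from the slide, while the dictionary layout and the helper names (`success_rate`, `per_city_rates`, `overall_rate`) are assumptions made for illustration.

```python
# Win/total counts per city, copied from the table on the slide.
records = {
    "John": {"Madrid": (81, 87), "Barcelona": (192, 263)},
    "Frank": {"Madrid": (234, 270), "Barcelona": (55, 80)},
}


def success_rate(wins, total):
    """Fraction of trials won."""
    return wins / total


def per_city_rates(name):
    """Success rate of one attorney in each city."""
    return {city: success_rate(w, t) for city, (w, t) in records[name].items()}


def overall_rate(name):
    """Success rate of one attorney over all trials combined."""
    wins = sum(w for w, _ in records[name].values())
    total = sum(t for _, t in records[name].values())
    return success_rate(wins, total)


for name in records:
    cities = ", ".join(f"{c}: {r:.0%}" for c, r in per_city_rates(name).items())
    print(f"{name}: {cities}, overall: {overall_rate(name):.0%}")
```

John has the higher success rate in each city, yet Frank has the higher rate overall, because most of Frank's trials took place in Madrid, where both attorneys win more often.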

What is ML?
Replace the expert with data?
• Intuition: square footage relates to price.
• Collect data from past sales
PRICE = 125.3*SQFT + 96535
SQFT  SOLD    PREDICT
2424  360000  400262
1785  307500  320195
1003  185000  222211
4135  600000  614651
1676  328500  306538
1012  247000  223339
3352  420000  516541
2825  435350  450508
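As a rough sketch, the PREDICT column can be reproduced by applying the slide's formula to each past sale. The slope and intercept are taken from the slide as given (the deck does not say how they were fitted); the function name `predict_price` and the list `past_sales` are assumptions for illustration.

```python
SLOPE = 125.3      # dollars per square foot, from the slide
INTERCEPT = 96535  # base price in dollars, from the slide


def predict_price(sqft):
    """Apply the linear model PRICE = 125.3*SQFT + 96535."""
    return SLOPE * sqft + INTERCEPT


# (SQFT, SOLD) pairs copied from the slide.
past_sales = [
    (2424, 360000), (1785, 307500), (1003, 185000), (4135, 600000),
    (1676, 328500), (1012, 247000), (3352, 420000), (2825, 435350),
]

for sqft, sold in past_sales:
    predicted = predict_price(sqft)
    print(f"{sqft:>5} sqft  sold {sold:>7}  "
          f"predicted {predicted:>9.1f}  error {predicted - sold:>+10.1f}")
```

Printing the error next to each prediction makes it easy to see where this one-variable model is far off, which motivates collecting more fields than square footage.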

What is ML?
Replace the expert scorecard
• Experts can be rare / expensive
• Hard to validate experience:
• Experience with similar properties?
• Do they consider all relevant variables?
• Knowledge of market up to date?
• Hard to validate answer:
• How many times expert right / wrong?
• Probably can’t explain decision in detail
• Humans are not good at intuitive statistics

More Data!
SQFT  BEDS  BATHS  ADDRESS             LOCATION           LOT SIZE  YEAR BUILT  PARKING SPOTS  LATITUDE    LONGITUDE    SOLD
2424  4     3.0    1522 NW Jonquil     Timberhill SE 2nd  5227      1991        2              44.594828   -123.269328  360000
1785  3     2.0    7360 NW Valley Vw   Country Estates    25700     1979        2              44.643876   -123.238189  307500
1003  2     1.0    2620 NW Chinaberry  Tamarack Village   4792      1978        2              44.593704   -123.295424  185000
4135  5     3.5    4748 NW Veronica    Suncrest           6098      2004        3              44.5929659  -123.306916  600000
1676  3     2.0    2842 NW Monterey    Corvallis          8712      1975        2              44.5945279  -123.291523  328500
1012  3     1.0    2320 NW Highland    Corvallis          9583      1959        2              44.591476   -123.262841  247000
3352  4     3.0    1205 NW Ridgewood   Ridgewood 2        60113     1975        2              44.579439   -123.333888  420000
2825        3.0    411 NW 16th         Wilkins Addition   4792      1938        1              44.570883   -123.272113  435350
Uhhhh……..

This is ML…
Price?
SQFT relates to Price?
DATA
SQFT  SALE PRICE
2424  360000
1785  307500
1003  185000
4135  600000
1676  328500
1012  247000
3352  420000
2825  435350
MODEL
PRICE = 125.3*SQFT + 96535
INSTANCE → PREDICTION
“a field of study that gives computers the ability to learn without being explicitly programmed”
Professor Arthur Samuel, 1959

Supervised Learning
The last column(s) of each table are the label(s).
Classification
animal    state   …  proximity  action
tiger     hungry  …  close      run
elephant  happy   …  far        take picture
…         …       …  …          …
Regression
animal  state   …  proximity  min_kmh
tiger   hungry  …  close      70
hippo   angry   …  far        10
…       …       …  …          …
Multi-Label Classification
animal    state   …  proximity  action1       action2
tiger     hungry  …  close      run           look untasty
elephant  happy   …  far        take picture  call friends
…         …       …  …          …             …

Decision Trees
Example splits:
• Website Visits > 0
• Minutes Used > 200
• Last Bill > $180
• Last Bill > $180 and Support Calls > 0

Why Decision Trees
• Works for classification or regression
• Easy to understand: splits are features and values
• Lightweight and super fast at prediction time
• Relatively parameter free
• Data can be messy
• Useless features are automatically ignored
• Works with un-normalized data
• Works with missing data
• Resilient to outliers
• Well suited for non-linear problems
• Top performer when combined into ensembles…

Handling Missing Data
Missing @   Decision Trees  KNN  Logistic Regression  Naive Bayes  Neural Networks  SVM
Training    Yes             No   No                   Yes          Yes*             No
Prediction  Yes             No   No                   Yes          No               No

Data Types
numeric: 1, 2.0, 3, -5.4
categorical: true, yes, red, mammal
date-time: 2013-09-25 10:02 (date as YYYY-MM-DD, time as HH:MM:SS), expanded into:
  YEAR: 2013
  MONTH: September
  DAY-OF-MONTH: 25
  DAY-OF-WEEK: Wednesday (M-T-W-T-F-S-D)
  HOUR: 10
  MINUTE: 02
text / items: Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.
  “great” appears 2 times
  “afraid” appears 1 time
  “born” appears 1 time
  “some” appears 2 times
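The date-time expansion on this slide can be sketched with Python's standard library. This is only an illustration of the idea of turning one timestamp into several fields; the function name `expand_datetime` and the output keys are assumptions and not BigML's implementation.

```python
from datetime import datetime

# Fixed English names so the output does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
MONTH_NAMES = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November",
               "December"]


def expand_datetime(text):
    """Split a 'YYYY-MM-DD HH:MM' timestamp into separate fields."""
    moment = datetime.strptime(text, "%Y-%m-%d %H:%M")
    return {
        "year": moment.year,
        "month": MONTH_NAMES[moment.month - 1],
        "day-of-month": moment.day,
        "day-of-week": DAY_NAMES[moment.weekday()],  # Monday is index 0
        "hour": moment.hour,
        "minute": moment.minute,
    }


fields = expand_datetime("2013-09-25 10:02")
for name, value in fields.items():
    print(f"{name}: {value}")
```

Each generated field can then be treated as an ordinary numeric or categorical input, which lets a model pick up patterns such as weekday or hour-of-day effects.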

Text Analysis
Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.
Bag of Words
great: appears 4 times

Text Analysis
Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon ’em.
great  afraid  born  achieve
4      1       1     1
…      …       …     …
Model (example splits on token counts):
• The token “great” does not occur
• The token “afraid” occurs more than once
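A bag-of-words count like the one in the table can be sketched as follows. The count of 4 for "great" only works out if "greatness" is reduced to "great", so this sketch assumes a stemming step; `naive_stem` is a hypothetical stand-in that strips a trailing "ness" and is not a real stemmer or BigML's tokenizer.

```python
import re
from collections import Counter


def naive_stem(token):
    """Hypothetical stemmer: only strips a trailing 'ness'."""
    if token.endswith("ness") and len(token) > 4:
        return token[:-4]
    return token


def bag_of_words(text):
    """Lowercase the text, split it into words, stem, and count terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(naive_stem(token) for token in tokens)


quote = ("Be not afraid of greatness: some are born great, some achieve "
         "greatness, and some have greatness thrust upon 'em.")
counts = bag_of_words(quote)

for term in ("great", "afraid", "born", "achieve"):
    print(term, counts[term])
```

The resulting counts become numeric columns, so a tree can split on conditions such as whether a token occurs at all or occurs more than once.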

Learning Problems (fit)
Under-fitting
• Model does not fit well enough
• Does not capture the underlying trend of the data
• Change algorithm or features
Over-fitting
• Model fits too well and does not “generalize”
• Captures the noise or outliers of the data
• Change algorithm or filter outliers

Why Not Decision Trees
• Slightly prone to over-fitting
  • But we’ll fix this with ensembles
• Splitting prefers decision boundaries that are parallel to feature axes
  • More data
• Predictions outside training data can be problematic
  • We’ll fix this with model competence
• Can be sensitive to small changes in training data

Evaluation
DATASET
TRAIN SET
TEST SET
PREDICTIONS
METRICS
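The flow on this slide (a dataset split into a train set and a test set, predictions made on the test set, then metrics) starts with the split itself. A minimal sketch, assuming an 80/20 split and a fixed seed; the ratio, the seed, and the helper name `split_dataset` are illustrative choices, not taken from the slides.

```python
import random


def split_dataset(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and hold out a fraction of them as the test set."""
    shuffled = list(rows)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded for repeatable splits
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


dataset = list(range(10))  # stand-in rows; real rows would be records
train_set, test_set = split_dataset(dataset)
print("train:", sorted(train_set))
print("test: ", sorted(test_set))
```

The model is built only from the train set, and the metrics are computed only on the test set, so the evaluation reflects rows the model has never seen.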

Accuracy
$$\text{Accuracy} = \frac{TP + TN}{\text{Total}}$$
• “Percentage correct” - like an exam
• = 1 then no mistakes
• = 0 then all mistakes
• Intuitive but not always useful
• Watch out for unbalanced classes!

Accuracy
Positive class = Fraud, Negative class = Not Fraud
Classified as Fraud: TP = 0, FP = 0
Classified as Not Fraud: TN = 7, FN = 3
ACC = 70%
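The 70% figure can be reproduced directly from the four counts. A minimal sketch; the function name `accuracy` and its keyword arguments are assumptions for illustration.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all instances that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total


# Counts from the slide: nothing is ever classified as Fraud.
acc = accuracy(tp=0, tn=7, fp=0, fn=3)
print(f"ACC = {acc:.0%}")
```

With these counts the classifier never flags a single fraud case and still scores 70%, which is why accuracy alone can mislead on unbalanced classes.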

Precision
$$\text{Precision} = \frac{TP}{TP + FP}$$
• “accuracy” of positive class
• = 1 then no FP
• = 0 then no TP

Precision
Positive class = Fraud, Negative class = Not Fraud
Classified as Fraud: TP = 2, FP = 2
Classified as Not Fraud: TN = 5, FN = 1
P = 50%
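A small sketch of the precision calculation for these counts. Returning `None` when nothing was classified as positive is an assumption made here to avoid dividing by zero; the slides do not say how that case is handled.

```python
def precision(tp, fp):
    """Share of instances classified as positive that really are positive."""
    if tp + fp == 0:
        return None  # assumption: undefined when nothing is classified positive
    return tp / (tp + fp)


print(precision(tp=2, fp=2))  # counts from the slide
print(precision(tp=0, fp=0))  # a classifier that never predicts Fraud
```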

Recall
$$\text{Recall} = \frac{TP}{TP + FN}$$
• percentage of positive class correctly identified
• = 1 then no FN
• = 0 then no TP

Recall
Positive class = Fraud, Negative class = Not Fraud
Classified as Fraud: TP = 2, FP = 2
Classified as Not Fraud: TN = 5, FN = 1
R = 66%
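Recall for the same counts can be sketched the same way. As with the other snippets, the function name is illustrative, and returning `None` when there are no positive instances at all is an assumption rather than something stated in the slides.

```python
def recall(tp, fn):
    """Share of truly positive instances that were classified as positive."""
    if tp + fn == 0:
        return None  # assumption: undefined when there are no positives
    return tp / (tp + fn)


print(recall(tp=2, fn=1))  # counts from the slide: 2 of 3 fraud cases found
print(recall(tp=0, fn=3))  # a classifier that never predicts Fraud finds none
```

The second call shows what accuracy hid in the earlier example: a classifier that never predicts Fraud has a recall of zero.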

f-Measure
$$f = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$$
• harmonic mean of Recall & Precision
• = 1 then Recall = Precision = 1
• If Precision OR Recall is small then
f-measure is small

f-Measure
Positive class = Fraud, Negative class = Not Fraud
R = 66%
P = 50%
f = 57%
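The f-measure for these values, and the way it is pulled down when either input is small, can be sketched as follows. The function name `f_measure` and the choice to return 0.0 when both inputs are zero are assumptions for illustration; the 0.9 / 0.1 pair is a made-up example.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumption: treat the 0/0 case as the worst score
    return 2 * recall * precision / (recall + precision)


# Values from the slide: P = 2/4 and R = 2/3.
f_slide = f_measure(precision=2 / 4, recall=2 / 3)
print(f"f = {f_slide:.0%}")

# Made-up lopsided case: the arithmetic mean would be 0.5.
f_lopsided = f_measure(precision=0.9, recall=0.1)
print(f"f = {f_lopsided:.2f}")
```

In the lopsided case the harmonic mean stays close to the smaller of the two values, so a model cannot score well by excelling at only one of them.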

Phi Coefficient
$$\phi = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
• Returns a value between -1 and 1
• -1 then predictions are opposite reality
• 0 no correlation between predictions
and reality
• 1 then predictions are always correct

Phi Coefficient
Positive class = Fraud, Negative class = Not Fraud
Classified as Fraud: TP = 2, FP = 2
Classified as Not Fraud: TN = 5, FN = 1
Phi = 0.356
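The value 0.356 follows from the formula and the four counts. A minimal sketch; returning 0.0 when the denominator is zero is a common convention adopted here as an assumption, not something stated in the slides.

```python
import math


def phi_coefficient(tp, tn, fp, fn):
    """Correlation between predicted and actual classes, from -1 to 1."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumption: no measurable correlation in this case
    return (tp * tn - fp * fn) / denominator


phi = phi_coefficient(tp=2, tn=5, fp=2, fn=1)
print(f"Phi = {phi:.3f}")
```

Unlike accuracy, the coefficient uses all four cells of the confusion matrix, so it is not inflated by a large negative class.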

Evaluations

Mean Absolute Error
(Figure: errors e1 … e7 between each data point and the model's prediction)
$$\text{MAE} = \frac{|e_1| + |e_2| + \dots + |e_n|}{n}$$

Mean Squared Error
(Figure: errors e1 … e7 between each data point and the model's prediction)
$$\text{MSE} = \frac{e_1^2 + e_2^2 + \dots + e_n^2}{n}$$

MSE / MAE
• For both MAE & MSE: smaller is better, but values are unbounded
• The square root of MSE (RMSE) is always larger than or equal to MAE; MSE itself can be smaller than MAE when the errors are below 1
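How the two measures compare can be checked numerically. In this sketch the two error lists are made-up illustrative values and the function names are assumptions. It shows that MSE can fall below MAE when errors are smaller than 1, while the square root of MSE (RMSE) never does.

```python
import math


def mean_absolute_error(errors):
    """Average of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)


def mean_squared_error(errors):
    """Average of the squared errors."""
    return sum(e * e for e in errors) / len(errors)


# Made-up error lists for illustration only.
small_errors = [0.5, -0.5, 0.25, -0.25]
large_errors = [3.0, -1.0, 4.0, -2.0]

for errors in (small_errors, large_errors):
    mae = mean_absolute_error(errors)
    mse = mean_squared_error(errors)
    rmse = math.sqrt(mse)
    print(f"MAE={mae:.4f}  MSE={mse:.4f}  RMSE={rmse:.4f}")
```

Because MSE is in squared units, comparing it with MAE directly depends on the scale of the target; RMSE is in the same units as MAE and is the fairer comparison.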

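R Squared Error
(Figure: errors e1 … e7 between the data points and the model, and deviations v1 … v7 between the data points and the mean)
$$\text{RSE} = 1 - \frac{\text{MSE}_{\text{model}}}{\text{MSE}_{\text{mean}}}$$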

R-Squared Error
• RSE: measure of how much better the model is than always predicting the mean
• < 0 model is worse than the mean
• = 0 model is no better than the mean
• = 1 model fits the data perfectly
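The three reference points listed above (below 0, exactly 0, exactly 1) can be reproduced with a short sketch. The SOLD and PREDICT values are copied from the house-price slide earlier in the deck; the function name `r_squared` is an assumption for illustration.

```python
def r_squared(actual, predicted):
    """1 - MSE(model) / MSE(always predicting the mean of the actual values)."""
    n = len(actual)
    mean_value = sum(actual) / n
    mse_model = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    mse_mean = sum((a - mean_value) ** 2 for a in actual) / n
    return 1 - mse_model / mse_mean


# SOLD and PREDICT columns from the house-price slide.
sold = [360000, 307500, 185000, 600000, 328500, 247000, 420000, 435350]
predict = [400262, 320195, 222211, 614651, 306538, 223339, 516541, 450508]

mean_sold = sum(sold) / len(sold)
print("linear model:    ", r_squared(sold, predict))
print("perfect model:   ", r_squared(sold, sold))
print("always the mean: ", r_squared(sold, [mean_sold] * len(sold)))
print("reversed order:  ", r_squared(sold, sorted(sold, reverse=True)))
```

The last call feeds in deliberately mismatched predictions to show that the score can drop below zero when a model does worse than the mean.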
