Morning class summary
Mercè Martín
BigML
Day 2
The Future of ML
José David Martín-Guerrero (IDAL, UV)
Machine learning project
All steps are connected and feedback is essential to succeed
Society has drifted to the Machine Learning way
social networks, data acquisition, technologies...
Feature engineering challenges
High space dimensionality (#features >>> #samples)
Input preparation: selection, transformation, or attacking the model directly with the raw inputs
Modelling strategies: paradox of choice
Too many algorithms and structures, no general-purpose one?
Too many configuration options, no automatic choice?
Select your model by its structure, its parameters (tuning) or the
search algorithm (e.g. deep learning: no feature engineering
but heavy tuning; Azure: many choices to make)
Wish list: more automation
Workflows, model selection, tuning, representation,
prediction strategies
The Future of ML
Existing techniques: Reinforcement learning
Is the environment definable as a state space?
Is the evolution of this space driven by a set of actors?
Is there a goal to be maximized in the long term?
If so, the problem is suitable for RL
Key ingredients: prior experience, interaction, adaptation to the environment, policy
So far applied to synthetic problems and robotics but also suitable for
marketing or medicine, and more to come!
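To make those ingredients concrete (state space, actors, a long-term goal, a learned policy), here is a minimal tabular Q-learning sketch. It is not from the talk, and the Gym-style environment object (`reset`/`step`) is a hypothetical stand-in:

```python
import numpy as np

# Minimal tabular Q-learning sketch. `env` is a hypothetical Gym-style
# environment: reset() -> state, step(action) -> (state, reward, done).
def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy policy: explore sometimes, exploit otherwise
            a = (np.random.randint(n_actions) if np.random.rand() < epsilon
                 else int(np.argmax(Q[s])))
            s_next, reward, done = env.step(a)
            # long-term goal: bootstrap on the best value of the next state
            Q[s, a] += alpha * (reward + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q  # the learned policy is argmax over actions in each state
```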
Evaluating ML Algorithms II
GOLDEN RULE: Never use the same
example for training the model and
evaluating it!!
What if you don't have so much data? Sample and repeat!
José Hernández-Orallo (UPV)
Under-fitting: too general
Over-fitting: too specific
How can we detect them? By evaluating
Evaluating ML Algorithms II
[Diagram: the data is split into training and test sets; learning on the training split yields hypotheses h1…hn, each evaluated on the test split, repeated n times over n folds]
Cross-validation
o We take all possible combinations with n−1 folds for training and the remaining fold for test.
o The error (or any other metric) is calculated n times and then averaged.
o A final model is trained with all the data.
Bootstrapping
o We draw n samples with replacement for training and test on the remaining (out-of-bag) instances.
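A minimal sketch of both resampling schemes with scikit-learn and numpy (assuming numpy arrays `X` and `y` are already loaded; the decision tree is just a placeholder estimator):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()

# n-fold cross-validation: train on n-1 folds, test on the held-out fold,
# and average the metric over the n repetitions.
scores = cross_val_score(model, X, y, cv=KFold(n_splits=10, shuffle=True))
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Bootstrapping: draw n instances with replacement for training and
# evaluate on the left-out ("out-of-bag") instances.
n = len(X)
boot = np.random.choice(n, size=n, replace=True)
oob = np.setdiff1d(np.arange(n), boot)
model.fit(X[boot], y[boot])
print("OOB accuracy:", model.score(X[oob], y[oob]))

# A final model is then trained with all the data.
model.fit(X, y)
```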
Evaluating ML Algorithms II
Cost-sensitive evaluations: not all errors are equally costly
Resulting matrix = Cost matrix ∘ Confusion matrix (element-wise Hadamard product)

Cost matrix (rows: actual, columns: predicted):
          open     close
OPEN        0€      100€
CLOSE   2,000€        0€

Confusion matrices:
c1        open     close
OPEN       300       500
CLOSE      200    99,000

c2        open     close
OPEN         0         0
CLOSE      500    99,500

c3        open     close
OPEN       400     5,400
CLOSE      100    94,100

Resulting matrices:
c1: 0€, 50,000€ / 400,000€, 0€ → TOTAL COST: 450,000€
c2: 0€, 0€ / 1,000,000€, 0€ → TOTAL COST: 1,000,000€
c3: 0€, 540,000€ / 200,000€, 0€ → TOTAL COST: 740,000€

External context: the set of classes and the cost estimation.
Confusion matrix & cost matrix can be characterized by just one number: the slope
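The totals above can be reproduced with an element-wise product and a sum; a minimal sketch using the c1 numbers from the slide:

```python
import numpy as np

# Cost matrix (rows: actual OPEN/CLOSE, columns: predicted open/close), in EUR.
cost = np.array([[0,    100],
                 [2000,   0]])

# Confusion matrix c1 from the slide, same layout.
confusion_c1 = np.array([[300,   500],
                         [200, 99000]])

# Hadamard (element-wise) product, then sum: the classifier's total cost.
total = (cost * confusion_c1).sum()
print(total)  # 450000 EUR, matching the slide
```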
Evaluating ML Algorithms II
ROC (Receiver Operating Characteristic) analysis
Dynamic context (class distribution & cost matrix)
[ROC diagram: TPR on the y-axis vs FPR on the x-axis, both ranging from 0 to 1]
o Given several classifiers:
 We add the trivial classifiers (0,0) and (1,1) and construct the convex hull of their (FPR, TPR) points. Points on the hull's edges are linear combinations of classifiers: p · Ca + (1−p) · Cb
 The classifiers below the ROC convex hull are discarded
 The best classifier (from those remaining) will be selected at application time, according to the slope
Probabilistic context: soft ROC analysis
A single classifier with probability-weighted predictions can generate a whole ROC curve by varying the score threshold (each threshold yields a new classifier on the curve)
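A minimal sketch of that soft-ROC idea: sweeping the score threshold of one probabilistic classifier traces a whole curve of (FPR, TPR) points (assumes `y_true` in {0, 1} and a `scores` array from any scorer):

```python
import numpy as np

def roc_points(y_true, scores):
    """Each score threshold yields one (FPR, TPR) point; sweeping all
    thresholds traces the ROC curve of a single probabilistic classifier."""
    order = np.argsort(-scores)            # sort instances by descending score
    y = np.asarray(y_true)[order]
    P, N = y.sum(), len(y) - y.sum()
    tpr = np.cumsum(y) / P                 # true positive rate at each cut
    fpr = np.cumsum(1 - y) / N             # false positive rate at each cut
    # prepend the trivial (0,0) classifier; the last cut is the trivial (1,1)
    return np.r_[0.0, fpr], np.r_[0.0, tpr]
```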
Evaluating ML Algorithms II
AUC (Area Under the ROC Curve)
For crisp classifiers AUC is equivalent to the macro-averaged accuracy.
AUC is a good metric for classifiers and rankers:
A classifier with high AUC is a good ranker.
It is also good over a (uniform) range of operating conditions:
A model with very good AUC will have good accuracy for all operating conditions.
A model with very good accuracy for one operating condition can have very bad accuracy for another.
A classifier with high AUC can still have poor calibration (probability estimation).
Multiclass classification? ROC analysis becomes problematic; AUC has been extended.
Regression? ROC has been extended; AUC corresponds to the error variance.
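A useful reading of AUC, behind the "good ranker" claim: it is the probability that a random positive is scored above a random negative. A small sketch computing it directly from that definition (toy data, not from the course):

```python
import numpy as np

def auc(y_true, scores):
    """AUC = P(score of a random positive > score of a random negative),
    i.e. the Mann-Whitney U statistic normalized by the number of pairs."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).mean()   # positive outranks negative
    ties = (pos[:, None] == neg[None, :]).mean()  # ties count half
    return wins + 0.5 * ties

y = np.array([1, 1, 0, 0, 1])
s = np.array([0.9, 0.7, 0.6, 0.2, 0.8])
print(auc(y, s))  # 1.0: every positive is ranked above every negative
```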
Cluster Analysis
[Figure: K-means clustering with K=3]
Poul Petersen (BigML)
Unsupervised problem (unlabelled data)
Uses: customer segmentation, item discovery (types), association (profiles), recommenders, active learning (group, then label)
Cluster Analysis
Distance and centers define the groups: K-means, but...
K-means: starting from a subset of K points, repeatedly compute the distances of all data points to them and assign each point to the closest. Define the center of each group as the new set of K points and repeat until there is no improvement (sketched below, after the list).
Problems: convergence (initial conditions), scaling of dimensions
Things you need to tackle:
• What is the distance to a “missing value”? Replace with defaults
• What is the distance between categorical values? Map it into [0,1]
• What is the distance between text features? Vectorize and use cosine distance
• Does it have to be Euclidean distance?
• Unknown “K”?
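A minimal numpy sketch of that iteration, assuming a numeric feature matrix `X` (real implementations also care about initialization, empty clusters and feature scaling, as the slide warns):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # start from a random subset of K points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distance of every point to every center (Euclidean), pick closest
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # the center of each group becomes the new set of K points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # no improvement: stop
            break
        centers = new_centers
    return labels, centers
```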
Cluster Analysis
[Figure: clustering with K=5]
g-means clustering: increment K, testing whether each cluster looks Gaussian
Unsupervised Data: Rank by dissimilarity
Why? Unusual instances, intrusion detection, fraud, incorrect data
• Given a group, try to single out the odd ones: remove outliers from the data
Dataset → Anomaly Detector → score → remove outliers (see the sketch below)
Can be used at different layers and combined with clustering
• Improve model competence: score prediction instances to find new instances dissimilar to the training instances (where the model is not competent)
• Compare against usual distributions: Gaussian, Benford's Law
Anomaly Detection
Poul Petersen (BigML)
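The "score and remove outliers" pipeline can be sketched with scikit-learn's IsolationForest, a close relative of the detector described below (assumes a feature matrix `X`; the contamination rate is a made-up choice):

```python
from sklearn.ensemble import IsolationForest

# Fit the detector, score every instance, and drop the most anomalous ones.
detector = IsolationForest(contamination=0.05, random_state=0)
inlier_mask = detector.fit_predict(X) == 1   # fit_predict returns -1 for outliers
X_clean = X[inlier_mask]
```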
Anomaly Detection
[Figure: instances grouped by shared features such as “round”, “skinny”, “corners”, “smooth”; the instance that fits no group is the most unusual]
What counts as anomalous differs according to the grouping features (prior knowledge)
Anomaly Detection
Grow a random decision tree (random features and random splits) until each instance is in its own leaf: anomalous instances are “easy” to isolate at shallow depth, normal ones are “hard” and end up deep.
Now repeat the process several times and assign an anomaly score (0 = similar, 1 = dissimilar) to any input instance by comparing its average isolation depth with the average depth over the training set.
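A toy sketch of the isolation idea: random axis-aligned splits until an instance is alone, with anomalies isolating at shallow depth. This is illustrative only; real isolation forests average many trees and normalize depths into the 0–1 score described above:

```python
import numpy as np

def isolation_depth(X, x, rng, depth=0, max_depth=30):
    """Depth at which instance x is isolated by random features and splits."""
    if len(X) <= 1 or depth >= max_depth:
        return depth
    f = rng.integers(X.shape[1])              # pick a random feature
    lo, hi = X[:, f].min(), X[:, f].max()
    if lo == hi:                              # nothing left to split on
        return depth
    split = rng.uniform(lo, hi)               # pick a random split point
    side = X[:, f] < split
    keep = side if x[f] < split else ~side    # follow x's branch only
    return isolation_depth(X[keep], x, rng, depth + 1, max_depth)

# Averaging the depth over many random trees gives the isolation score:
# rng = np.random.default_rng(0)
# mean_depth = np.mean([isolation_depth(X, x, rng) for _ in range(100)])
```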
Machine Learning Black Art
Charles Parker (BigML)
Even when you follow the
yellow brick road...
Different models
Feature engineering
Evaluation metrics
The house of horrors awaits you
around the corner:
Huge Hypothesis Space
Poorly Picked Loss Function
Cross Validation
Drifting Domain
Reliance on Research Results
Machine Learning Black Art
● Huge hypothesis space: the possible classifiers you could build with an
algorithm given the data. Choice!
Triple trade-off: hypothesis complexity vs. amount of data vs. generalization error
Use non-parametric methods
As data scales, simpler models become desirable
Big data often trumps modelling!
● Poorly picked loss function: standard loss functions (entropy, distance in a formal space) are mathematically convenient but not always enough for real problems.
They carry no info about the classes or the costs:
False positive in disease diagnosis
False positive in face detection
False positive in thumbprint identification
Path dependence
Game playing
Let developers apply their own loss function: SVMlight, plugins in the splitting code, customized gradient descent...
OR hack the prediction (cascade classifiers)
OR change the problem setting (time-based limits on the classifier, a maximum loss; keep the error down with a certain probability)
The more complex the setting, the more data you need.
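A lightweight way to "apply your own loss" without touching the training code is to reweight the classes so the errors you care about cost more; a sketch with scikit-learn (the 20x weight is a made-up example):

```python
from sklearn.linear_model import LogisticRegression

# Make errors on class 1 (e.g., a missed disease) 20x costlier than errors
# on class 0; the solver then minimizes this reweighted loss.
model = LogisticRegression(class_weight={0: 1, 1: 20})
model.fit(X_train, y_train)   # X_train, y_train assumed loaded
```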
Machine Learning Black Art
● Cross-validation
Hold-outs can lead to leakage: features or instances can be correlated across test and train sets, giving optimistic performance estimates.
Law of averages and being off by one
Features spuriously correlated with the target can bias predictions
(photo dating: colors, borders...)
Beware of the group the instances belong to
Aggregates and timestamps: instances close in time are highly correlated
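One guard against this kind of leakage is to split by group rather than by row, so correlated instances (same user, same day, same photo set) never straddle train and test. A sketch with scikit-learn's GroupKFold (assumes `model`, `X`, `y` and a `groups` array identifying each instance's group):

```python
from sklearn.model_selection import GroupKFold, cross_val_score

# GroupKFold keeps all instances of a group in the same fold, so the model
# is never tested on near-duplicates of its own training rows.
scores = cross_val_score(model, X, y, groups=groups,
                         cv=GroupKFold(n_splits=5))
```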
Machine Learning Black Art
● Drifting Domain
Domain changes (document classification, sales prediction)
Adverse selection of training data (market data predictions, spam)
➢ The prior p(input) is changing → covariate shift
➢ The mapping p(output | input) is changing → concept drift
Symptoms: lots of errors, distribution changes. Compare to old data!
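The "compare to old data" advice can be automated with a two-sample test per feature; a sketch using a Kolmogorov–Smirnov test (the arrays of one feature's values, old and recent, are assumed):

```python
from scipy.stats import ks_2samp

# A small p-value says this feature's distribution has shifted since
# training: a symptom of covariate shift worth investigating.
stat, p_value = ks_2samp(feature_train, feature_recent)
if p_value < 0.01:
    print("distribution change detected: compare to old data, retrain")
```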
● Reliance on Research results
Reality does not comply with theorems' initial assumptions (error bounds, sample complexity, convergence): those assumptions are often unrealistic
Rule of thumb:
Use academia as your starting point, but don’t
think it will solve all your problems. Keep learning
Useful Things about ML
Charles Parker (BigML)
Advice from Dijkstra
● Killing Ambitious Projects: identify sub-problems you can tackle (hard vs. easy; hacking is all right). Good candidates:
No human expert can predict in complex environments (protein folding)
Humans can't explain how they know f(x) (character recognition)
f(x) is changing all the time (market data)
f(x) must be specialized many times (anything user-specific)
● Ignoring the Lure of Complexity
Look for simplicity (remove spaghetti code, processes, drudgery)
Push around complexity (clever compression)
Raw data might carry the information; sometimes it is the right way to go
● Finding Your Own Humility
Know and embrace your own limits
Continuously learn
Do A/B tests: improve on an existing system
● Avoiding Useless Projects
Look for the best combination of easy and big win
Define metrics with experts but don't rely on them: monitor
Useful Things about ML
Advice From Dijkstra (continued)
● Creating a good story
Explain why and summarize your model and your data
Stories are more valuable than models
● Continuing to Learn
Don't get comfortable; work at the edge of your abilities
Understand your limitations
Learn from your errors
Summary:
ML can be of value for every organization: find where.
Locating the right problem, executing, showing the proof
When you win we all win, so good luck!!!