Machine Learning - Black Art
Charles Parker
Allston Trading
Machine Learning is Hard!
• By now, you know kind of a lot

• Different types of models

• Feature engineering

• Ways to evaluate

• But you’ll still fail!

• Out in the real world, there’s a
whole bunch of things that will kill
your project

• FYI - A lot of these talks are stolen
2
Join Me!
• On a journey into the Machine Learning House of
Horrors!

• Mwa ha ha!
3
The Machine Learning House of Horrors!
• The Horror of The Huge Hypothesis Space

• The Perils of The Poorly Picked Loss Function

• The Creeping Creature Called Cross Validation

• The Dread of the Drifting Domain

• The Repugnance of Reliance on Research Results
5
Choosing A Hypothesis Space
• By “hypothesis space” we
mean the possible classifiers
you could build with an
algorithm given the data

• This is the choice you make
when you pick a learning
algorithm

• You have one job!

• Is there any way to make it
easier?
6
Theory to The Rescue!
• Probably Approximately Correct

• We’d like our model to have error less than epsilon

• We’d like that to happen at least some percentage of the time

• If the error is epsilon, the percentage is sigma, the number of
training examples is m, and the hypothesis space size is d:
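
The formula from the slide isn't reproduced here; as a hedged reconstruction, the standard bound for a finite hypothesis space (writing the slide's "sigma" as the usual failure probability delta) is

  m \ge \frac{1}{\epsilon}\left(\ln d + \ln\frac{1}{\delta}\right)

Read it as a trade-off: a bigger hypothesis space d or a smaller allowed error epsilon both demand more training examples m.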
7
The Triple Trade-Off
• There is a triple trade-off between the error, the size
of the hypothesis space, and the amount of training
data you have
8
[Diagram: a triangle linking Error, Hypothesis Space, and Training Data]
What About Huge Data?
• I’m clever, so I’ll use non-
parametric methods (Decision
tree, k-NN, kernelized SVMs)

• As data scales, curious things
tend to happen

• Simpler models become more
desirable as they’re faster to fit.

• You can increase model
complexity by adding features
(maybe word counts)

• Big data often trumps modeling!
9
The Machine Learning House of Horrors!
• The Horror of The Huge Hypothesis Space

• The Perils of The Poorly Picked Loss Function

• The Creeping Creature Called Cross Validation

• The Dread of the Drifting Domain

• The Repugnance of Reliance on Research Results
10
A Dirty Little Secret About ML Algorithms
• They don’t care what you want; each optimizes its own built-in objective (a hedged summary follows below)

• Decision Trees:

• SVM:

• LR:

• LDA:
11
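
The per-algorithm formulas on the slide above are images; as a hedged reminder of what these methods typically optimize (not necessarily what the slide showed):

• Decision Trees: a per-split purity criterion such as information gain or Gini impurity

• SVM: hinge loss plus a margin penalty, \sum_i \max(0,\, 1 - y_i w^\top x_i) + \lambda \lVert w \rVert^2

• LR (logistic regression): log loss, -\sum_i [\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,]

• LDA (reading this as linear discriminant analysis): the likelihood of Gaussian class-conditional densities with a shared covariance

None of these is your application's loss, which is exactly the point.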
Real-world Losses
• Real losses are nothing like this

• False positive in disease
diagnosis

• False positive in face
detection

• False positive in thumbprint
identification

• Some aren’t even instance-
based

• Path dependencies

• Game playing
12
Specializing Your Loss
• One solution is to let developers apply their own loss

• This is the approach of SVMlight, which has been around for a while:

http://svmlight.joachims.org/

• In decision trees, losses other than mutual information can be plugged into the
appropriate place in the splitting code

• Models trained via gradient descent can obviously be customized (Python’s
Theano is interesting for this); a minimal sketch follows below

• For multi-example loss functions, there’s SEARN in Vowpal Wabbit

https://github.com/JohnLangford/vowpal_wabbit
13
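
As mentioned above, a minimal sketch of the gradient-descent route, using a made-up asymmetric logistic loss in plain NumPy (not the SVMlight or Theano machinery the slide refers to); the costs, names, and data are all illustrative:

  import numpy as np

  def asymmetric_logistic_grad(w, X, y, fp_cost=10.0, fn_cost=1.0):
      """Gradient of a per-example weighted logistic loss: mistakes on
      negatives (false positives) cost fp_cost, mistakes on positives
      (false negatives) cost fn_cost."""
      p = 1.0 / (1.0 + np.exp(-X @ w))            # predicted P(y = 1)
      cost = np.where(y == 1, fn_cost, fp_cost)   # cost of erring on each example
      return X.T @ (cost * (p - y)) / len(y)

  def fit(X, y, lr=0.1, steps=2000):
      w = np.zeros(X.shape[1])
      for _ in range(steps):
          w -= lr * asymmetric_logistic_grad(w, X, y)
      return w

  # Stand-in data: the learned model trades recall for precision
  # relative to vanilla logistic regression.
  rng = np.random.default_rng(0)
  X = rng.normal(size=(500, 3))
  y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(float)
  w = fit(X, y)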
Other Hackery
• Sometimes, the solution is just to hack
around the actual prediction

• Have several levels (cascade) of
classifiers in e.g., medical diagnosis, text
recognition

• Apply logic to explicitly avoid high loss
cases (e.g., when buying/selling equities)

• Changing the problem setting

• Will you be doing queries? Use ranking
or metric learning

• If you’re thinking “I want to do crazy thing x with
classifiers”, chances are it’s already been
done and you can read about it.
14
The Machine Learning House of Horrors!
• The Horror of The Huge Hypothesis Space

• The Perils of The Poorly Picked Loss Function

• The Creeping Creature Called Cross Validation

• The Dread of the Drifting Domain

• The Repugnance of Reliance on Research Results
15
When Validation Attacks!
• Cross validation

• n-Fold - Hold out one fold for
testing, train on n - 1 folds

• Great way to measure
performance, right?

• It’s all about information leakage

• via instances

• via features
16
Case Study #1: Law of Averages
• Estimate sporting event
outcomes

• Use previous games to
estimate points scored for
each team (via windowing
transform)

• Choose winner based on
predicted score

• What if you’re off by one on
the window?
17
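
A minimal pandas sketch of the off-by-one issue just described, using a tiny made-up game log; the shift(1) is what keeps the game being predicted out of its own feature window:

  import pandas as pd

  # Hypothetical game log: one row per team per game, already sorted by date.
  games = pd.DataFrame({
      "team":   ["A"] * 8 + ["B"] * 8,
      "date":   list(range(8)) * 2,
      "points": [21, 35, 17, 28, 31, 24, 14, 30,
                 10, 13, 27, 20, 38, 22, 16, 29],
  })

  # Average points over the previous 5 games, EXCLUDING the current one.
  # Dropping the shift(1) is the "off by one": the outcome of the game you
  # are predicting leaks into its own feature.
  games["avg_pts_last5"] = (
      games.groupby("team")["points"]
           .transform(lambda s: s.shift(1).rolling(5, min_periods=1).mean())
  )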
Case Study #2: Photo Dating
• Take scanned photos from
30 different users (on
average 200 per user) and
create a model to assign a
date taken (plus or minus
five years)

• Perform 10-fold cross-validation

• Accuracy is 85%. Can
you trust it?
18
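
A hedged sketch of how to check that 85%, with stand-in random data (so only the mechanics matter): plain 10-fold cross-validation lets photos from the same user (same camera, film stock, and scanner) land in both training and test folds, while group-aware folds keep each user on one side of the split.

  import numpy as np
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import GroupKFold, cross_val_score

  # Stand-in data: 30 users, 50 photos each, a few features, six date buckets.
  rng = np.random.default_rng(0)
  users = np.repeat(np.arange(30), 50)
  X = rng.normal(size=(len(users), 8))
  y = rng.integers(0, 6, size=len(users))

  model = RandomForestClassifier(n_estimators=50, random_state=0)
  naive  = cross_val_score(model, X, y, cv=10)   # photos from one user leak across folds
  honest = cross_val_score(model, X, y, cv=GroupKFold(n_splits=10), groups=users)
  print(naive.mean(), honest.mean())

With real photo data, the gap between the two numbers is roughly how much of the 85% is the model recognizing users rather than dating photos.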
Case Study #3: Moments In Time
• You have a buy/sell
opportunity every five
seconds

• The signals you use to
evaluate the opportunity
are aggregates of market
activity over the last five
minutes

• How careful must you be
with cross-validation?
19
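
Very careful. A hedged sketch using scikit-learn's TimeSeriesSplit (the gap argument needs a reasonably recent version): train strictly on the past and leave a gap at least as long as the feature window (60 five-second steps, i.e. five minutes) so no test example's aggregates overlap the training data.

  import numpy as np
  from sklearn.model_selection import TimeSeriesSplit

  n = 10_000                                  # one opportunity every five seconds
  X = np.random.normal(size=(n, 4))           # stand-in five-minute aggregates
  y = np.random.randint(0, 2, size=n)         # stand-in buy/sell label

  # gap=60 drops five minutes of samples between train and test, so the
  # aggregates feeding a test example never include training-period activity.
  tscv = TimeSeriesSplit(n_splits=5, gap=60)
  for train_idx, test_idx in tscv.split(X):
      assert train_idx.max() + 60 < test_idx.min()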
The Machine Learning House of Horrors!
• The Horror of The Huge Hypothesis Space

• The Perils of The Poorly Picked Loss Function

• The Creeping Creature Called Cross Validation

• The Dread of the Drifting Domain

• The Repugnance of Reliance on Research Results
20
Breaking Machine Learning
• You’ve got this great model!
Congratulations!

• Suddenly it stops working.
Why?

• You might be in a domain
that tends to change over
time (document classification,
sales prediction)

• You might be experiencing
adverse selection (market
data predictions, spam)
21
Concept Drift
• This is called non-stationarity in either the prior or the conditional
distributions

• Could be a couple of different things

• If the prior p(input) is changing, it’s covariate shift

• If the conditional p(output | input) is changing, it’s concept drift

• No rule that it can’t be both

• http://blog.bigml.com/2013/03/12/machine-learning-from-streaming-data-two-problems-two-solutions-two-concerns-and-two-lessons/
22
Take Action!
• First: Look for symptoms

• Getting a lot of errors

• The distribution of predicted values changes

• Drift detection algorithms (that I know about) have the same basic flavor:

• Buffer some data in memory

• If recent data is “different” from past data, retrain, update or give up (a minimal sketch follows below)

• Some resources - A nice survey paper and an open source package:
23
http://www.win.tue.nl/~mpechen/publications/pubs/Gama_ACMCS_AdaptationCD_accepted.pdf

http://moa.cms.waikato.ac.nz/
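
A minimal sketch of the two-window flavor described above, watching the distribution of predicted values with a KS test; real packages such as MOA do far more than this:

  from collections import deque
  from scipy.stats import ks_2samp

  class DriftMonitor:
      """Compare a buffer of recent predictions against a reference window
      and flag when the two distributions diverge."""
      def __init__(self, window=1000, alpha=0.01):
          self.reference = deque(maxlen=window)
          self.recent = deque(maxlen=window)
          self.alpha = alpha

      def update(self, predicted_value):
          if len(self.reference) < self.reference.maxlen:
              self.reference.append(predicted_value)   # still filling the baseline
              return False
          self.recent.append(predicted_value)
          if len(self.recent) < self.recent.maxlen:
              return False
          _, p = ks_2samp(list(self.reference), list(self.recent))
          return p < self.alpha                        # True: retrain, update, or give up

Feed it one prediction at a time as you score live traffic; when update() returns True, kick off retraining on the buffered data (and reset the windows).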
The Benefits of Archeology
• Why might you train on old
data, even if it’s not relevant?

• Verification of your research
process

• You’d have done the same thing last year. Did it work?

• Gives you a good idea of
how much drift you should
expect
24
The Machine Learning House of Horrors!
• The Horror of The Huge Hypothesis Space

• The Perils of The Poorly Picked Loss Function

• The Creeping Creature Called Cross Validation

• The Dread of the Drifting Domain

• The Repugnance of Reliance on Research Results
25
Publish or Perish
• Academic papers are a certain type of
result

• Show incremental improvement in
accuracy or generality

• Prove something about your
algorithm

• The latter is hard to come by as results
get more realistic

• Machine learning proofs assume data
is “i.i.d”, but this is obviously false.

• Real world data sucks, and dealing
with that significantly changes the
dataset
26
Usefulness of Results
• Theoretical Results

• Most of the time bounds do not apply (error, sample
complexity, convergence)

• Sometimes they don’t even make any sense

• Beware of putting too much faith in a single person or single
person’s work

• Usefulness generally occurs only in the aggregate

• And sometimes not even then (researchers are people, too)
27
Machine Learning Isn’t About Machine Learning
• Why doesn’t it work like in the
paper?

• Remember, the paper is carefully
controlled in a way your application
is not.

• Performance is rarely driven by
machine learning

• It’s driven by cameras and
microphones

• It’s driven by Mario Draghi
28
So, Don’t Bother With It?
• Of course not!

• What’s the alternative?

• “All our science, measured
against reality, is primitive
and childlike — and yet it is
the most precious thing we
have” - Albert Einstein

• Use academia as your
starting point, but don’t
think it will get you out of
the work
29
Some Themes
• The major points of this talk:

• Machine learning is hard to get right

• The algorithms won’t do what you want

• Good results are probably spurious

• Even if they aren’t, they won’t last

• Reading the research won’t help

• Wait, no!

• Have an attitude of skeptical optimism (or optimal skepticism?)
30
