Shift Conference
Transfer Learning
BETTER MACHINE LEARNING
WITH LESS DATA
May 31st, 2019
Split, Croatia
• CTO of indico
• B2B Intelligent Process Automation company based in Boston
• Working on deep learning-based transfer learning since 2013
• Guy that plays with embeddings all day
• Vegan baker
Transfer Learning
1. What is deep learning?
2. What makes it so effective?
3. What’s the catch?
4. Opening the “black box”
5. The unreasonable effectiveness of embeddings
6. What makes a good embedding?
“DeepMind’s Go-playing AI doesn’t need human help to beat us anymore”
- The Verge

“New AI Development So Advanced It's Too Dangerous To Release, Says Scientists”
- Forbes

“AI defeated a top-tier 'Dota 2' esports team. OpenAI is also inviting everyone to play.”
- Engadget

“New AI Style Transfer Algorithm Allows Users to Create Millions of Artistic Combinations”
- Nvidia
Network Models?
[Timeline figure, 1940 → 1980 → Today: Hebbian learning, spike-timing-dependent plasticity, backprop, all-or-nothing neurons all wired together, ReLUs/convolution/recurrence, with commentary on where neuroscience and machine learning diverge: "maybe this is actually the opposite of how things work?", "oh, I guess this doesn't really work in machine learning", "connectivity in the brain is complex, all-or-nothing isn't an absolute rule", "non-linearities are critical, step functions don't work that well"]
“Neuroscientists have long
criticised [sic] deep learning
algorithms as incompatible with
current knowledge of
neurobiology.”
- Yoshua Bengio et al
Towards Biologically Plausible Deep
Learning (2015)
What’s the big deal?
AlexNet: the shot heard round the world
[Chart: classification accuracy over time approaching human accuracy]
But Why?
Let’s go on an adventure…
“Traditional” Machine Learning
What you have → ??? → What you need
Count Vectorizer
[ # of times word 0 shows up, # of times word 1 shows up, … ]
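A minimal sketch of this idea with scikit-learn's CountVectorizer (the vocabulary accessor name varies slightly across scikit-learn versions):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the fox jumps over the dog", "the lazy dog sleeps"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)      # sparse matrix: one row of word counts per document

print(vectorizer.get_feature_names_out())    # learned vocabulary (get_feature_names on older versions)
print(counts.toarray())                      # per-document word counts
```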
TF-IDF (Term Frequency, Inverse Document Frequency)

$f_{t,d}$ = number of times term $t$ appears in document $d$
$D$ = all documents
$T$ = full vocabulary
$v_d$ = document $d$'s tf-idf vector

$$v_d = \left[\; \frac{f_{t,d}}{\tfrac{1}{|D|}\sum_{d'=1}^{|D|} f_{t,d'}} \;\middle|\; t \in T \right]$$
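In practice you rarely implement this by hand; a sketch with scikit-learn's TfidfVectorizer, which applies the standard log-scaled IDF rather than the simplified weighting shown above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the fox jumps over the dog", "the lazy dog sleeps", "the fox and the dog play"]

tfidf = TfidfVectorizer()
vectors = tfidf.fit_transform(docs)   # one tf-idf vector per document, |T| columns

print(vectors.shape)                  # (number of documents, vocabulary size)
```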
The Problem With Text

John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation.

Feature(s)
• Name
• Gender
• Location
• Age
• Activity
• Prior Affliction/Treatment
• Travel
The Problem With Text

Problem: Linguistic Context
• Traditional solution: stemming, synonym sets, lexicons
• Traditional problem: brittle, labor-intensive, messy real-world data

Problem: Local Context
• Traditional solution: parse trees, n-grams, phrase lexicons
• Traditional problem: inaccurate parsing, limited context, messy real-world data

Problem: Out-of-Vocabulary Issues
• Traditional solution: lemmatization, expanded vocabulary, ignore
• Traditional problem: computationally expensive, diminishing returns, messy real-world data
Manual Feature Engineering → Select Features → Train Model → Evaluate Errors and View Test Error
The Philosophy of Traditional Learning
Raw Data (text, image, audio) → Features (tf-idf, SIFT) → Final Model → Outcome (outputs)
The Philosophy of Deep Learning
Raw Data (text, image, audio) → Features (statistical features derived from the data) → Final Model → Outcome (outputs)
What’s going on inside of a network model
Credit: Zeiler and Fergus (2014)
Enter Embeddings: Transfer Learning
What are text embeddings?
[0.1, 0.2, 0.8, 0.1, 0.3, 0.6, 0.8, 0.3, 0.5]
What is an Embedding?
An Embedding Method (e.g. Word2Vec) maps words from a Text Space (e.g. English) onto a Manifold (e.g. R^300), turning each word into a vector like [0.1, 0.2, 0.8, 0.1, 0.3, 0.6, 0.8, 0.3, …].
What is an Embedding?
The Embedding Method (e.g. Word2Vec) learns from Linguistic Context (e.g. Wikipedia) to map the Text Space (e.g. English) into an Embedding Space (e.g. R^300), producing vectors like [0.1, 0.2, 0.8, 0.1, 0.3, 0.6, 0.8, 0.3, …].
Pitfalls
• Sufficient, Diverse Linguistic Context
• Clean Test/Train Splits
• The Curse of Dimensionality
• Effective Benchmarking
How do Embeddings Work?
[Figure: "King" - man + woman ≈ "Queen", an offset along a royalty-like direction]
• Meaning is “encoded” into the embedding space
• Individual dimensions are not human-interpretable
• The embedding method learns by examining large corpora of generic language
• The goal is accurate language representation as a proxy for downstream performance
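A quick way to see this behaviour with pre-trained vectors, sketched with gensim (assumes the glove-wiki-gigaword-100 vectors can be downloaded):

```python
import gensim.downloader as api

# pre-trained GloVe vectors (downloaded on first use)
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman ~= queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```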
“Word” Embeddings

Examples: Word2vec, GloVe, fastText

In Practice: a lookup table from token to vector
“great” → [0.1, 0.3, …]
… → …

Training (see the sketch below):
• CBOW: "The quick brown fox _____ over the lazy dog" (predict the word from its context)
• Skip Gram: "___ ___ ____ ___ jumps ___ __ ___ ___" (predict the context from the word)
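A minimal training sketch with gensim's Word2Vec, where sg=1 selects skip-gram and sg=0 CBOW (parameter names follow gensim 4.x; the corpus here is a toy stand-in):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "dog", "sleeps"],
]

# skip-gram model with 50-dimensional vectors over the toy corpus
model = Word2Vec(sentences, vector_size=50, window=4, min_count=1, sg=1, epochs=50)

print(model.wv["fox"][:5])                  # the learned vector for "fox"
print(model.wv.most_similar("dog", topn=2)) # nearest neighbours in the embedding space
```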
Do They Really Preserve Algorithmic Value?
• Embeddings generally outperform raw text at low data volumes
• Leveraging large, generic text corpora improves generalizability
• This is 4-year-old tech. Embeddings have improved drastically. Text has not.

Reported numbers are the average of 5 runs of randomly sampled test/train splits, each reporting the average of a 5-fold CV within which logistic regression hyperparameters are optimized. Generated using Enso.

[Chart: GloVe benchmark (movie review sentiment analysis); accuracy from 0.5 to 0.9 vs. number of data points from 50 to 500; series: tf-idf, GloVe]
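A rough sketch of the kind of comparison behind this chart: average pre-trained GloVe vectors per document, then evaluate a logistic regression with cross-validation, exactly as you would with tf-idf features. The texts and labels below are toy stand-ins, and the full Enso protocol (repeated runs, hyperparameter search) is omitted:

```python
import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# pre-trained 100-dimensional GloVe vectors (downloaded on first use)
glove = api.load("glove-wiki-gigaword-100")

def embed(text):
    """Average the GloVe vectors of in-vocabulary tokens (zeros if none match)."""
    vecs = [glove[tok] for tok in text.lower().split() if tok in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(glove.vector_size)

# toy stand-ins for a small labelled sample of movie reviews
texts = ["a wonderful moving film", "charming and beautifully acted",
         "an instant classic", "i loved every minute of it",
         "dull and painfully slow", "a complete mess",
         "i wanted my money back", "tedious from start to finish"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = np.vstack([embed(t) for t in texts])
clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, labels, cv=4).mean())  # the real benchmark uses 5-fold CV over 5 runs
```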
Problems with
Small Data
Add Linguistic Context (Semantics)
Add Local Context
Prevent Out of Vocabulary Issues
Text Embeddings

Examples: Doc2vec, ELMo, ULMFiT

In Practice: often built on top of pre-trained word embeddings

Training:
[Figure: "The quick brown fox jumps over the lazy" with each token mapped to its embedding vector; a language-modeling head predicts the next token ("dog") while a supervised head predicts the task label ("True")]
Text Embeddings: CNN-Style
[Figure: the token embeddings for "The quick brown fox jumps over the lazy" fed through convolutional layers to produce a prediction]
Example: https://arxiv.org/pdf/1408.5882.pdf
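A minimal PyTorch sketch in the spirit of the Kim (2014) paper linked above (not its exact implementation): convolutions of a few widths over the token embeddings, max-over-time pooling, then a linear classifier:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Minimal CNN-style sentence classifier over (optionally pre-trained) word embeddings."""
    def __init__(self, vocab_size, embed_dim=300, num_filters=100, kernel_sizes=(3, 4, 5), num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # could be initialized from GloVe
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes]
        )
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]  # max-over-time pooling
        return self.fc(torch.cat(pooled, dim=1))       # (batch, num_classes)

# toy usage: a batch of 2 "sentences" of 8 token ids each
logits = TextCNN(vocab_size=10000)(torch.randint(0, 10000, (2, 8)))
```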
Text Embeddings: RNN-Style
[Figure: the token embeddings for "The quick brown fox jumps over the lazy" processed left to right; at each step a memory vector (initialized to zeros) is combined with the current token through a non-linearity σ to produce an output and an updated memory, ending in a prediction]
Example: https://arxiv.org/pdf/1802.05365.pdf
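A correspondingly minimal recurrent sketch: an LSTM carries the memory vector (initialized to zeros) across tokens, and the final hidden state feeds the classifier. ELMo itself (linked above) stacks bidirectional language-model layers; this only shows the basic recurrent shape:

```python
import torch
import torch.nn as nn

class TextRNN(nn.Module):
    """Minimal recurrent text classifier over word embeddings."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=256, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # could be initialized from GloVe
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)          # (batch, seq_len, embed_dim)
        outputs, (h_n, c_n) = self.lstm(x)     # h_n: final hidden "memory" per sequence
        return self.fc(h_n[-1])                # classify from the last hidden state

logits = TextRNN(vocab_size=10000)(torch.randint(0, 10000, (2, 8)))
```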
Add Linguistic Context (Semantics)
Add Local Context
Prevent Out of Vocabulary Issues
Problems with
Small Data
The Power of Context

"We used a bytepair encoding (BPE) vocabulary… significantly improving upon the state of the art in 9 out of the 12 tasks studied"
- Improving Language Understanding by Generative Pre-Training*

* https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
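The practical effect of BPE is that unseen words never fall out of vocabulary; they are broken into known subword units instead. A toy illustration with a hand-picked merge table (hypothetical merges, not the GPT vocabulary, which uses tens of thousands of learned merges):

```python
# toy byte-pair encoding: greedily apply learned merges to a character sequence
merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("n", "e"), ("ne", "w")]  # hypothetical learned merges

def bpe_encode(word):
    symbols = list(word)
    for a, b in merges:                          # apply merges in learned order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]       # fuse the pair into one symbol
            else:
                i += 1
    return symbols

print(bpe_encode("lower"))   # ['low', 'er']
print(bpe_encode("newest"))  # ['new', 'e', 's', 't'] -- unseen word, no OOV token needed
```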
Problems with
Small Data
Add Linguistic Context (Semantics)
Add Local Context
Prevent Out of Vocabulary Issues
Do They Really Preserve Algorithmic Value?
• Newer transfer learning techniques have made deep learning at low data volumes tractable
• Even when operating on top of byte-pair encodings, sufficient context is retained to achieve state-of-the-art performance
• 4x error reduction over tf-idf

Reported numbers are the average of 5 runs of randomly sampled test/train splits, each reporting the average of a 5-fold CV within which logistic regression hyperparameters are optimized. Generated using Enso.

[Chart: Finetune benchmark (movie review sentiment analysis); accuracy from 0.5 to 0.9 vs. number of data points from 50 to 500; series: tf-idf, GloVe, Finetune]
Treat it like any other feature vector
Thank You
SLATER VICTOROFF
slater@indico.io