Shift Conference
Transfer Learning
BETTER MACHINE LEARNING
WITH LESS DATA
May 31st, 2019
Split, Croatia
• CTO of indico
• B2B Intelligent Process Automation company based in Boston
• Working on deep learning-based transfer learning since 2013
• Guy that plays with embeddings all day
• Vegan baker
Transfer Learning
1. What is deep learning?
2. What makes it so effective?
3. What’s the catch?
4. Opening the “black box”
5. The unreasonable effectiveness of embeddings
6. What makes a good embedding?
“DeepMind’s Go-playing AI doesn’t need human help to beat us anymore”
- The Verge

“New AI Development So Advanced It's Too Dangerous To Release, Says Scientists”
- Forbes

“AI defeated a top-tier 'Dota 2' esports team. OpenAI is also inviting everyone to play.”
- Engadget

“New AI Style Transfer Algorithm Allows Users to Create Millions of Artistic Combinations”
- Nvidia
Network Models?
[Timeline figure, 1940 → 1980 → Today: Hebbian learning, spike-timing-dependent plasticity, backprop, all-or-nothing neurons all wired together, ReLUs/convolution/recurrence, with commentary on where neuroscience and machine learning diverge: "maybe this is actually the opposite of how things work?", "oh, I guess this doesn't really work in machine learning", "connectivity in the brain is complex, all-or-nothing isn't an absolute rule", "non-linearities are critical, step functions don't work that well"]
“Neuroscientists have long
criticised [sic] deep learning
algorithms as incompatible with
current knowledge of
neurobiology.”
- Yoshua Bengio et al
Towards Biologically Plausible Deep
Learning (2015)
What’s the big deal?
AlexNet: the shot heard round the world
[Chart: classification accuracy over time approaching human accuracy]
But Why?
Let’s go on an adventure…
“Traditional” Machine Learning
What you have → ??? → What you need
Count Vectorizer
[ # of times word 0 shows up, # of times word 1 shows up, … ]
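A minimal sketch of this idea with scikit-learn's CountVectorizer (the vocabulary accessor name varies slightly across scikit-learn versions):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the fox jumps over the dog", "the lazy dog sleeps"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)      # sparse matrix: one row of word counts per document

print(vectorizer.get_feature_names_out())    # learned vocabulary (get_feature_names on older versions)
print(counts.toarray())                      # per-document word counts
```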
TF-IDF (Term Frequency, Inverse Document Frequency)

$f_{t,d}$ = number of times term $t$ appears in document $d$
$D$ = all documents
$T$ = full vocabulary
$v_d$ = document $d$'s tf-idf vector

$$v_d = \left[\; \frac{f_{t,d}}{\tfrac{1}{|D|}\sum_{d'=1}^{|D|} f_{t,d'}} \;\middle|\; t \in T \right]$$
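In practice you rarely implement this by hand; a sketch with scikit-learn's TfidfVectorizer, which applies the standard log-scaled IDF rather than the simplified weighting shown above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the fox jumps over the dog", "the lazy dog sleeps", "the fox and the dog play"]

tfidf = TfidfVectorizer()
vectors = tfidf.fit_transform(docs)   # one tf-idf vector per document, |T| columns

print(vectors.shape)                  # (number of documents, vocabulary size)
```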
The Problem With Text

John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation.

Feature(s)
• Name
• Gender
• Location
• Age
• Activity
• Prior Affliction/Treatment
• Travel
The Problem With Text

Problem: Linguistic Context
• Traditional solution: stemming, synonym sets, lexicons
• Traditional problem: brittle, labor-intensive, messy real-world data

Problem: Local Context
• Traditional solution: parse trees, n-grams, phrase lexicons
• Traditional problem: inaccurate parsing, limited context, messy real-world data

Problem: Out-of-Vocabulary Issues
• Traditional solution: lemmatization, expanded vocabulary, ignore
• Traditional problem: computationally expensive, diminishing returns, messy real-world data
Manual Feature Engineering → Select Features → Train Model → Evaluate Errors and View Test Error
The Philosophy of Traditional Learning
Raw Data (text, image, audio) → Features (tf-idf, SIFT) → Final Model → Outcome (outputs)
The Philosophy of Deep Learning
Raw Data (text, image, audio) → Features (statistical features derived from the data) → Final Model → Outcome (outputs)
What’s going on inside of a network model
Credit: Zeiler and Fergus (2014)
Enter Embeddings: Transfer Learning
What are text embeddings?
[0.1, 0.2, 0.8, 0.1, 0.3, 0.6, 0.8, 0.3, 0.5]
What is an Embedding?
An Embedding Method (e.g. Word2Vec) maps words from a Text Space (e.g. English) onto a Manifold (e.g. R^300), turning each word into a vector like [0.1, 0.2, 0.8, 0.1, 0.3, 0.6, 0.8, 0.3, …].
What is an Embedding?
The Embedding Method (e.g. Word2Vec) learns from Linguistic Context (e.g. Wikipedia) to map the Text Space (e.g. English) into an Embedding Space (e.g. R^300), producing vectors like [0.1, 0.2, 0.8, 0.1, 0.3, 0.6, 0.8, 0.3, …].
Pitfalls
• Sufficient, Diverse Linguistic Context
• Clean Test/Train Splits
• The Curse of Dimensionality
• Effective Benchmarking
How do Embeddings Work?
[Figure: "King" - man + woman ≈ "Queen", an offset along a royalty-like direction]
• Meaning is “encoded” into the embedding space
• Individual dimensions are not human-interpretable
• The embedding method learns by examining large corpora of generic language
• The goal is accurate language representation as a proxy for downstream performance
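A quick way to see this behaviour with pre-trained vectors, sketched with gensim (assumes the glove-wiki-gigaword-100 vectors can be downloaded):

```python
import gensim.downloader as api

# pre-trained GloVe vectors (downloaded on first use)
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman ~= queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```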
“Word” Embeddings

Examples: Word2vec, GloVe, fastText

In Practice: a lookup table from token to vector
“great” → [0.1, 0.3, …]
… → …

Training (see the sketch below):
• CBOW: "The quick brown fox _____ over the lazy dog" (predict the word from its context)
• Skip Gram: "___ ___ ____ ___ jumps ___ __ ___ ___" (predict the context from the word)
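A minimal training sketch with gensim's Word2Vec, where sg=1 selects skip-gram and sg=0 CBOW (parameter names follow gensim 4.x; the corpus here is a toy stand-in):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "dog", "sleeps"],
]

# skip-gram model with 50-dimensional vectors over the toy corpus
model = Word2Vec(sentences, vector_size=50, window=4, min_count=1, sg=1, epochs=50)

print(model.wv["fox"][:5])                  # the learned vector for "fox"
print(model.wv.most_similar("dog", topn=2)) # nearest neighbours in the embedding space
```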
Do They Really Preserve Algorithmic Value?
• Embeddings generally outperform raw text at low data volumes
• Leveraging large, generic text corpora improves generalizability
• This is 4-year-old tech. Embeddings have improved drastically. Text has not.

Reported numbers are the average of 5 runs of randomly sampled test/train splits, each reporting the average of a 5-fold CV within which logistic regression hyperparameters are optimized. Generated using Enso.

[Chart: GloVe benchmark (movie review sentiment analysis); accuracy from 0.5 to 0.9 vs. number of data points from 50 to 500; series: tf-idf, GloVe]
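A rough sketch of the kind of comparison behind this chart: average pre-trained GloVe vectors per document, then evaluate a logistic regression with cross-validation, exactly as you would with tf-idf features. The texts and labels below are toy stand-ins, and the full Enso protocol (repeated runs, hyperparameter search) is omitted:

```python
import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# pre-trained 100-dimensional GloVe vectors (downloaded on first use)
glove = api.load("glove-wiki-gigaword-100")

def embed(text):
    """Average the GloVe vectors of in-vocabulary tokens (zeros if none match)."""
    vecs = [glove[tok] for tok in text.lower().split() if tok in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(glove.vector_size)

# toy stand-ins for a small labelled sample of movie reviews
texts = ["a wonderful moving film", "charming and beautifully acted",
         "an instant classic", "i loved every minute of it",
         "dull and painfully slow", "a complete mess",
         "i wanted my money back", "tedious from start to finish"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = np.vstack([embed(t) for t in texts])
clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, labels, cv=4).mean())  # the real benchmark uses 5-fold CV over 5 runs
```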
Problems with
Small Data
Add Linguistic Context (Semantics)
Add Local Context
Prevent Out of Vocabulary Issues
Text Embeddings

Examples: Doc2vec, ELMo, ULMFiT

In Practice: often built on top of pre-trained word embeddings

Training:
[Figure: "The quick brown fox jumps over the lazy" with each token mapped to its embedding vector; a language-modeling head predicts the next token ("dog") while a supervised head predicts the task label ("True")]
Text Embeddings: CNN-Style
[Figure: the token embeddings for "The quick brown fox jumps over the lazy" fed through convolutional layers to produce a prediction]
Example: https://arxiv.org/pdf/1408.5882.pdf
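A minimal PyTorch sketch in the spirit of the Kim (2014) paper linked above (not its exact implementation): convolutions of a few widths over the token embeddings, max-over-time pooling, then a linear classifier:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Minimal CNN-style sentence classifier over (optionally pre-trained) word embeddings."""
    def __init__(self, vocab_size, embed_dim=300, num_filters=100, kernel_sizes=(3, 4, 5), num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # could be initialized from GloVe
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes]
        )
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]  # max-over-time pooling
        return self.fc(torch.cat(pooled, dim=1))       # (batch, num_classes)

# toy usage: a batch of 2 "sentences" of 8 token ids each
logits = TextCNN(vocab_size=10000)(torch.randint(0, 10000, (2, 8)))
```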
Text Embeddings: RNN-Style
[Figure: the token embeddings for "The quick brown fox jumps over the lazy" processed left to right; at each step a memory vector (initialized to zeros) is combined with the current token through a non-linearity σ to produce an output and an updated memory, ending in a prediction]
Example: https://arxiv.org/pdf/1802.05365.pdf
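A correspondingly minimal recurrent sketch: an LSTM carries the memory vector (initialized to zeros) across tokens, and the final hidden state feeds the classifier. ELMo itself (linked above) stacks bidirectional language-model layers; this only shows the basic recurrent shape:

```python
import torch
import torch.nn as nn

class TextRNN(nn.Module):
    """Minimal recurrent text classifier over word embeddings."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=256, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # could be initialized from GloVe
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)          # (batch, seq_len, embed_dim)
        outputs, (h_n, c_n) = self.lstm(x)     # h_n: final hidden "memory" per sequence
        return self.fc(h_n[-1])                # classify from the last hidden state

logits = TextRNN(vocab_size=10000)(torch.randint(0, 10000, (2, 8)))
```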
Add Linguistic Context (Semantics)
Add Local Context
Prevent Out of Vocabulary Issues
Problems with
Small Data
The Power of Context

"We used a bytepair encoding (BPE) vocabulary… significantly improving upon the state of the art in 9 out of the 12 tasks studied"
- Improving Language Understanding by Generative Pre-Training*

* https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
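The practical effect of BPE is that unseen words never fall out of vocabulary; they are broken into known subword units instead. A toy illustration with a hand-picked merge table (hypothetical merges, not the GPT vocabulary, which uses tens of thousands of learned merges):

```python
# toy byte-pair encoding: greedily apply learned merges to a character sequence
merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("n", "e"), ("ne", "w")]  # hypothetical learned merges

def bpe_encode(word):
    symbols = list(word)
    for a, b in merges:                          # apply merges in learned order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]       # fuse the pair into one symbol
            else:
                i += 1
    return symbols

print(bpe_encode("lower"))   # ['low', 'er']
print(bpe_encode("newest"))  # ['new', 'e', 's', 't'] -- unseen word, no OOV token needed
```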
Problems with
Small Data
Add Linguistic Context (Semantics)
Add Local Context
Prevent Out of Vocabulary Issues
Do They Really Preserve Algorithmic Value?
• Newer transfer learning techniques have made deep learning at low data volumes tractable
• Even when operating on top of byte-pair encodings, sufficient context is retained to achieve state-of-the-art performance
• 4x error reduction over tf-idf

Reported numbers are the average of 5 runs of randomly sampled test/train splits, each reporting the average of a 5-fold CV within which logistic regression hyperparameters are optimized. Generated using Enso.

[Chart: Finetune benchmark (movie review sentiment analysis); accuracy from 0.5 to 0.9 vs. number of data points from 50 to 500; series: tf-idf, GloVe, Finetune]
Treat it like any other feature vector
Thank You
SLATER VICTOROFF
slater@indico.io