OReilly AI Transfer Learning

O'Reilly Artificial Intelligence Conference San Francisco 2018
How to use transfer learning to bootstrap image
classification and question answering (QA)
Danielle Dean PhD, Wee Hyong Tok PhD
Principal Data Scientist Lead
Microsoft
@danielleodean | @weehyong
Inspired by “Transfer Learning: Repurposing ML Algorithms from Different Domains to Cloud Defense”, Mark Russinovich, RSA Conference 2018

Textbook ML development
1. Choosing the Learning Task
2. Defining Data Input
3. Applying Data Transforms
4. Choosing the Learner
5. Choosing Output
6. Choosing Run Options
7. View Results
8. Debug and Visualize Errors
9. Analyze Model Predictions

Fact | Industry grade ML solutions are highly exploratory
The full pipeline, from Choosing the Learning Task through Analyze Model Predictions, is repeated for every attempt: Attempt 1, Attempt 2, Attempt 3, Attempt 4, and so on up to Attempt n.

Traditional versus Transfer learning
• Traditional Machine Learning: different tasks, each with its own learning system
• Transfer Learning: knowledge from the source tasks feeds the learning system for the target task
Source: "A survey on transfer learning", Pan, Sinno Jialin, and Qiang Yang. IEEE Transactions on Knowledge and Data Engineering

Why are we talking about transfer learning?
Figure: Drivers of ML success in industry. Commercial success is plotted over time (2016 marked) for supervised learning, transfer learning, unsupervised learning, and reinforcement learning.
Source: “Transfer Learning - Machine Learning's Next Frontier”, Ruder, Sebastian

Transfer Learning in Computer Vision
Can we leverage knowledge of processing images to help with new tasks?
• What’s in the picture?
• Where is the bike located?
• Can you find a similar bike?
• How many bikes are there?

Before Deep Learning
• Researchers took a traditional machine learning approach
• Manual creation of a variety of different visual feature extractors
• Followed by traditional ML classifiers
• Features not very generalizable to other vision tasks – not easy to transfer
• Example: HoG Detectors
- Histogram of oriented gradients (HoG) features
- Sliding window detector
- SVM Classifier
- Very fast OpenCV implementation (<100ms)
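To make the hand-crafted feature step concrete, here is a minimal sketch of HoG-style cell histograms in NumPy. It is a simplification and not the OpenCV implementation: it skips block normalization, bin interpolation, the sliding window, and the SVM. The function name `hog_cell_histograms` and its defaults are illustrative assumptions.

```python
import numpy as np

def hog_cell_histograms(image, cell=8, bins=9):
    """Return one orientation histogram per cell, weighted by gradient magnitude."""
    image = np.asarray(image, dtype=float)
    gy, gx = np.gradient(image)              # gradients along rows (y) and columns (x)
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, 180) degrees.
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    bin_index = np.minimum((angle / (180.0 / bins)).astype(int), bins - 1)

    rows, cols = image.shape[0] // cell, image.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            block = (slice(r * cell, (r + 1) * cell),
                     slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(bin_index[block].ravel(),
                                     weights=magnitude[block].ravel(),
                                     minlength=bins)
    return hist

# Toy input: a vertical edge (left half dark, right half bright).
edge = np.zeros((16, 16))
edge[:, 8:] = 1.0
features = hog_cell_histograms(edge).ravel()   # flat vector a classifier could consume
```

All gradient energy of the vertical edge lands in the first orientation bin. Features like these describe local edge structure well, but they are designed by hand for one task, which is why they transfer poorly.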

ImageNet
14,197,122 images
21,841 synsets
Diverse images, lots of labels!

Transfer Learning for Computer Vision
1. Train a model using data from ImageNet
2. This gives a Deep Learning Model for Computer Vision
3. Apply the model to other domains (Retail, Manufacturing)

Example – Visualizing the different layers
Source: Olah, et al., "Feature Visualization", Distill, 2017
https://distill.pub/2017/feature-visualization/
More fun sites:
https://deepart.io/nips/submissions/random/
http://cs231n.stanford.edu/


Clothing texture dataset:
• 1716 images from Bing which were manually annotated
• Textures shown: Striped, Argyle, Dotted

Transfer Learning – How to get started?
Type | How to Initialize Featurization Layers | Output Layer Initialization | How is Transfer Learning used? | How to Train?
Standard DNN | Random | Random | None | Train featurization and output jointly
Headless DNN | Learn using another task | Separate ML algorithm | Use the features learned on a related task | Use the features to train a separate classifier
Fine Tune DNN | Learn using another task | Random | Use and fine tune features learned on a related task | Train featurization and output jointly with a small learning rate
Multi-Task DNN | Random | Random | Learned features need to solve many related tasks | Share a featurization network across both tasks. Train all networks jointly with a loss function (sum of individual task loss function)
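The Multi-Task DNN row can be sketched in a few lines of NumPy: one shared featurization network, two task heads, and a joint loss that is the sum of the per-task losses. This shows only the forward pass. The layer sizes, the mean-squared-error loss, and all names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared featurization network (one hidden layer) and two task-specific heads.
W_shared = rng.normal(size=(4, 8))
W_task_a = rng.normal(size=(8, 1))
W_task_b = rng.normal(size=(8, 1))

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def multi_task_loss(x, y_a, y_b):
    features = np.tanh(x @ W_shared)          # computed once, shared by both tasks
    loss_a = mse(features @ W_task_a, y_a)
    loss_b = mse(features @ W_task_b, y_b)
    return loss_a + loss_b, loss_a, loss_b    # joint loss = sum of task losses

# Toy data (random, for shape checking only).
x = rng.normal(size=(16, 4))
y_a = rng.normal(size=(16, 1))
y_b = rng.normal(size=(16, 1))
total, loss_a, loss_b = multi_task_loss(x, y_a, y_b)
```

Because both heads read the same features, a gradient step on the summed loss updates `W_shared` using signal from both tasks, which is what pushes the shared features to be useful across them.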

Pre-Built CNN from General Task on Millions of Images
Layers, from input to output:
1. Low-Level Features (lines, edges, color fields, etc.)
2. High-Level Features (corners, contours, simple shapes)
3. Object Parts (wheels, faces, windows, etc.)
4. Complex Objects & Scenes (people, animals, cars, beach scene, etc.)
5. Output Layer (cat? YES, dog? NO, car? NO)
Output Layer Stripped: a separate classifier (e.g. SVM) answers the new question (dotted?)
Outputs of penultimate layer of ImageNet Trained CNN provide excellent general purpose image features
Pre-Built CNN from General Task on Millions of Images
Output Layer Stripped (cat? YES, dog? NO, car? NO)
Train one or more layers in new network for the new question (dotted?)
Using a pre-trained DNN, an accurate model can be achieved with thousands (or fewer) of labeled examples instead of millions
Transfer Learning Results - Texture Dataset
DNN featurization
Input Image Size: 224x224 pixels
Area Under Curve: 0.59
Classification Accuracy: 69.0%
Fine-tuning (full CNN)
Fine-tuning (full CNN)

Transfer Learning for Similarity
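One common way to use transferred features for similarity, sketched here under the assumption that each image has already been mapped to a feature vector by a pre-trained network, is to rank gallery images by cosine similarity to the query. The vectors below are hypothetical toy values, not real CNN outputs.

```python
import numpy as np

def most_similar(query, gallery):
    """Return gallery indices sorted from most to least similar (cosine similarity)."""
    query = query / np.linalg.norm(query)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = gallery @ query
    return np.argsort(-scores), scores

# Hypothetical feature vectors; in practice these would come from a pre-trained CNN.
gallery = np.array([[1.0, 0.0, 0.0],
                    [0.9, 0.1, 0.0],
                    [0.0, 1.0, 0.0]])
query = np.array([0.8, 0.2, 0.0])
ranking, scores = most_similar(query, gallery)
```

Here the second gallery item ranks first because its direction is closest to the query's. No labels are needed: the pre-trained features alone define what "similar" means.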

Full code:
https://github.com/miguelgfierro/sciblog_support/blob/master/A_Gentle_Introduction_to_Transfer_Learning/Intro_Transfer_Learning.ipynb
• Hymenoptera, 2 classes and 397 images.
• Simpsons, 20 classes (subset of total) and 19548 images.
• Dogs vs Cats, 2 classes and 25000 images.
• Caltech 256, 257 classes and 30607 images.

Figure: validation accuracy with fine-tuning, validation accuracy with freezing, and number of images for each dataset (Hymenoptera, Simpsons, Dogs vs Cats, Caltech256), each in color and in grayscale.

Example Applications in Computer Vision
• Aerial Use Classification
• ESmart – Connected Drone
• Jabil – Defect Inspection
• Lung Cancer Detection
• Distributed deep domain adaptation for automated poacher detection

https://github.com/MattKleinsmith/void-detector

Label Noise
Read more details: https://www.microsoft.com/en-us/research/blog/using-transfer-learning-to-address-label-noise-for-large-scale-image-classification/

Traditional Method: Manual Verification

Applying Transfer Learning

Computer Vision is not a “solved problem”
The knowledge being “transferred” can be very useful, but it is not the same as how humans learn to see

Recap: Transfer Learning for Image Classification
1. Define the Learning Task
2. Identify a pre-trained model
3. Decide whether to further fine-tune or use it as a headless DNN
4. Freeze top layers, re-train the classifier
5. Validate the model
6. Deploy the model

Deep Learning on Different Types of Data
• Audio Spectrograms: rich, high-dimensional datasets
• Images: rich, high-dimensional datasets
• Text: sparse data (depends on the encoding), e.g. "I see a big cat"

How do we apply
Transfer Learning to NLP?

Different Types of NLP Tasks
And many more...

Transfer Learning for Text
1. Define the Learning Task
2. Identify a pre-trained model
3. Decide whether to further fine-tune
4. Freeze top layers, re-train the classifier
5. Validate the model
6. Deploy the model
Questions specific to text: What kind of pre-trained model? What does the top layer encode?

Word Embeddings
Male - Female Verb Tense Country - Capital
Source: TensorFlow Tutorial - https://www.tensorflow.org/tutorials/representation/word2vec
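The relationships named above (male/female, verb tense, country/capital) show up as roughly constant vector offsets, which is what makes analogy arithmetic work. The sketch below uses hand-made three-dimensional toy vectors, which are an assumption for illustration. Real embeddings are learned and have hundreds of dimensions.

```python
import numpy as np

# Hypothetical toy vectors: axes ~ (royalty, gender, youth).
vectors = {
    "king":     np.array([1.0,  1.0, 0.0]),
    "queen":    np.array([1.0, -1.0, 0.0]),
    "man":      np.array([0.0,  1.0, 0.0]),
    "woman":    np.array([0.0, -1.0, 0.0]),
    "prince":   np.array([1.0,  1.0, 1.0]),
    "princess": np.array([1.0, -1.0, 1.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c):
    """Solve a : b :: c : ? as the nearest cosine neighbor of b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = analogy("man", "woman", "king")   # man is to woman as king is to ?
```

With these toy vectors the offset `woman - man` flips only the gender axis, so adding it to `king` lands exactly on `queen`.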

Word Embeddings
2013 2014-2015 2017

Using Pre-trained Embeddings
Text Classification using 20 Newsgroup dataset
Source: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
import os
import numpy as np

# Compute an index mapping words to known embeddings.
# GLOVE_DIR, EMBEDDING_DIM and word_index are defined earlier in the source post.
embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# Compute the embedding matrix: row i holds the vector for the word with index i.
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
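The snippet above relies on a `word_index` that the slide does not show. The source post builds it with the Keras tokenizer. The pure-Python sketch below only illustrates the idea (most frequent word gets id 1, id 0 is reserved for padding), and the helper names are hypothetical.

```python
from collections import Counter

def build_word_index(texts, max_words=None):
    """Map each word to an integer id, most frequent first; id 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common(max_words))}

def to_padded_sequences(texts, word_index, max_len):
    """Convert texts to fixed-length id lists, padding with 0 and truncating on the left."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        ids = ids[-max_len:]                                  # keep the last max_len ids
        sequences.append([0] * (max_len - len(ids)) + ids)    # left-pad with 0
    return sequences

texts = ["the cat sat", "the dog sat on the mat"]
word_index = build_word_index(texts)
data = to_padded_sequences(texts, word_index, max_len=5)
```

Reserving id 0 for padding is why the embedding matrix has `len(word_index) + 1` rows: row 0 stays all-zeros and is never assigned to a real word.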

from keras.layers import Embedding

# Load the embedding matrix into an Embedding layer.
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)  # prevent weights from being updated during training

from keras.layers import Input, Conv1D, MaxPooling1D, Flatten, Dense
from keras.models import Model

# Build a small 1D convnet to solve the classification problem.
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)  # global max pooling
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=2, batch_size=128)
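The comment "global max pooling" on `MaxPooling1D(35)` only holds for a particular input length. The arithmetic below traces the sequence length through the network, assuming `MAX_SEQUENCE_LENGTH = 1000` (a value not stated on the slide) and the Keras defaults of 'valid' padding, stride 1 for convolutions, and a pooling stride equal to the pool size.

```python
def conv1d_len(n, kernel):
    """Output length of a 1D convolution with 'valid' padding and stride 1."""
    return n - kernel + 1

def pool1d_len(n, pool):
    """Output length of non-overlapping max pooling (stride == pool size)."""
    return n // pool

MAX_SEQUENCE_LENGTH = 1000   # assumed value, not stated on the slide

n = MAX_SEQUENCE_LENGTH
trace = []
for kernel, pool in [(5, 5), (5, 5), (5, 35)]:
    n = conv1d_len(n, kernel)
    trace.append(n)
    n = pool1d_len(n, pool)
    trace.append(n)
# trace == [996, 199, 195, 39, 35, 1]
```

The third convolution leaves exactly 35 time steps, so a pool of size 35 collapses them to one, which is what makes the last pooling layer global. With a different input length, the pool size 35 would need to change.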

From initializing the first layers to pre-
training the entire model
(and learning higher level semantic concepts)

Transfer Learning for NLP - ULMFiT
Source: Universal Language Model Fine-tuning for Text Classification, Jeremy Howard, Sebastian Ruder, ACL 2018
1. Train a Language Model using Large General Domain Corpus
2. Fine-tune the Language Model
3. Fine-tune Classifier

Transfer Learning for NLP - ELMo
Source: Deep contextualized word representations, Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer, NAACL 2018
1. Train biLMs on a corpus
2. Enhance the usual inputs with ELMo representations, one per token (e.g. "have a nice")

ELMo Pre-trained Models
Source: https://allennlp.org/elmo

Using ELMo with TensorFlow Hub
Source: https://www.tensorflow.org/hub/modules/google/elmo/2
import tensorflow_hub as hub

# Option 1: untokenized sentences
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)
embeddings = elmo(
    ["the cat is on the mat", "dogs are in the fog"],
    signature="default",
    as_dict=True)["elmo"]

# Option 2: tokens (pad shorter sentences with "" and pass the true lengths)
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)
tokens_input = [["the", "cat", "is", "on", "the", "mat"],
                ["dogs", "are", "in", "the", "fog", ""]]
tokens_length = [6, 5]
embeddings = elmo(
    inputs={
        "tokens": tokens_input,
        "sequence_len": tokens_length
    },
    signature="tokens",
    as_dict=True)["elmo"]

With as_dict=True, the output dictionary contains:
• Character-based word representation
• First LSTM Hidden State
• Second LSTM Hidden State
• elmo (weighted sum of 3 layers)
• Fixed mean-pooling of contextualized word representation
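The last two outputs can be written out in NumPy. The "elmo" output combines the three layers with softmax-normalized weights and a scale factor, as described in the ELMo paper, and the pooled output averages token vectors over the real (non-padding) tokens. The function names and toy activations below are assumptions for illustration. In the real module the layer weights are learned.

```python
import numpy as np

def elmo_combine(layers, weights, gamma=1.0):
    """Collapse (num_layers, seq_len, dim) to (seq_len, dim) using softmax-normalized weights."""
    s = np.exp(weights - np.max(weights))
    s = s / s.sum()
    return gamma * np.tensordot(s, layers, axes=1)

def mean_pool(token_vectors, length):
    """Fixed mean-pooling over the first `length` (non-padding) tokens."""
    return token_vectors[:length].mean(axis=0)

# Toy activations: 3 layers, 4 tokens, 2 dimensions (hypothetical values).
layers = np.stack([np.full((4, 2), 1.0),
                   np.full((4, 2), 2.0),
                   np.full((4, 2), 3.0)])
combined = elmo_combine(layers, weights=np.zeros(3))   # equal weights give a plain average
sentence_vector = mean_pool(combined, length=3)
```

With equal weights the combination is just the average of the three layers. Letting a downstream task learn `weights` and `gamma` allows it to emphasize whichever layer carries the most useful signal.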

Transfer Learning for MRC tasks
Source: Transfer Learning for Machine Reading Comprehension - https://bit.ly/2Cmiffy

Transfer Learning for MRC
1. Train a model using data from Wikipedia
2. This gives the MRC Model
3. Apply the model to other domains (News Articles, Customer Support Data)

SQuAD
Stanford Question Answering Dataset (SQuAD)
• Reading comprehension dataset
• Based on Wikipedia articles
• Crowdsourced questions
• Question-answer pairs
• The answer is a text segment, or span, from the corresponding reading passage, or no answer is found
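Because every answer is a span of the passage, a question-answer pair can be stored as the answer text plus its character offset. The record below is a simplified, hand-written example in the spirit of the SQuAD format, not an actual dataset entry. The exact-match check is also simplified, since the official evaluation additionally normalizes punctuation and articles.

```python
record = {
    "context": "Transfer learning reuses a model trained on one task for another task.",
    "question": "What does transfer learning reuse?",
    "answers": [{"text": "a model trained on one task", "answer_start": 25}],
    "is_impossible": False,
}

unanswerable = {
    "context": record["context"],
    "question": "Who invented transfer learning?",
    "answers": [],
    "is_impossible": True,
}

def extract_answer(rec):
    """Return the gold answer span, or None when no answer exists in the passage."""
    if rec["is_impossible"] or not rec["answers"]:
        return None
    answer = rec["answers"][0]
    start = answer["answer_start"]
    return rec["context"][start:start + len(answer["text"])]

def exact_match(prediction, rec):
    """Simplified exact match: case-insensitive comparison after trimming whitespace."""
    gold = extract_answer(rec)
    if gold is None:
        return prediction is None
    return prediction is not None and prediction.strip().lower() == gold.strip().lower()
```

A model for this task therefore predicts a start and end position in the passage (or "no answer") instead of generating free text.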

Transfer Learning for MRC using SynNet
1. Train using a large MRC Dataset (e.g. SQuAD)
2. Apply the pre-trained model to a new domain (e.g. NewsQA)
3. Validate the model
4. Deploy the model
More comparisons between different MRC approaches: Transfer Learning for MRC – Survey - https://bit.ly/2JAt1h0

SynNet
Stage 1 – Answer Synthesis module: uses a bi-directional LSTM to predict IOB tags on the input paragraph. Marks out semantic concepts that are likely answers.
Stage 2 – Question Synthesis module: uses a uni-directional LSTM to generate the questions.
Source: ACL 2017, https://www.microsoft.com/en-us/research/publication/two-stage-synthesis-networks-transfer-learning-machine-comprehension/
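To show what Stage 1's output looks like downstream, here is a small sketch that turns a sequence of IOB tags (B = begin, I = inside, O = outside) into candidate answer strings. The sentence and tags are hand-written for illustration and were not produced by SynNet. The LSTM that would predict the tags is not modeled.

```python
def iob_to_spans(tokens, tags):
    """Group tokens tagged B/I into candidate answer strings; O closes any open span."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:                      # a new span starts right after another one
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:                                # "O", or a stray "I" with no open span
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

tokens = ["SynNet", "was", "presented", "at", "ACL", "2017", "by", "Microsoft", "Research"]
tags   = ["B",      "O",   "O",         "O",  "B",   "I",    "O",  "B",         "I"]
candidates = iob_to_spans(tokens, tags)
```

Each candidate span would then be handed to Stage 2, which generates a question whose answer is that span, giving synthetic question-answer pairs for the new domain.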

SynNet – Question/Answer Generation Example

Summary
1. Transfer Learning and Applications
2. How to use Transfer Learning for Image Classification
3. How to use Transfer Learning for NLP tasks

Thank You!
