Multi modal retrieval and generation with deep distributed models

@graphific
Roelof Pieters
Multi-modal Retrieval and Generation with Deep Distributed Models
26 April 2016  
KTH
www.csc.kth.se/~roelof/
roelof@kth.se

Creative AI > a “brush” > rapid experimentation
human-machine collaboration

[Karlgren 2014, NLP Sthlm Meetup]
Digital Media Deluge: text

[ http://lexicon.gavagai.se/lookup/en/lol ]
Digital Media Deluge: text
lol ?
…

[YouTube Blog, 2010]
Digital Media Deluge: video

[Reelseo, 2015]
Digital Media Deluge: video

[Reelseo, 2015]
Digital Media Deluge: audio

[Reelseo, 2015]
Digital Media Deluge: audio

Challenges
• Volume
• Velocity
• Variety

Can we make it searchable?
Language

Language: Compositionality
Principle of compositionality:
the “meaning (vector) of a complex expression (sentence) is determined by:
- the meanings of its constituent expressions (words) and
- the rules (grammar) used to combine them”
(Gottlob Frege, 1848-1925)

Word Representation
• NLP (rule-based and statistical approaches at least) mainly treats words as atomic symbols:
Love Candy Store
• or in vector space:
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 …]
• also known as “one hot” representation.
• Its problem?
Candy [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 …] AND
Store [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 …] = 0 !
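The problem with one-hot vectors can be checked directly. A minimal sketch, assuming a toy five-word vocabulary; the vocabulary and the helper names `one_hot` and `dot` are illustrative and not from the slides:

```python
# Toy vocabulary; the words and their order are assumptions for illustration.
vocab = ["love", "candy", "store", "shop", "sweets"]

def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    return [1 if w == word else 0 for w in vocab]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

candy, store = one_hot("candy"), one_hot("store")

# Distinct words never share an active position, so their overlap is always 0.
print(dot(candy, store))  # 0
print(dot(candy, candy))  # 1
```

Every pair of different words scores 0, so the representation carries no notion of similarity between "candy" and "store", or between any other two words.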

Distributional semantics
Distributional meaning as co-occurrence vector:
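A co-occurrence vector can be built by counting which words appear near a target word. A minimal sketch, assuming a made-up three-sentence corpus and a symmetric window of two words; the function name `cooccurrence` and the window size are assumptions:

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real model would use a very large text collection.
corpus = [
    "i love the candy store",
    "i love the sweet shop",
    "the candy store sells sweets",
]

def cooccurrence(sentences, window=2):
    """Count, for each word, the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

counts = cooccurrence(corpus)
# The Counter for "candy" is its (sparse) distributional vector.
print(counts["candy"])
```

Words that occur in similar contexts end up with similar count vectors, which is the property the later embedding models exploit.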

Deep Distributional representations
• Taking it further:
• Continuous word embeddings
• Combine vector space semantics with the
prediction of probabilistic models
• Words are represented as a dense vector:
Candy =
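Dense vectors can be compared with cosine similarity. A minimal sketch, assuming hypothetical 4-dimensional vectors; the values are invented for illustration and do not come from a trained model:

```python
import math

# Hypothetical 4-dimensional embeddings; trained models use far more dimensions.
embeddings = {
    "candy": [0.29, 0.79, -0.18, 0.11],
    "sweets": [0.31, 0.70, -0.22, 0.05],
    "store": [-0.40, 0.12, 0.66, 0.30],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Unlike one-hot vectors, dense vectors give graded similarities.
print(cosine(embeddings["candy"], embeddings["sweets"]))
print(cosine(embeddings["candy"], embeddings["store"]))
```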

Neural Networks for NLP
• Can theoretically (given enough units) approximate “any” function
• and fit to “any” kind of data
• Efficient for NLP: hidden layers can be used as word lookup tables
• Dense distributed word vectors + efficient NN training algorithms:
• Can scale to billions of words!
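The "lookup table" remark can be illustrated: multiplying a one-hot vector by a weight matrix selects one row of that matrix. A minimal sketch, assuming a hypothetical three-word vocabulary and a 2-dimensional embedding matrix `E` with invented values:

```python
# Hypothetical vocabulary and embedding matrix (one row per word); values invented.
vocab = {"love": 0, "candy": 1, "store": 2}
E = [
    [0.1, 0.9],   # love
    [0.7, 0.2],   # candy
    [0.4, 0.5],   # store
]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def vecmat(x, M):
    """Compute x @ M for a row vector x and a matrix M given as a list of rows."""
    return [sum(x[i] * M[i][j] for i in range(len(M))) for j in range(len(M[0]))]

idx = vocab["candy"]
via_matmul = vecmat(one_hot(idx, len(vocab)), E)  # the linear-layer view
via_lookup = E[idx]                               # the lookup-table view
print(via_matmul == via_lookup)  # True
```

In practice the row is fetched by index, which avoids the multiplication entirely and is what makes large vocabularies affordable.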

Word Embeddings: Socher
Vector Space Model
adapted from Bengio, “Representation Learning and Deep Learning”, July, 2012, UCLA
In a perfect world:

adapted from Bengio, “Representation Learning and Deep Learning”, July, 2012, UCLA
In a perfect world:
the country of my birth
the place where I was born

Figure (edited) from Bengio, “Representation Learning and Deep Learning”, July, 2012, UCLA
In a perfect world:
the country of my birth
the place where I was born ?
…

Word Embeddings: Turian (2010)
Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning
code & info: http://metaoptimize.com/projects/wordreprs/

Word Embeddings: Collobert & Weston (2011)
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P. (2011).
Natural Language Processing (almost) from Scratch

Multi-embeddings: Stanford (2012)
Eric H. Huang, Richard Socher, Christopher D. Manning, Andrew Y. Ng (2012) 
Improving Word Representations via Global Context and Multiple Word Prototypes

Linguistic Regularities: Mikolov (2013)
code & info: https://code.google.com/p/word2vec/
Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations
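The linguistic regularities reported in this paper show up as vector offsets, for example vec("king") - vec("man") + vec("woman") landing near vec("queen"). A minimal sketch, assuming hand-made 3-dimensional toy vectors; the values and the helper `analogy` are illustrative, not learned embeddings or word2vec code:

```python
import math

# Hand-made toy vectors (dimensions loosely "male", "royal", "fruit"); not learned.
vectors = {
    "king":  [0.9, 0.8, 0.0],
    "queen": [0.1, 0.8, 0.0],
    "man":   [0.9, 0.1, 0.0],
    "woman": [0.1, 0.1, 0.0],
    "apple": [0.0, 0.0, 1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# "man is to king as woman is to ...?"
print(analogy("man", "king", "woman"))  # queen
```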

Word Embeddings for MT: Mikolov (2013)
Mikolov, T., Le, Q. V., Sutskever, I. (2013).
Exploiting Similarities among Languages for Machine Translation
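The idea in this paper is to learn a linear map from the source-language embedding space to the target-language space using a small dictionary of word pairs. A minimal sketch, assuming invented 2-dimensional vectors in which the target space is the source space rotated by 90 degrees; the function names, learning rate and epoch count are assumptions:

```python
# Toy bilingual dictionary: hypothetical source vectors and their target-space
# counterparts (here the target space is the source space rotated by 90 degrees).
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.0, 1.0], [-1.0, 0.0]),
    ([1.0, 1.0], [-1.0, 1.0]),
]

def apply(W, x):
    """Matrix-vector product W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def fit_translation_matrix(pairs, lr=0.1, epochs=200):
    """Minimise sum ||W x - z||^2 over the dictionary with stochastic gradient descent."""
    W = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(epochs):
        for x, z in pairs:
            err = [p - t for p, t in zip(apply(W, x), z)]
            for i in range(2):
                for j in range(2):
                    W[i][j] -= lr * 2 * err[i] * x[j]
    return W

W = fit_translation_matrix(pairs)
# Map a source vector that was not in the dictionary into the target space.
print([round(v, 3) for v in apply(W, [2.0, 1.0])])
```

A word is then translated by mapping its vector and taking the nearest target-language word vector.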

Word Embeddings for MT: Kiros (2014)

Recursive Embeddings for Sentiment: Socher (2013)
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C., Ng, A., Potts, C. (2013)
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank.
code & demo: http://nlp.stanford.edu/sentiment/index.html

Paragraph Vectors: Dai et al. (2014)

Can we make it searchable?
Other modalities

• Image -> vector -> embedding ? ?
• Video -> vector -> embedding ? ?
• Audio -> vector -> embedding ? ?
Other modalities: Embeddings?

Deep Learning?
• A host of statistical machine learning techniques
• Enables the automatic learning of feature hierarchies
• Generally based on artificial neural networks

• Manually designed features are often over-specified,
incomplete and take a long time to design and validate
• Learned Features are easy to adapt, fast to learn 
• Deep learning provides a very flexible, (almost?) universal,
learnable framework for representing world, visual and
linguistic information.
• Deep learning can learn unsupervised (from raw text/
audio/images/whatever content) and supervised (with
specific labels like positive/negative)
(as summarised by Richard Socher 2014)
Deep Learning?

2006+ : The Deep Learning Conspirators

• Image -> vector -> embedding
• Video -> vector -> embedding ? ?
Image Embeddings

Convolutional Neural Nets for Images
classification demo

http://ml4a.github.io/dev/demos/demo_convolution.html

Zeiler and Fergus 2013,  
Visualizing and Understanding Convolutional Networks

Convolutional Neural Nets: Embeddings?
[-0.34, 0.28, …]
4096-dimensional fc7 AlexNet CNN
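Once every image is a vector, retrieval reduces to nearest-neighbour search. A minimal sketch, assuming 3-dimensional stand-ins for the 4096-dimensional fc7 activations; the file names, values and the `search` helper are invented for illustration:

```python
import math

# Stand-ins for 4096-d fc7 activations; the names and 3-d values are invented.
index = {
    "cat_1.jpg": [0.9, 0.1, 0.0],
    "cat_2.jpg": [0.8, 0.2, 0.1],
    "car_1.jpg": [0.0, 0.9, 0.4],
}

def normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def search(query, index, k=2):
    """Rank indexed images by cosine similarity to the query vector."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(vec))), name)
        for name, vec in index.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)[:k]]

print(search([1.0, 0.1, 0.0], index))
```

With real features the vectors would come from a forward pass through the network, and a large collection would need an approximate nearest-neighbour index rather than this linear scan.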

Convolutional Neural Nets: Embeddings?
http://ml4a.github.io/dev/demos/tsne-viewer.html

• Image -> vector -> embedding ??
• Video -> vector -> embedding
Video Embeddings

Convolutional Neural Nets for Video
3D Convolutional Neural Networks for Human Action Recognition, Ji et al., 2010

Sequential Deep Learning for Human Action Recognition, Baccouche et al., 2011

Large-scale Video Classification with Convolutional Neural Networks, Karpathy et al., 2014

Large-scale Video Classification with Convolutional Neural Networks, Karpathy et al., 2014

[Large-scale Video Classification with Convolutional Neural Networks, Karpathy et al., 2014]
[Le et al. '11]
vs classic 2d convnet:

[Large-scale Video Classification with Convolutional Neural Networks, Karpathy et al., 2014]

Sequential Deep Learning for Human Action Recognition, Baccouche et al., 2011

Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al., 2015

Beyond Short Snippets: Deep Networks for Video Classification, Ng et al., 2015

Delving Deeper into Convolutional Networks for Learning Video Representations, Ballas et al., 2016

• Image -> vector -> embedding ??
• Video -> vector -> embedding ??
• Audio -> vector -> embedding
Audio Embeddings

Zero-shot Learning
[Sander Dieleman, 2014]

Audio Embeddings
[Sander Dieleman, 2014]

• Can we take this further?
Multi Modal Embeddings?

Zero-shot Learning
• unsupervised pre-training (on many images)
• in parallel, train a neural network (language) model
• train a linear mapping between (image) representations and (word) embeddings, representing the different “classes”
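The last step, classifying through a linear map into word-embedding space, can be sketched at inference time. This assumes the mapping `M` has already been trained on seen classes only; the matrix, the label vectors and the image features are all invented for illustration:

```python
import math

# Hypothetical word embeddings for class labels; "truck" has no training images.
label_vectors = {
    "cat":   [1.0, 0.0],
    "car":   [0.0, 1.0],
    "truck": [0.2, 0.9],
}

# Assumed to have been trained on images of seen classes only (cat, car).
# Maps a 3-d image feature to the 2-d word-embedding space.
M = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.5],
]

def project(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def classify(image_feature):
    """Project the image into word space and return the nearest label."""
    z = project(M, image_feature)
    return max(label_vectors, key=lambda name: cosine(label_vectors[name], z))

print(classify([0.9, 0.1, 0.0]))  # lands near "cat"
print(classify([0.2, 0.7, 0.4]))  # lands near "truck", a class with no training images
```

The unseen class can be predicted because its label already has a position in the word-embedding space, learned from text alone.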

DeViSE model (Frome et al. 2013)
• skip-gram text model on a Wikipedia corpus of 5.7 million documents (5.4 billion words), following the approach of (Mikolov et al. ICLR 2013)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., Ranzato, M.A. (2013)
DeViSE: A Deep Visual-Semantic Embedding Model

Encoder-Decoder pipeline (Kiros et al 2014)
Encoder: A deep convolutional network (CNN) and long short-term memory recurrent network (LSTM) for learning a joint image-sentence embedding.
Decoder: A new neural language model that combines structure and content vectors for generating words one at a time in sequence.
Kiros, R., Salakhutdinov, R., Zemel, R. S. (2014)
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

Encoder-Decoder pipeline (Kiros et al 2014)
• matches state-of-the-art performance on Flickr8K and Flickr30K without using object detections
• new best results when using the 19-layer Oxford convolutional network.
• linear encoders: learned embedding space captures multimodal regularities (e.g. *image of a blue car* - "blue" + "red" is near images of red cars)
Kiros, R., Salakhutdinov, R., Zemel, R. S. (2014)
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

Image-Text Embeddings
Socher et al (2013) Zero Shot Learning Through Cross-Modal Transfer (info)

Image-Captioning
• Andrej Karpathy, Li Fei-Fei, 2015. Deep Visual-Semantic Alignments for Generating Image Descriptions (pdf) (info) (code)
• Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, 2015. Show and Tell: A Neural Image Caption Generator (arxiv)
• Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (arxiv) (info) (code)

“A person riding a motorcycle on a dirt road.”???
Image-Captioning

“Two hockey players are fighting over the puck.”???
Image-Captioning

Generative Models
• Let’s turn it around!
• Generative Models
• (we won't cover, but common architectures):
• Auto encoders (AE), variational variants: VAE
• Generative Adversarial Nets (GAN)
• Variational Recurrent Neural Net (VRNN)

Wanna Play ?
Text generation (RNN)
Karpathy (2015), The Unreasonable Effectiveness of Recurrent Neural Networks (blog)

Wanna Play ?
Text generation
Karpathy (2015), The Unreasonable Effectiveness of Recurrent Neural Networks (blog)

“A stop sign is flying in blue skies.”
“A herd of elephants flying in the blue skies.”
Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, Ruslan
Salakhutdinov, 2015. Generating Images from Captions
with Attention (arxiv) (examples)
Caption -> Image generation

Turn Convnet Around: “Deep Dream”
Image -> NN -> What do you (think) you see
-> What's the (text) label
Image -> NN -> What do you (think) you see ->
feed back activations ->
optimize image to “fit” the ConvNet's
“hallucination” (iteratively)
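The loop above amounts to gradient ascent on the input image rather than on the weights. A minimal sketch, assuming a stand-in "network" consisting of a single tanh unit with fixed, invented weights and a three-pixel "image"; a real Deep Dream run would use the activations of a trained ConvNet layer instead:

```python
import math

# A stand-in "network": one unit with fixed, made-up weights and a tanh activation.
weights = [0.5, -0.3, 0.8]

def activation(image):
    return math.tanh(sum(w * p for w, p in zip(weights, image)))

def gradient(image):
    """d activation / d pixel = (1 - tanh^2) * weight."""
    a = activation(image)
    return [(1.0 - a * a) * w for w in weights]

def dream(image, steps=20, lr=0.5):
    """Gradient ascent on the input: nudge pixels so the unit fires more strongly."""
    image = list(image)
    for _ in range(steps):
        image = [p + lr * g for p, g in zip(image, gradient(image))]
    return image

start = [0.1, 0.1, 0.1]
result = dream(start)
print(activation(start), activation(result))
```

The weights never change; only the image does, which is why the result shows what the network "wants" to see.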

see also: www.csc.kth.se/~roelof/deepdream/ (code) (youtube)
Roelof Pieters 2015

https://www.flickr.com/photos/graphific/albums/72157657250972188
Single Units

Inter-modal: “Style Net”
Leon A. Gatys, Alexander S. Ecker, Matthias Bethge , 2015.  
A Neural Algorithm of Artistic Style (GitXiv)
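In the Gatys et al. paper, style is captured by correlations between feature channels (Gram matrices), and the style loss compares those matrices between the generated image and the style image. A minimal sketch, assuming invented 2-channel feature maps flattened over four positions and an unnormalised loss; a real implementation would take the features from several layers of a trained ConvNet:

```python
# Hypothetical feature maps: 2 channels, each flattened over 4 spatial positions.
features = [
    [1.0, 0.0, 2.0, 1.0],   # channel 0
    [0.0, 1.0, 1.0, 3.0],   # channel 1
]

def gram_matrix(F):
    """G[i][j] = inner product of channel i and channel j over all positions."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in F] for fi in F]

def style_loss(F_generated, F_style):
    """Sum of squared differences between the two Gram matrices (unnormalised)."""
    G, A = gram_matrix(F_generated), gram_matrix(F_style)
    return sum((g - a) ** 2 for rg, ra in zip(G, A) for g, a in zip(rg, ra))

# Shuffling spatial positions changes the layout but not the Gram matrix.
shuffled = [[row[i] for i in (3, 0, 2, 1)] for row in features]
print(gram_matrix(features))
print(style_loss(shuffled, features))  # 0.0
```

Because the Gram matrix sums over positions, it ignores where things are in the image, which is why it describes texture and style rather than content.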

https://github.com/alexjc/neural-doodle
Neural Doodle

Gene Kogan, 2015. Why is a Raven Like a Writing Desk? (vimeo)

• Image Analogies, 2001, A. Hertzmann, C. Jacobs, N. Oliver, B. Curless, D. Sales
• A Neural Algorithm of Artistic Style, 2015. Leon A. Gatys, Alexander S. Ecker,
Matthias Bethge
• Combining Markov Random Fields and Convolutional Neural Networks for Image
Synthesis, 2016, Chuan Li, Michael Wand
• Semantic Style Transfer and Turning Two-Bit Doodles into Fine Artworks, 2016, Alex J.
Champandard
• Texture Networks: Feed-forward Synthesis of Textures and Stylized Images, 2016,
Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, Victor Lempitsky
• Perceptual Losses for Real-Time Style Transfer and Super-Resolution, 2016, Justin
Johnson, Alexandre Alahi, Li Fei-Fei
• Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial
Networks, 2016, Chuan Li, Michael Wand
• @DeepForger
“Style Transfer” papers

Audio Generation
• https://soundcloud.com/graphific/neural-music-walk
• https://soundcloud.com/graphific/pyotr-lstm-tchaikovsky
• https://soundcloud.com/graphific/neural-remix-net
A Recurrent Latent Variable Model for Sequential Data, 2016,
J. Chung, K. Kastner, L. Dinh, K. Goel, A. Courville, Y. Bengio

Python has a wide range of deep learning-related libraries available
Deep Learning with Python
Low level
High level
deeplearning.net/software/theano
caffe.berkeleyvision.org
tensorflow.org/
lasagne.readthedocs.org/en/latest
and of course:
keras.io

Questions?
love letters? existential dilemmas? academic questions? gifts?
find me at:
roelof@kth.se
Code & Papers?
Collaborative Open Computer Science
.com
@graphific

Questions?
love letters? existential dilemmas? academic questions? gifts?
find me at:
roelof@kth.se
Generative “creative” AI “stuff”?
.net
@graphific

(YouTube, Paper)

(Vimeo, Paper)

Generative Adversarial Nets
Emily Denton, Soumith Chintala, Arthur Szlam, Rob Fergus, 2015.  
Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks (GitXiv)

Alec Radford, Luke Metz, Soumith Chintala , 2015.  
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (GitXiv)

“turn” vector created from four averaged samples of faces looking left vs looking right.
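The "turn" vector can be sketched as the difference between two averaged groups of latent codes, which is then added to another code in steps. A minimal sketch, assuming invented 3-dimensional latent codes (the real model uses much larger ones) and no actual generator; `interpolate` is a hypothetical helper:

```python
# Made-up 3-d latent codes standing in for samples whose generated faces look left or right.
looking_left = [[0.8, 0.1, 0.3], [0.9, 0.0, 0.2], [0.7, 0.2, 0.4], [0.8, 0.1, 0.3]]
looking_right = [[-0.8, 0.1, 0.3], [-0.7, 0.0, 0.2], [-0.9, 0.2, 0.4], [-0.8, 0.1, 0.3]]

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Averaging four samples per pose cancels sample-specific detail;
# the difference of the two averages is the "turn" direction.
turn = [r - l for l, r in zip(mean(looking_left), mean(looking_right))]

def interpolate(z, direction, steps=5):
    """Latent codes from z to z + direction, for feeding to a generator."""
    return [[zi + (t / (steps - 1)) * di for zi, di in zip(z, direction)]
            for t in range(steps)]

path = interpolate([0.8, 0.5, -0.2], turn)
print([round(v, 2) for v in path[-1]])
```

Feeding each code along `path` to the generator would produce a face that gradually turns from one side to the other.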

walking through the manifold

top: unmodified samples
bottom: same samples dropping out “window” filters
