Deep Learning for Semantic Composition
Xiaodan Zhu∗ & Edward Grefenstette†
∗National Research Council Canada
Queen’s University
zhu2048@gmail.com
†DeepMind
etg@google.com
July 30th, 2017
Outline
1 Introduction
Semantic composition
Formal methods
Simple parametric models
2 Parameterizing Composition Functions
Recurrent composition models
Recursive composition models
Convolutional composition models
Unsupervised models
3 Selected Topics
Compositionality and non-compositionality
Subword composition methods
4 Summary
Principle of Compositionality
Principle of compositionality: The meaning of a whole is a function of
the meaning of its parts.
While we focus on natural language, compositionality exists not just
in language.
Sound/music
Musical notes are composed with some regularity, not arranged randomly,
to form a song.
Vision
Natural scenes are composed of meaningful components.
Artificial visual art pieces often convey meaning with regularity
arising from their parts.
Principle of Compositionality
Compositionality is regarded by many as a fundamental component of
intelligence, in addition to language understanding (Miller et al., 1976;
Fodor et al., 1988; Bienenstock et al., 1996; Lake et al., 2016).
For example, Lake et al. (2016) emphasize several essential
ingredients for building machines that "learn and think like people":
Compositionality
Intuitive physics/psychology
Learning-to-learn
Causality models
Note that many of these challenges are present in natural language
understanding.
They are reflected in the sparseness encountered when training an NLP model.
Note also that compositionality may be entangled with the other
"ingredients" listed above.
Semantic Composition in Natural Language
good → very good → not very good → ...
Semantic Composition in Natural Language
Figure: Results from (Zhu et al., 2014). A dot in the figure corresponds to
a negated phrase (e.g., not very good) in the Stanford Sentiment Treebank
(Socher et al., 2013). The y-axis is its sentiment value and the x-axis the
sentiment of its argument.
Even a one-layer composition, over one dimension of meaning (e.g.,
semantic orientation (Osgood et al., 1957)), can be a complicated mapping.
Semantic Composition in Natural Language
good → very good → not very good → ...
senator → former senator → ...
basketball player → short basketball player → ...
giant → small giant → ...
empty/full → half empty/full → almost half empty/full → ...
(See more examples in (Partee, 1995).)
Semantic Composition in Natural Language
Semantic composition in natural language: the task of modelling the
meaning of a larger piece of text by composing the meaning of its
constituents.
modelling: learning a representation
The compositionality of language is very challenging, as discussed above.
Compositionality can be entangled with other challenges, such as those
emphasized in (Lake et al., 2016).
a larger piece of text: a phrase, sentence, or document.
constituents: subword components, words, phrases.
Introduction
Two key problems:
How to represent meaning?
How to learn such a representation?
Representation
Let's first very briefly revisit the representation we assume in this tutorial
... and leave the learning problem to the rest of the tutorial.
Love
Representation
love, admiration, satisfaction ...
anger, fear, hunger ...
Representation
A viewpoint from The Emotion Machine (Minsky, 2006):
Each variable responds to different concepts, and each concept is
represented by different variables.
This is exactly a distributed representation.
Modelling Composition Functions
How do we model the composition functions?
Representation
Deep Learning for Semantic Composition
Deep learning: We focus on deep learning models in this tutorial.
"Wait a minute, deep learning again?"
"DL people, leave language alone ..."
Asking some questions may be helpful:
Do deep learning models provide good function or density
approximation, which is essentially what many specific NLP tasks
seek? X → Y
Are continuous vector representations of meaning effective for (at
least some) NLP tasks? Are DL models convenient for computing
such continuous representations?
Do DL models naturally bridge language with other modalities in
terms of both representation and learning? (This could be important.)
Introduction
More questions:
What NLP problems (e.g., semantic problems here) can be better
handled with DL and what cannot?
Can NLP benefit from combining DL and other approaches (e.g.,
symbolic approaches)?
In general, is the effectiveness of DL models for semantics already
well understood?
Introduction
Deep Learning for Semantic Composition
Formal Semantics
Montague Semantics (1970–1973):
Treat natural language like a formal language via
an interpretation function [[. . .]], and
a mapping from CFG rules to function application order.
Interpretation of a sentence reduces to logical form via β-reduction.
High Level Idea
Syntax guides composition, types determine their semantics, predicate
logic does the rest.
Formal Semantics
Syntactic Analysis              Semantic Interpretation
S  ⇒ NP VP                      [[VP]]([[NP]])
NP ⇒ cats, milk, etc.           [[cats]], [[milk]], . . .
VP ⇒ Vt NP                      [[Vt]]([[NP]])
Vt ⇒ like, hug, etc.            λyx.[[like]](x, y), . . .

Derivation of "Cats like milk." (root to leaves):
[[like]]([[cats]], [[milk]])
   [[cats]]        λx.[[like]](x, [[milk]])
                   λyx.[[like]](x, y)        [[milk]]
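To make the function-application view concrete, here is a minimal Python sketch (not from the tutorial) of the toy grammar above, with ordinary functions standing in for the interpretations [[cats]], [[milk]], and [[like]]; the lexicon, the tuple-based logical forms, and the toy model WORLD are illustrative assumptions.

```python
# A minimal sketch of Montague-style composition: syntax guides function
# application, and a curried transitive verb builds a logical form.
# The lexicon entries and the toy "world" below are illustrative assumptions.

CATS, MILK = "cats", "milk"          # [[cats]], [[milk]]: constants
WORLD = {("like", "cats", "milk")}   # facts that hold in our toy model

def like(y):                         # [[like]] = lambda y: lambda x: like(x, y)
    return lambda x: ("like", x, y)

def interpret_vp(vt, np):            # VP => Vt NP  gives  [[Vt]]([[NP]])
    return vt(np)

def interpret_s(np, vp):             # S  => NP VP  gives  [[VP]]([[NP]])
    return vp(np)

logical_form = interpret_s(CATS, interpret_vp(like, MILK))
print(logical_form)                  # ('like', 'cats', 'milk')
print(logical_form in WORLD)         # evaluate truth in the toy model: True
```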
Formal Semantics
Pros:
Intuitive and interpretable(?) representations.
Leverage the power of predicate logic to model semantics.
Evaluate the truth of statements, derive conclusions, etc.
Cons:
Brittle, requires robust parsers.
Extensive logical model required for evaluation of clauses.
Extensive set of rules required to do anything useful.
Overall, an intractable (or unappealing) learning problem.
Simple Parametric Models
Basic models with pre-defined function form (Mitchell et al., 2008):
General form:    p = f(u, v, R, K)
Add:             p = u + v
WeightAdd:       p = αᵀu + βᵀv
Multiplicative:  p = u ⊗ v
Combined:        p = αᵀu + βᵀv + γᵀ(u ⊗ v)
We will see later in this tutorial that the above models could be seen as
special cases of more complicated composition models.
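As a rough NumPy sketch of the models above (not code from Mitchell & Lapata), with ⊗ read as element-wise multiplication and the weight vectors α, β, γ applied element-wise, which is one possible reading of the notation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
u, v = rng.normal(size=d), rng.normal(size=d)     # word vectors
alpha, beta, gamma = rng.normal(size=(3, d))      # composition parameters

def add(u, v):                # Add: p = u + v
    return u + v

def weight_add(u, v):         # WeightAdd: element-wise weighted sum
    return alpha * u + beta * v

def multiplicative(u, v):     # Multiplicative: element-wise product
    return u * v

def combined(u, v):           # Combined: weighted sum plus weighted product
    return alpha * u + beta * v + gamma * (u * v)

for f in (add, weight_add, multiplicative, combined):
    print(f.__name__, f(u, v))
```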
Results
Reference (R): The color ran.
High-similarity landmark (H): The color dissolved.
Low-similarity landmark (L): The color galloped.
A good composition model should give the R-H pair above a similarity score
higher than that given to the R-L pair. A good model should also assign
similarity scores that correlate highly (ρ) with human judgments.

Models      R-H similarity   R-L similarity   ρ
NonComp     0.27             0.26             0.08**
Add         0.59             0.59             0.04*
WeightAdd   0.35             0.34             0.09**
Kintsch     0.47             0.45             0.09**
Multiply    0.42             0.28             0.17**
Combined    0.38             0.28             0.19**
UpperBound  4.94             3.25             0.40**

Table: Mean cosine similarities for the R-H pairs and R-L pairs, as well as the
correlation coefficients (ρ) with human judgments (*: p < 0.05, **: p < 0.01).
Parameterizing Composition Functions
To move beyond simple algebraic or parametric models we need function
approximators which, ideally:
Can approximate any arbitrary function (e.g. ANNs).
Can cope with variable size sequences.
Can capture long range or unbounded dependencies.
Can implicitly or explicitly model structure.
Can be trained against a supervised or unsupervised objective (or
both — semi-supervised training).
Can be trained primarily through backpropagation.
A Neural Network Model Zoo
This section presents a selection of models satisfying some (if not all) of
these criteria.
Recurrent Neural Networks
Bounded Methods
Many methods impose explicit or implicit length limits on conditioning
information. For example:
order-n Markov assumption in NLM/LBL
fully-connected layers and dynamic pooling in conv-nets
Figure: A recurrent cell takes input wj and previous state hj−1, and produces
next state hj and output f(w1:j).
Recurrent Neural Networks introduce a repeatedly
composable unit, the recurrent cell, which both
models an unbounded sequence prefix and expresses a
function over it.
The Mathematics of Recurrence
Figure: A recurrent cell mapping inputs and the previous state to outputs
and the next state.
Building Blocks
An input vector wj ∈ R|w|
A previous state hj−1 ∈ R|h|
A next state hj ∈ R|h|
An output yj ∈ R|y|
fy : R|w| × R|h| → R|y|
fh : R|w| × R|h| → R|h|
Putting it together
hj = fh(wj , hj−1)
yj = fy (wj , hj )
So yj = fy (wj , fh(wj , hj−1)) = fy (wj , fh(wj , fh(wj−1, hj−2))) = . . .
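A minimal NumPy sketch of the recurrence above, with a tanh cell standing in for f_h and a linear read-out standing in for f_y; all dimensions and weight names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_w, d_h, d_y = 3, 5, 2                     # |w|, |h|, |y|
W_wh = rng.normal(scale=0.1, size=(d_h, d_w))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
W_hy = rng.normal(scale=0.1, size=(d_y, d_h + d_w))

def f_h(w_j, h_prev):                       # h_j = f_h(w_j, h_{j-1})
    return np.tanh(W_wh @ w_j + W_hh @ h_prev)

def f_y(w_j, h_j):                          # y_j = f_y(w_j, h_j)
    return W_hy @ np.concatenate([w_j, h_j])

h = np.zeros(d_h)                           # h_0
for w_j in rng.normal(size=(6, d_w)):       # a sequence of 6 input vectors
    h = f_h(w_j, h)
    y = f_y(w_j, h)
print("final state:", h)
print("final output:", y)
```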
RNNs for Language Modelling
Language modelling
We want to model the joint probability of tokens t1, . . . tn in a sequence:
P(t1, . . . , tn) = P(t1) ∏_{i=2}^{n} P(ti | t1, . . . , ti−1)
Adapting a recurrence for basic LM
For vocab V, define an embedding matrix E ∈ R|V |×|w| and a logit
projection matrix WV ∈ R|y|×|V |. Then:
wj = embed(tj , E)
yj = fy (wj , hj ) hj = fh(wj , hj−1)
pj = softmax(yj WV )
P(tj+1|t1, . . . , tj ) = Categorical(tj+1; pj )
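A self-contained sketch of one language-model step along these lines; for brevity the output y_j is taken to be the hidden state h_j itself, and the vocabulary, sizes, and weight names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d_w, d_h = 8, 4, 6                              # toy vocab and sizes
E = rng.normal(scale=0.1, size=(V, d_w))           # embedding matrix
W_V = rng.normal(scale=0.1, size=(d_h, V))         # logit projection
W_in = rng.normal(scale=0.1, size=(d_h, d_w + d_h))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

h = np.zeros(d_h)
tokens = [3, 1, 5]                                 # token ids t_1, t_2, t_3
for t in tokens:
    w = E[t]                                       # w_j = embed(t_j, E)
    h = np.tanh(W_in @ np.concatenate([w, h]))     # h_j = f_h(w_j, h_{j-1})
    p = softmax(h @ W_V)                           # p_j = softmax(y_j W_V)
print("P(t_4 | t_1..t_3):", p)                     # categorical over next token
```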
Aside: The Vanishing Gradient Problem and LSTM RNNs
An RNN is deep "in time", so it can
seriously suffer from the vanishing
gradient problem.
An LSTM adds memory cells and
multiple "gates" to control
information flow. If properly learned,
an LSTM can keep fairly long-distance
(hundreds of time steps) information
in memory.
Memory-cell details:
it = σ(Wxi xt + Whi ht−1 + Wci ct−1)
ft = σ(Wxf xt + Whf ht−1 + Wcf ct−1)
ct = ft ⊙ ct−1 + it ⊙ tanh(Wxc xt + Whc ht−1)
ot = σ(Wxo xt + Who ht−1 + Wco ct)
ht = ot ⊙ tanh(ct)
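A NumPy sketch of one step of the (peephole-style) memory cell above; biases are omitted to stay close to the slide, and all sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 3, 4
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, split into input / hidden / peephole parts.
W = {k: rng.normal(scale=0.1, size=(d_h, d_x)) for k in ("xi", "xf", "xc", "xo")}
U = {k: rng.normal(scale=0.1, size=(d_h, d_h)) for k in ("hi", "hf", "hc", "ho")}
p = {k: rng.normal(scale=0.1, size=d_h) for k in ("ci", "cf", "co")}

def lstm_step(x_t, h_prev, c_prev):
    i_t = sigmoid(W["xi"] @ x_t + U["hi"] @ h_prev + p["ci"] * c_prev)  # input gate
    f_t = sigmoid(W["xf"] @ x_t + U["hf"] @ h_prev + p["cf"] * c_prev)  # forget gate
    c_t = f_t * c_prev + i_t * np.tanh(W["xc"] @ x_t + U["hc"] @ h_prev)
    o_t = sigmoid(W["xo"] @ x_t + U["ho"] @ h_prev + p["co"] * c_t)     # output gate
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_x)):      # run over a short input sequence
    h, c = lstm_step(x_t, h, c)
print(h)
```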
Conditional Language Models
Conditional Language Modelling
A strength of RNNs is that hj can model not only the history of the
generated/observed sequence t1, . . . , tj , but also any conditioning
information β, e.g. by setting h0 = β.
Encoder-Decoder Models with RNNs
Figure: Source sequence "Les chiens aiment les os" transduced into the
target sequence "Dogs love bones </s>".
cf. Kalchbrenner et al., 2013; Sutskever et al., 2014
Model p(t1, . . . , tn | s1, . . . , sm):
h^e_i = RNN_encoder(si, h^e_{i−1})
h^d_i = RNN_decoder(ti, h^d_{i−1})
h^d_0 = h^e_m
t_{i+1} ∼ Categorical(t; fV(h^d_i))
The encoder RNN as a composition module
All information needed to transduce the source into the target sequence
using RNN_decoder needs to be present in the start state h^d_0.
This start state is produced by RNN_encoder, which will learn to compose.
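A compact sketch of the encoder's composition role: the encoder's final state initialises the decoder. The plain tanh recurrences and the greedy argmax decoding are illustrative stand-ins for RNN_encoder, RNN_decoder, and sampling from the Categorical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5                                            # shared state/embedding size
V_tgt = 7                                        # toy target vocabulary size
W_enc = rng.normal(scale=0.1, size=(d, 2 * d))
W_dec = rng.normal(scale=0.1, size=(d, 2 * d))
W_out = rng.normal(scale=0.1, size=(V_tgt, d))
E_tgt = rng.normal(scale=0.1, size=(V_tgt, d))

def step(W, x, h):                               # one tanh recurrence step
    return np.tanh(W @ np.concatenate([x, h]))

# Encoder: compose the source sequence into h^e_m.
source = rng.normal(size=(4, d))                 # 4 source word vectors
h_e = np.zeros(d)
for s_i in source:
    h_e = step(W_enc, s_i, h_e)

# Decoder: h^d_0 = h^e_m, then generate greedily.
h_d, t = h_e, 0                                  # token 0 plays the role of <s>
for _ in range(3):
    h_d = step(W_dec, E_tgt[t], h_d)
    t = int(np.argmax(W_out @ h_d))              # t_{i+1} from f_V(h^d_i)
    print("generated token id:", t)
```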
RNNs as Sentence Encoders
This idea of RNNs as sentence encoders works for classification as well:
Data is labelled sequences (s1, . . . , s|s|; ŷ).
An RNN is run over s to produce the final state h|s| = RNN(s).
A differentiable function of h|s| classifies: y = fθ(h|s|)
h|s| can be taken to be the composed meaning of s, with regard to
the task at hand.
An aside: Bi-directional RNN encoders
For both sequence classification and generation, sometimes a
Bi-directional RNN is used to encode:
h^←_i = RNN^←(si, h^←_{i+1})      h^→_i = RNN^→(si, h^→_{i−1})
h|s| = concat(h^←_1, h^→_|s|)
A Transduction Bottleneck
Single vector representation of
sentences causes problems:
Training focusses on learning a
marginal language model of the
target language first.
Longer input sequences cause
compressive loss.
Encoder gets significantly
diminished gradient.
In the words of Ray Mooney. . .
“You can’t cram the meaning of a whole %&!$ing sentence into a single
$&!*ing vector!” Yes, the censored-out swearing is copied verbatim.
Attention
Figure: Source sequence "Les chiens aiment les os", target sequence
"Dogs love bones </s>".
cf. Bahdanau et al., 2014
We want to use h^e_1, . . . , h^e_m when predicting ti, by conditioning on
words that might relate to ti:
1 Compute h^d_i (RNN update)
2 eij = f_att(h^d_i, h^e_j)
3 aij = softmax(e_i)_j
4 h^att_i = Σ_{j=1}^{m} aij h^e_j
5 ĥi = concat(h^d_i, h^att_i)
6 t_{i+1} ∼ Categorical(t; fV(ĥi))
The many faces of attention
Many variants on the above process: early attention (based on h^d_{i−1} and ti,
used to update h^d_i), different attentive functions f_att (e.g. based on
projected inner products, or MLPs), and so on.
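The six steps above in NumPy, with a plain dot product as one illustrative choice of f_att:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 4                                   # source length, state size
H_e = rng.normal(size=(m, d))                 # encoder states h^e_1 .. h^e_m
h_d = rng.normal(size=d)                      # current decoder state h^d_i

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

e_i = H_e @ h_d                               # (2) e_ij = f_att(h^d_i, h^e_j), dot product
a_i = softmax(e_i)                            # (3) a_ij = softmax(e_i)_j
h_att = a_i @ H_e                             # (4) h^att_i = sum_j a_ij h^e_j
h_hat = np.concatenate([h_d, h_att])          # (5) concat(h^d_i, h^att_i)
print("attention weights:", a_i)              # (6) feed h_hat to f_V for t_{i+1}
```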
Attention and Composition
We refer to the set of source activation vectors h^e_1, . . . , h^e_m in the previous
slides as an attention matrix. Is it a suitable sentence representation?
Pros:
Locally compositional: vectors contain information about other words
(especially with bi-directional RNN as encoder).
Variable size sentence representation: longer sentences yield larger
representation with more capacity.
Cons:
Single vector representation of sentences is convenient (many
decoders, classifiers, etc. expect fixed-width feature vectors as input)
Locally compositional, but are long range dependencies resolved in
the attention matrix? Does it truly express the sentence’s meaning as
a semantic unit (or is it just good for sequence transduction)?
Recursive Neural Networks
Recursive networks: a generalization of (chain) recurrent networks to a
computational graph that is often a tree (Pollack, 1990; Francesconi et al.,
1997; Socher et al., 2011a,b,c, 2013; Zhu et al., 2015b).
Recursive Neural Networks
Successfully applied to model structured input data:
Natural language processing (Socher et al., 2011a,c; Le et al., 2015;
Tai et al., 2015; Zhu et al., 2015b)
Computer vision (Socher et al., 2011b)
How to determine the structures?
Encode given "external" knowledge about the structure of the input
data, e.g., syntactic structures; modelling sentential semantics and
syntax is one of the most interesting problems in language.
Encode simply a complete tree.
Integrating Syntactic Parses in Composition
Recursive Neural Tensor Network (Socher et al., 2012):
The structure is given (here by a constituency parser).
Each node is implemented as a regular feed-forward layer plus a
3rd-order tensor.
The tensor captures 2nd-degree (quadratic) polynomial interactions of the
children, e.g., b_i^2, b_i c_j, and c_j^2.
Results
The models have been successfully applied to a number of tasks such as
sentiment analysis (Socher et al., 2013).
Table: Accuracy for fine grained (5-class) and binary predictions at the
sentence level (root) and for all nodes.
Tree-LSTM
Tree-structured LSTM (Le, *SEM-15; Tai, ACL-15; Zhu, ICML-15): an
extension of the chain LSTM to tree structures.
If you have a non-binary tree, a simple solution is to binarize it.
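A sketch of recursive composition over a binarized tree, with a plain tanh TreeRNN composer standing in for the gated Tree-LSTM cell; the toy vocabulary and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
E = {w: rng.normal(scale=0.5, size=d) for w in ["not", "very", "good"]}
W = rng.normal(scale=0.5, size=(d, 2 * d))

def compose(left, right):
    # A plain TreeRNN composer; a Tree-LSTM would add gates and a memory
    # cell per node, but the recursive wiring is the same.
    return np.tanh(W @ np.concatenate([left, right]))

def encode(tree):
    # A tree is either a word (leaf) or a (left, right) pair of subtrees.
    if isinstance(tree, str):
        return E[tree]
    left, right = tree
    return compose(encode(left), encode(right))

# Binarized parse of "not (very good)".
root = encode(("not", ("very", "good")))
print(root)
```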
Tree-LSTM Application: Sentiment Analysis
Sentiment composed over a constituency parse tree:
Tree-LSTM Application: Sentiment Analysis
Results on Stanford Sentiment Treebank (Zhu et al., 2015b):
Models roots phrases
NB 41.0 67.2
SVM 40.7 64.3
RvNN 43.2 79.0
RNTN 45.7 80.7
Tree-LSTM 48.9 81.9
Table: Performances (accuracy) of models on Stanford Sentiment Treebank, at
the sentence level (roots) and the phrase level.
Tree-LSTM Application: Natural Language Inference
Applied to Natural Language Inference (NLI): Determine if a sentence
entails another, if they contradict, or have no relation (Chen et al., 2017).
Tree-LSTM Application: Natural Language Inference
Accuracy on Stanford Natural Language Inference (SNLI) dataset:
(Chen et al., 2017)
* Welcome to the poster at 6:00-9:30pm on July 31.
Learning Representation for Natural Language Inference
RepEval-2017 Shared Task (Williams et al., 2017): Learn sentence
representation as a fixed-length vector.
Tree-LSTM without Syntactic Parses
What if we simply apply recursive networks over trees that are not
generated from syntactic parses, e.g., complete binary trees?
Multiple efforts on SNLI (Munkhdalai
et al., 2016; Chen et al., 2017) have
observed that such models outperform
sequential (chain) LSTMs.
This could be related to the discussion
that recursive nets may capture
long-distance dependencies (Goodfellow
et al., 2016).
SPINN: Doing Away with Test-Time Trees
Figure: Shift-reduce processing of "the cat sat down" over time steps
t = 0, . . . , T = 7: shift actions move words from the buffer to the stack,
reduce actions merge the top two stack entries, yielding (the cat) (sat down),
whose representation is output to the model for the semantic task.
Image credit: Sam Bowman and co-authors.
cf. Bowman et al., 2016
Shift-Reduce Parsers:
Exploit isomorphism between binary branching trees with T leaves
and sequences of 2T − 1 binary shift/reduce actions.
Shift unattached leaves from a buffer onto a processing stack.
Reduce the top two child nodes on the stack to a single parent node.
SPINN: Jointly train a TreeRNN and a vector-based shift-reduce parser.
Training time trees offer supervision for shift-reduce parser.
No need for test time trees!
SPINN: Doing Away with Test-Time Trees
Figure: The SPINN buffer, stack, composition, tracking, and transition
components during a reduce and a shift step.
Image credit: Sam Bowman and co-authors.
Word vectors start on buffer b (top: first word in sentence).
Shift moves word vectors from buffer to stack s.
Reduce pops the top two vectors off the stack, applies
f^R : R^d × R^d → R^d, and pushes the result back onto the stack
(i.e. TreeRNN composition).
The tracker LSTM tracks parser/composer state across operations,
decides shift-reduce operations a, and is supervised by both observed
shift-reduce operations and the end task:
h_t = LSTM(f^C(b_{t−1}[0], s_{t−1}[0], s_{t−1}[1]), h_{t−1})      a_t ∼ f^A(h_t)
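A sketch of composing a sentence by following a given shift/reduce action sequence, with a tanh reducer standing in for f^R; the tracker LSTM and the learned transition policy of SPINN are omitted, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
E = {w: rng.normal(scale=0.5, size=d) for w in ["the", "cat", "sat", "down"]}
W_r = rng.normal(scale=0.5, size=(d, 2 * d))

def reduce_fn(left, right):                    # f^R: R^d x R^d -> R^d
    return np.tanh(W_r @ np.concatenate([left, right]))

def compose(words, actions):
    buffer = [E[w] for w in words]             # leaves, first word on top
    stack = []
    for a in actions:
        if a == "shift":                       # move the next leaf onto the stack
            stack.append(buffer.pop(0))
        else:                                  # reduce the top two nodes
            right, left = stack.pop(), stack.pop()
            stack.append(reduce_fn(left, right))
    return stack[-1]                           # root representation

# 2T - 1 = 7 actions for T = 4 leaves: ((the cat) (sat down))
actions = ["shift", "shift", "reduce", "shift", "shift", "reduce", "reduce"]
print(compose(["the", "cat", "sat", "down"], actions))
```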
A Quick Introduction to Reinforce
What if some part of our process is not differentiable (e.g. samples from
the shift-reduce module in SPINN) but we want to learn with no labels. . .
p(y|x) = E_{pθ(z|x)}[fφ(z, x)]   s.t.  y ∼ fφ(z, x)  or  y = fφ(z, x)
∇φ p(y|x) = Σ_z pθ(z|x) ∇φ fφ(z, x) = E_{pθ(z|x)}[∇φ fφ(z, x)]
∇θ p(y|x) = Σ_z fφ(z, x) ∇θ pθ(z|x) = ???
A Quick Introduction to Reinforce
The Reinforce Trick (R. J. Williams, 1992)
∇θ log pθ(z|x) = ∇θ pθ(z|x) / pθ(z|x)   ⇒   ∇θ pθ(z|x) = pθ(z|x) ∇θ log pθ(z|x)

∇θ p(y|x) = Σ_z fφ(z, x) ∇θ pθ(z|x)
          = Σ_z fφ(z, x) pθ(z|x) ∇θ log pθ(z|x)
          = E_{pθ(z|x)}[fφ(z, x) ∇θ log pθ(z|x)]

This naturally extends to cases where p(z|x) = p(z1, . . . , zn|x).
RL vocab: samples of such sequences of discrete actions are referred to
as "traces". We often refer to pθ(z|x) as a policy πθ(z; x).
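A toy sketch of the resulting score-function estimator for a Bernoulli policy, compared against the exact gradient; the policy, reward, and sample size are illustrative and unrelated to SPINN.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                                   # p_theta(z=1) = theta (Bernoulli policy)
f = lambda z: 4.0 if z == 1 else 1.0          # reward f(z)

# Exact gradient of E[f(z)] = theta*4 + (1-theta)*1  ->  d/dtheta = 3
exact = 3.0

# Reinforce estimate: E_p[ f(z) * d/dtheta log p_theta(z) ]
samples = rng.random(100_000) < theta         # z ~ Bernoulli(theta)
score = np.where(samples, 1.0 / theta, -1.0 / (1.0 - theta))  # d log p / d theta
estimate = np.mean(np.where(samples, f(1), f(0)) * score)

print("exact:", exact, "reinforce estimate:", round(float(estimate), 3))
```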
SPINN+RL: Doing Away with Training-Time Trees
“Drop in” extension to SPINN (Yogatama et al., 2016):
Treat a_t ∼ f^A(h_t) as a policy π^A_θ(a_t; h_t), trained via Reinforce.
Reward is the negated loss of the end task, e.g. log-likelihood of the
correct label.
Everything else is trained by backpropagation against the end task:
tracker LSTM, representations, etc. receive gradient both from the
supervised objective, and from Reinforce via the shift-reduce policy.
Figure: Induced tree structures for "a woman wearing sunglasses is
frowning ." and "a boy drags his sleds through the snow ."
The model recovers linguistic-like structures (e.g. noun phrases, auxiliary
verb-verb pairing, etc.).
SPINN+RL: Doing Away with Training-Time Trees
Does RL-SPINN work? According to Yogatama et al. (2016):
Better than LSTM baselines: the model captures and exploits structure.
Better than SPINN benchmarks: the model is not biased by what
linguists think trees should be like; it only has a loose inductive bias
towards tree structures.
But some parses do not reflect order of composition (see below).
A semi-supervised setup may be sensible.
Figure: Induced trees for "two men are playing frisbee in the park ." and
"family members standing outside a home ."
Some "bad" parses, but not necessarily worse results.
Convolutional Neural Networks
Visual Inspiration: How do we learn to recognise pictures?
Will a fully connected neural network do the trick?
ConvNets for pictures
Problem: lots of variance that shouldn’t matter (position, rotation, skew,
difference in font/handwriting).
ConvNets for pictures
Solution: Accept that features are local. Search for local features with a
window.
ConvNets for pictures
The convolutional window acts as a classifier for local features.
ConvNets for pictures
Different convolutional maps can be trained to recognise different features
(e.g. edges, curves, serifs).
ConvNets for pictures
Stacked convolutional layers learn higher-level features.
Figure: Raw image → first-order local features → higher-order features →
prediction, via stacked convolutional layers and a fully connected layer.
One or more fully-connected layers learn a classification function over the
highest level of representation.
ConvNets for language
Convolutional neural networks fit natural language well.
Deep ConvNets capture:
Positional invariances
Local features
Hierarchical structure
Language has:
Some positional invariance
Local features (e.g. POS)
Hierarchical structure (phrases,
dependencies)
ConvNets for language
How do we go from images to sentences? Sentence matrices!
ConvNets for language
Does a convolutional window make sense for language?
ConvNets for language
A better solution: feature-specific windows.
Word Level Sentence Vectors with ConvNets
Figure: The Dynamic Convolutional Neural Network applied to the sentence
"game's the same, just got more fierce": a projected sentence matrix (s = 7),
wide convolution (m = 3), dynamic k-max pooling (k = f(s) = 5), wide
convolution (m = 2), folding, k-max pooling (k = 3), and a fully connected
layer.
cf. Kalchbrenner et al., 2014
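A sketch of the two core operations in the figure, a wide (row-wise) convolution over a sentence matrix followed by k-max pooling; folding and the dynamic choice of k are omitted, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, m, k = 4, 7, 3, 3                       # embedding dim, sentence length, filter width, k
S = rng.normal(size=(d, s))                   # sentence matrix (one column per word)
F = rng.normal(scale=0.5, size=(d, m))        # one 1-D filter per embedding row

def wide_conv(S, F):
    # Wide convolution: zero-pad so every filter position that overlaps the
    # sentence is kept, giving s + m - 1 output columns per row.
    pad = np.zeros((S.shape[0], F.shape[1] - 1))
    Sp = np.hstack([pad, S, pad])
    out = np.empty((S.shape[0], S.shape[1] + F.shape[1] - 1))
    for j in range(out.shape[1]):
        out[:, j] = np.sum(Sp[:, j:j + F.shape[1]] * F, axis=1)
    return out

def k_max_pool(C, k):
    # Keep, per row, the k largest values in their original order.
    idx = np.sort(np.argsort(C, axis=1)[:, -k:], axis=1)
    return np.take_along_axis(C, idx, axis=1)

C = np.tanh(wide_conv(S, F))                  # feature map, shape (d, s + m - 1)
P = k_max_pool(C, k)                          # pooled map, shape (d, k)
print(P.shape)
```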
Character Level Sentence Vectors with ConvNets
Image credit: Yoon Kim and co-authors.
cf. Kim et al., 2016
Naively, we could just represent
everything at character level.
Convolutions seem to work well
for low-level patterns
(e.g. morphology)
One interpretation: multiple
filters can capture the low-level
idiosyncrasies of natural
language (e.g. arbitrary spelling)
whereas language is more
compositional at a higher level.
ConvNet-like Architectures for Composition
Image credit: Nal Kalchbrenner and co-authors.
cf. Kalchbrenner et al., 2016
Many other CNN-like
architectures (e.g. ByteNet from
Kalchbrenner et al. (2016))
Common recipe components:
dilated convolutions and ResNet
blocks.
These model sequences well in
domains like speech, and are
beginning to find applications in
NLP, so worth reading up on.
Unsupervised Composition Models
Why care about unsupervised learning?
Much more unlabelled linguistic data than labelled data.
Learn general purpose representations and composition functions.
Suitable pre-training for supervised models, semi-supervised, or
multi-task objectives.
In the (paraphrased) words of Yann LeCun: unsupervised learning is a
cake, supervised learning is frosting, and RL is the cherry on top!
Plot twist: it’s possibly a cherry cake.
Yes, that’s nice. . . But what are we doing, concretely?
Good question! Usually, just modelling—directly or indirectly—some
aspect of the probability of the observed data.
Further suggestions on a postcard, please!
Autoencoders
Autoencoders provide an unsupervised method for representation learning:
We minimise an objective function over inputs xi, i ∈ N, and their
reconstructions x̂i:
    J = (1/2) Σ_{i=1}^{N} ||xi − x̂i||²
Warning: degenerate solution if the xi can be updated (∀i. xi = 0).
Recursive Autoencoders
cf. Socher et al., 2011a
To auto-encode variable length
sequences, we can chain
autoencoders to create a recursive
structure.
Objective Function
Minimizing the reconstruction error
will learn a compression function over
the inputs:
    E_rec(i, θ) = (1/2) ||xi − x̂i||²
A "modern" alternative: use a sequence-to-sequence model and a
log-likelihood objective.
What’s wrong with auto-encoders?
Empirically, narrow auto-encoders produce sharp latent codes, and
unregularised wide auto-encoders learn identity functions.
Reconstruction objective includes nothing about distance preservation
in latent space: no guarantee that
dist(a, b) ≤ dist(a, c)
→ dist(encode(a), encode(b)) ≤ dist(encode(a), encode(c))
Conversely, little incentive for similar latent codes to generate
radically different (but semantically equivalent) observations.
Ultimately, compression ≠ meaning.
Skip-Thought
Image credit: Jamie Kiros and co-authors.
cf. Kiros et al., 2015
Similar to the auto-encoding objective: encode a sentence, but decode the
neighbouring sentences.
Pair of LSTM-based seq2seq models with a shared encoder, but
alternative formulations are possible.
Conceptually similar to distributional semantics: a unit's
representation is a function of its neighbouring units, except the units
are sentences instead of words.
Variational Auto-Encoders
Semantically Weak Codes
Generally, auto-encoders sparsely encode or densely compress information.
No pressure to ensure similarity continuum amongst codes.
Factorized Generative Picture
p(x) = ∫ p(x, z) dz = ∫ p(x|z) p(z) dz = E_{p(z)}[p(x|z)]
(Graphical model: z ∼ N(0, I), z → x.)
Prior on z enforces semantic continuum (e.g. no arbitrarily unrelated codes
for similar data), but expectation is typically intractable to compute
exactly, and Monte Carlo estimate of gradients will be high variance.
Variational Auto-Encoders
Goal
Estimate, by maximising p(x):
The parameters θ of a function modelling part of the generative
process pθ(x|z) given samples from a fixed prior z ∼ p(z).
The parameters φ of a distribution qφ(z|x) approximating the true
posterior p(z|x).
How do we do it? We maximise p(x) via a variational lower bound (VLB):
    log p(x) ≥ E_{qφ(z|x)}[log pθ(x|z)] − D_KL(qφ(z|x) || p(z))
Equivalently we can minimise NLL(x):
    NLL(x) ≤ E_{qφ(z|x)}[NLLθ(x|z)] + D_KL(qφ(z|x) || p(z))
Variational Auto-Encoders
Let’s derive the VLB:
log p(x) = log ∫ 1 · pθ(x|z) p(z) dz
         = log ∫ (qφ(z|x) / qφ(z|x)) pθ(x|z) p(z) dz
         = log E_{qφ(z|x)}[(p(z) / qφ(z|x)) pθ(x|z)]
         ≥ E_{qφ(z|x)}[log (p(z) / qφ(z|x)) + log pθ(x|z)]     (Jensen's inequality)
         = E_{qφ(z|x)}[log pθ(x|z)] − D_KL(qφ(z|x) || p(z))
For the right qφ(z|x) and p(z) (e.g. Gaussians) there is a closed-form
expression for D_KL(qφ(z|x) || p(z)).
Variational Auto-Encoders
The problem of stochastic gradients
Estimating ∂/∂φ E_{qφ(z|x)}[log pθ(x|z)] requires backpropagating through
samples z ∼ qφ(z|x). For some choices of q, such as Gaussians, there are
reparameterization tricks (cf. Kingma et al., 2013).
Reparameterizing Gaussians (Kingma et al., 2013)
    z ∼ N(z; µ, σ²)   is equivalent to   z = µ + σ ⊙ ε,  where ε ∼ N(ε; 0, I)
Trivially:
    ∂z/∂µ = 1        ∂z/∂σ = ε
Variational Auto-Encoders for Sentences
1 Observe a sentence w1, . . . , wn. Encode it, e.g. with an LSTM:
h^e = LSTM_e(w1, . . . , wn)
2 Predict µ = f^µ(h^e) and σ² = f^σ(h^e) (in practice we operate in log
space for σ² by determining log σ).
3 Sample z ∼ q(z|x) = N(z; µ, σ²)
4 Use a conditional RNN to decode and measure log p(x|z). Use the
closed-form formula for the KL divergence of two Gaussians to calculate
−D_KL(qφ(z|x) || p(z)). Add both to obtain the maximisation objective.
5 Backpropagate the gradient through the decoder normally, based on the
log-likelihood component of the objective, and use the reparameterisation
trick to backpropagate through the sampling operation back to the encoder.
6 The gradient of the KL divergence component of the loss with regard to
the encoder parameters is straightforward backpropagation.
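The six steps in a heavily simplified NumPy sketch: a mean-pooling "encoder" and a unigram "decoder" stand in for the LSTMs so the focus stays on the reparameterised sample and the closed-form Gaussian KL; everything here is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_h, d_z = 10, 6, 3                            # vocab, hidden, latent sizes
E = rng.normal(scale=0.1, size=(V, d_h))
W_mu = rng.normal(scale=0.1, size=(d_z, d_h))
W_ls = rng.normal(scale=0.1, size=(d_z, d_h))
W_dec = rng.normal(scale=0.1, size=(V, d_z))

tokens = np.array([1, 4, 7, 2])                   # observed sentence w_1..w_n

h_e = E[tokens].mean(axis=0)                      # (1) "encode" (mean pool for brevity)
mu, log_sigma = W_mu @ h_e, W_ls @ h_e            # (2) predict mu and log sigma
eps = rng.normal(size=d_z)
z = mu + np.exp(log_sigma) * eps                  # (3) reparameterised sample

logits = W_dec @ z                                # (4) decode: here a unigram model
log_probs = logits - np.log(np.exp(logits).sum())
nll = -log_probs[tokens].sum()                    # reconstruction term

# Closed-form KL( N(mu, sigma^2) || N(0, I) )
kl = 0.5 * np.sum(np.exp(2 * log_sigma) + mu**2 - 1.0 - 2 * log_sigma)

print("NLL:", float(nll), " KL:", float(kl), " bound:", float(nll + kl))
```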
Variational Auto-Encoders and Autoregressivity
The problem of powerful auto-regressive decoders
We want to minimise NLL(x) ≤ E_{q(z|x)}[NLL(x|z)] + D_KL(q(z|x) || p(z)).
What if the decoder is powerful enough to model x without using z?
A degenerate solution:
If z can be ignored when minimising the reconstruction loss of x given
z, the model can safely let q(z|x) collapse to the prior p(z) to
minimise D_KL(q(z|x) || p(z)).
Since q need not depend on x (e.g. the encoder can just ignore x and
predict the mean and variance of the prior), z bears no relation to x.
Result: useless encoder, useless latent variable.
Is this really a problem?
If your decoder is not auto-regressive (e.g. MLPs expressing the probability
of pixels which are conditionally independent given z), then no.
If your decoder is an RNN and domain has systematic patterns, then yes.
Variational Auto-Encoders and Autoregressivity
What are some solutions to this problem?
Pick a non-autoregressive decoder. If you care more about the latent
code than having a good generative model (e.g. document modelling),
this isn’t a bad idea, but frustrating if this is the only solution.
KL Annealing: set E_{q(z|x)}[NLL(x|z)] + α D_KL(q(z|x) || p(z)) as the
objective. Start with α = 0 (basic seq2seq model). Increase α to 1
over time during training. Works somewhat, but is an unprincipled
change of the objective function.
Set as objective E_{q(z|x)}[NLL(x|z)] + max(λ, D_KL(q(z|x) || p(z))), where
λ ≥ 0 is a scalar or vector hyperparameter. Once the KL dips below λ,
there is no benefit, so the model must rely on z to some extent. This
objective is still a valid upper bound on NLL(x) (albeit a looser one).
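A short sketch of the last two remedies, given per-example reconstruction and KL terms; the numbers are made up for illustration.

```python
import numpy as np

nll = np.array([42.0, 37.5, 40.1])        # per-sentence reconstruction NLL (illustrative)
kl = np.array([0.4, 2.3, 0.1])            # per-sentence KL(q(z|x) || p(z))

def annealed(nll, kl, alpha):             # KL annealing: alpha goes 0 -> 1 during training
    return (nll + alpha * kl).mean()

def free_bits(nll, kl, lam):              # max(lambda, KL): no benefit below lambda
    return (nll + np.maximum(lam, kl)).mean()

print(annealed(nll, kl, alpha=0.1))
print(free_bits(nll, kl, lam=1.0))
```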
Compositional or Non-compositional Representation
Such "hard" or "soft" non-compositionality exists at different
granularities of text.
We will discuss some models on how to handle this at the
word-phrase level.
Compositional and Non-compositional Semantics
Compositionality and non-compositionality are both common phenomena
in language.
A framework that is able to consider both compositionality and
non-compositionality is of interest.
A pragmatic viewpoint: if one is able to obtain holistically the
representation of an n-gram or a phrase in text, it would be desirable
for a composition model to have the ability to decide which sources of
knowledge it will use.
In addition to composition, considering non-compositionality may
avoid back-propagating errors unnecessarily and confusing the word
embeddings.
Think about the "kick the bucket" example.
Integrating Compositional and Non-compositional
Semantics
Integrating non-compositionality in recursive networks (Zhu et al., 2015a):
Basic idea: enable individual composition operations to choose
information from different sources, compositional or
non-compositional (e.g., holistically learned).
Integrating Compositional and Non-compositional Semantics
Model 1: Regular bilinear merge (Zhu et al., 2015a)
Model 2: Tensor-based merging (Zhu et al., 2015a)
Model 3: Explicitly gated merging (Zhu et al., 2015a)
Experiment Set-Up
Task: sentiment analysis
Data: Stanford Sentiment Treebank
Non-compositional sentiment
Sentiment of ngrams automatically learned from tweets (Mohammad
et al., 2013).
Polled the Twitter API every four hours from April to December 2012
in search of tweets with either a positive word hashtag or a negative
word hashtag.
Using 78 seed hashtags (32 positive and 36 negative) such as #good,
#excellent, and #terrible to annotate sentiment.
775,000 tweets that contain at least a positive hashtag or a negative
hashtag were used as the learning corpus.
Point-wise mutual information (PMI) is calculated for each bigram
and trigram.
Each sentiment score is converted to a one-hot vector; e.g. a bigram
with a score of -1.5 will be assigned a 5-dimensional vector [0, 1, 0, 0,
0] (i.e., the e vector).
Also using the human annotation that comes with the Stanford Sentiment
Treebank for bigrams and trigrams.
Results
Models sentence-level (roots) all phrases (all nodes)
(1) RNTN 42.44 79.95
(2) Regular-bilinear (auto) 42.37 79.97
(3) Regular-bilinear (manu) 42.98 80.14
(4) Explicitly-gated (auto) 42.58 80.06
(5) Explicitly-gated (manu) 43.21 80.21
(6) Confined-tensor (auto) 42.99 80.49
(7) Confined-tensor (manu) 43.75† 80.66†
Table: Model performances (accuracy) on predicting 5-category sentiment at the
sentence (root) level and phrase level.
(The results are based on version 3.3.0 of Stanford CoreNLP.)
Integrating Compositional and Non-compositional
Semantics
We have discussed integrating non-compositionality in recursive
networks.
What if there are no prior input structures available?
Recall the models we discussed that capture hidden structures.
What if a syntactic parse tree is not very reliable?
e.g., for data like social media text or speech transcripts.
In these situations, how can we still consider non-compositionality in
the composition process?
Integrating Compositional and Non-compositional
Semantics
Integrating non-compositionality in chain recurrent networks (Zhu et al.,
2016)
Integrating Compositional and Non-compositional
Semantics
Non-compositional nodes:
Form the non-compositional paths (e.g., 3-8-9 or 4-5-9).
Allow the embedding spaces of a non-compositional node to be
different from those of a compositional node.
Integrating Compositional and Non-compositional
Semantics
Fork nodes:
Summarize the history so far to
support both compositional and
non-compositional paths.
Integrating Compositional and Non-compositional
Semantics
Merging nodes:
Combine information from compositional and non-compositional paths.
Binarization
Integrating Compositional and Non-compositional
Semantics
Binarization:
Binarize the composition of in-bound
paths (we do not worry too much about
the order of merging).
Now we do not need to design different
nodes for different fan-ins; instead,
parameters are shared across the whole network.
Results
Method SemEval-13 SemEval-14
Majority baseline 29.19 34.46
Unigram (SVM) 56.95 58.58
3rd best model 64.86 69.95
2nd best model 65.27 70.14
The best model 69.02 70.96
DAG-LSTM 70.88 71.97
Table: Performances of different models in official evaluation metric (macro
F-scores) on the test sets of SemEval-2013 and SemEval-2014 Sentiment Analysis
in Twitter in predicting the sentiment of the tweet messages.
Results
Method SemEval-13 SemEval-14
DAG-LSTM
Full paths 70.88 71.97
Full – {autoPaths} 69.36 69.27
Full – {triPaths} 70.16 70.77
Full – {triPaths, biPaths} 69.55 69.93
Full – {manuPaths} 69.88 70.58
LSTM without DAG
Full – {autoPaths,manuPaths} 64.00 66.40
Table: Ablation performances (macro-averaged F-scores) of DAG-LSTM with
different types of paths being removed.
Subword Composition
Composition can also be performed to learn representations for words
from subword components (Botha et al., 2014; Ling et al., 2015;
Luong et al., 2015; Kim et al., 2016; Sennrich et al., 2016).
Rich morphology: some languages have larger vocabularies than others.
Informal text: very coooooool!
The basic goal is to alleviate sparseness.
One way to categorize subword models is by the units being composed:
Morpheme-based composition: deriving word representations from
morphemes.
Character-based composition: deriving word representations from
characters (quite effective as well, even when used on its own).
Another way is by model architecture:
Recursive models
Convolutional models
Recurrent models
We only briefly discuss several representative methods here (a minimal
byte-pair-encoding sketch follows below).
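As a concrete example of obtaining subword units, byte-pair encoding (Sennrich et al., 2016) repeatedly merges the most frequent adjacent symbol pair; the sketch below shows the merge-learning loop over a toy word-frequency dictionary (the vocabulary and the number of merges are illustrative):

import collections
import re

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every free-standing occurrence of the pair into one symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Toy corpus: words are pre-split into characters plus an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(10):  # the number of merges is a hyperparameter
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print("merged:", best)

At application time, the learned merge operations are replayed in order on new words to segment them into subword units.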
Subword Composition: Recursive Networks
Morphological Recursive Neural Networks (Luong et al., 2013):
Extending recursive neural networks (Socher et al., 2011b) to learn word
representations through composition over morphemes.
Assumes the availability of morphemic analyses.
Each tree node combines a stem vector and an affix vector.
Figure: Context-insensitive (left) and context-sensitive (right) morphological
recursive neural networks.
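As a rough illustration only (not the exact formulation of Luong et al., 2013), the sketch below composes a word vector from morpheme vectors with a single shared recursive unit; the morpheme vocabulary, dimensionality, and tree shape are hypothetical:

import numpy as np

rng = np.random.default_rng(0)
dim = 4
# Hypothetical morpheme embeddings; real models learn these jointly.
morphemes = ["un", "fortunate", "ly"]
emb = {m: rng.normal(scale=0.1, size=dim) for m in morphemes}

# A single composition function shared across all tree nodes.
W = rng.normal(scale=0.1, size=(dim, 2 * dim))
b = np.zeros(dim)

def compose(left, right):
    """One recursive-network node: combine two child vectors into a parent."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# Compose "unfortunately" over its morphemic analysis, ((un fortunate) ly);
# in practice the tree shape comes from a morphological analyzer.
word_vec = compose(compose(emb["un"], emb["fortunate"]), emb["ly"])
print(word_vec.shape)  # (4,)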
Subword Composition: Recurrent Networks
Bi-directional LSTM for subword composition (Ling et al., 2015).
Figure: Character RNN for subword composition.
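A minimal sketch of the idea, assuming a bare single-layer LSTM (no biases or peephole connections); here the final forward and backward states are simply concatenated, whereas Ling et al. (2015) combine them through learned projections:

import numpy as np

rng = np.random.default_rng(0)
char_dim, hid = 8, 16
char_emb = {c: rng.normal(scale=0.1, size=char_dim)
            for c in "abcdefghijklmnopqrstuvwxyz"}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_params():
    # One weight matrix per gate, acting on [input; previous hidden state].
    return {g: rng.normal(scale=0.1, size=(hid, char_dim + hid)) for g in "ifoc"}

def lstm_final_state(params, xs):
    """Run a single-layer LSTM over a list of vectors and return the last h."""
    h, c = np.zeros(hid), np.zeros(hid)
    for x in xs:
        z = np.concatenate([x, h])
        i, f, o = (sigmoid(params[g] @ z) for g in "ifo")
        c = f * c + i * np.tanh(params["c"] @ z)
        h = o * np.tanh(c)
    return h

fwd, bwd = lstm_params(), lstm_params()

def word_vector(word):
    xs = [char_emb[ch] for ch in word]
    # Read the characters left-to-right and right-to-left, then concatenate.
    return np.concatenate([lstm_final_state(fwd, xs),
                           lstm_final_state(bwd, xs[::-1])])

print(word_vector("coooooool").shape)  # (32,)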
Subword Composition: Convolutional Networks
Convolutional neural networks for subword composition (Zhang et al.,
2015).
Figure: Character CNN for subword composition.
In general, subword models have been successfully used in a wide
variety of problems, such as machine translation, sentiment analysis, and
question answering.
You should seriously consider them in situations where the OOV rate is
high or the word distribution has a long tail.
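A minimal sketch of a character-CNN word encoder with a single filter width (real models use many filters of several widths followed by pooling, often with further layers on top); the dimensions and nonlinearity are illustrative:

import numpy as np

rng = np.random.default_rng(0)
char_dim, n_filters, width = 8, 5, 3
char_emb = {c: rng.normal(scale=0.1, size=char_dim)
            for c in "abcdefghijklmnopqrstuvwxyz"}

# Each filter looks at a window of `width` consecutive character embeddings.
filters = rng.normal(scale=0.1, size=(n_filters, width * char_dim))
bias = np.zeros(n_filters)

def word_vector(word):
    """Convolve the filters over the character sequence, then max-pool over time.
    Assumes len(word) >= width; shorter words would need padding."""
    X = np.stack([char_emb[ch] for ch in word])                 # (len, char_dim)
    windows = [X[i:i + width].ravel() for i in range(len(word) - width + 1)]
    feats = np.tanh(np.stack(windows) @ filters.T + bias)       # (n_windows, n_filters)
    return feats.max(axis=0)                                    # (n_filters,)

print(word_vector("coooooool"))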
Outline
1 Introduction
Semantic composition
Formal methods
Simple parametric models
2 Parameterizing Composition Functions
Recurrent composition models
Recursive composition models
Convolutional composition models
Unsupervised models
3 Selected Topics
Compositionality and non-compositionality
Subword composition methods
4 Summary
Summary
This tutorial has discussed semantic composition with distributed
representations learned with neural networks.
Neural networks are able to learn powerful representations and
complicated composition functions.
These models can achieve state-of-the-art performance on a wide
range of NLP tasks.
We expect further studies to continue to deepen our understanding
of such approaches:
Unsupervised models
Compositionality with other “ingredients” of intelligence
Compositionality across multiple modalities
Interpretability of models
Distributed vs./and symbolic composition models
...
References
C. E. Osgood, G. J. Suci, and P. H. Tannenbaum. The
Measurement of Meaning. University of Illinois Press, 1957.
Richard Montague. “English as a Formal Language”. In: Linguaggi
nella società e nella tecnica. Ed. by Bruno Visentini. Edizioni di
Comunità, 1970, pp. 188–221.
G. A. Miller and P. N. Johnson-Laird. Language and perception.
Cambridge, MA: Belknap Press, 1976.
J. A. Fodor and Z. W. Pylyshyn. “Connectionism and cognitive
architecture: A critical analysis”. In: Cognition 28 (1988), pp. 3–71.
Jordan B. Pollack. “Recursive Distributed Representations”. In:
Artif. Intell. 46.1-2 (1990), pp. 77–105.
Ronald J. Williams. “Simple Statistical Gradient-Following
Algorithms for Connectionist Reinforcement Learning”. In: Machine
Learning 8 (1992), pp. 229–256.
Barbara Partee. “Lexical semantics and compositionality”. In:
Invitation to Cognitive Science 1 (1995), pp. 311–360.
Elie Bienenstock, Stuart Geman, and Daniel Potter.
“Compositionality, MDL Priors, and Object Recognition”. In: NIPS.
1996.
Enrico Francesconi et al. “Logo Recognition by Recursive Neural
Networks”. In: GREC. 1997.
Jeff Mitchell and Mirella Lapata. “Vector-based Models of Semantic
Composition”. In: ACL. 2008, pp. 236–244.
Richard Socher et al. “Dynamic Pooling and Unfolding Recursive
Autoencoders for Paraphrase Detection”. In: NIPS. 2011,
pp. 801–809.
Richard Socher et al. “Parsing Natural Scenes and Natural Language
with Recursive Neural Networks”. In: ICML. 2011, pp. 129–136.
Richard Socher et al. “Semi-Supervised Recursive Autoencoders for
Predicting Sentiment Distributions”. In: EMNLP. 2011.
Richard Socher et al. “Semantic Compositionality through Recursive
Matrix-Vector Spaces”. In: EMNLP-CoNLL. 2012, pp. 1201–1211.
Nal Kalchbrenner and Phil Blunsom. “Recurrent Continuous
Translation Models”. In: EMNLP. 2013.
Diederik P. Kingma and Max Welling. “Auto-Encoding Variational
Bayes”. In: CoRR abs/1312.6114 (2013).
Thang Luong, Richard Socher, and Christopher D. Manning.
“Better Word Representations with Recursive Neural Networks for
Morphology”. In: CoNLL. 2013.
Saif Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu.
“NRC-Canada: Building the State-of-the-Art in Sentiment Analysis
of Tweets”. In: SemEval@NAACL-HLT. 2013.
Richard Socher et al. “Recursive Deep Models for Semantic
Compositionality Over a Sentiment Treebank”. In: EMNLP. 2013,
pp. 1631–1642.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural
machine translation by jointly learning to align and translate”. In:
arXiv preprint arXiv:1409.0473 (2014).
Jan A. Botha and Phil Blunsom. “Compositional Morphology for
Word Representations and Language Modelling”. In: ICML. 2014.
Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. “A
convolutional neural network for modelling sentences”. In: arXiv
preprint arXiv:1404.2188 (2014).
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. “Sequence to
sequence learning with neural networks”. In: Advances in neural
information processing systems. 2014, pp. 3104–3112.
Xiaodan Zhu et al. “An Empirical Study on the Effect of Negation
Words on Sentiment”. In: ACL. 2014.
Ryan Kiros et al. “Skip-thought vectors”. In: Advances in neural
information processing systems. 2015, pp. 3294–3302.
Phong Le and Willem Zuidema. “Compositional Distributional
Semantics with Long Short Term Memory”. In:
*SEM@NAACL-HLT. 2015.
Wang Ling et al. “Finding Function in Form: Compositional
Character Models for Open Vocabulary Word Representation”. In:
EMNLP. 2015.
Thang Luong et al. “Addressing the Rare Word Problem in Neural
Machine Translation”. In: ACL. 2015.
Kai Sheng Tai, Richard Socher, and Christopher D. Manning.
“Improved Semantic Representations From Tree-Structured Long
Short-Term Memory Networks”. In: ACL. 2015, pp. 1556–1566.
Xiang Zhang and Yann LeCun. “Text Understanding from Scratch”.
In: CoRR abs/1502.01710 (2015).
Xiaodan Zhu, Hongyu Guo, and Parinaz Sobhani. “Neural Networks
for Integrating Compositional and Non-compositional Sentiment in
Sentiment Composition”. In: *SEM@NAACL-HLT. 2015.
Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. “Long Short-Term
Memory Over Recursive Structures”. In: ICML. 2015,
pp. 1604–1612.
Samuel R Bowman et al. “A fast unified model for parsing and
sentence understanding”. In: arXiv preprint arXiv:1603.06021
(2016).
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning.
MIT Press, 2016.
Nal Kalchbrenner et al. “Neural Machine Translation in Linear
Time”. In: CoRR abs/1610.10099 (2016).
Yoon Kim et al. “Character-Aware Neural Language Models”. In:
AAAI. 2016.
B. M. Lake et al. “Building Machines that Learn and Think Like
People”. In: Behavioral and Brain Sciences (2016). In press.
Tsendsuren Munkhdalai and Hong Yu. “Neural Tree Indexers for
Text Understanding”. In: CoRR abs/1607.04492 (2016).
Rico Sennrich, Barry Haddow, and Alexandra Birch. “Neural
Machine Translation of Rare Words with Subword Units”. In: ACL.
2016.
Dani Yogatama et al. “Learning to Compose Words into Sentences
with Reinforcement Learning”. In: CoRR abs/1611.09100 (2016).
Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. “DAG-Structured
Long Short-Term Memory for Semantic Compositionality”. In:
NAACL. 2016.
Qian Chen et al. “Enhanced LSTM for Natural Language Inference”.
In: ACL. 2017.
Adina Williams, Nikita Nangia, and Samuel R. Bowman. “A
Broad-Coverage Challenge Corpus for Sentence Understanding
through Inference”. In: CoRR abs/1704.05426 (2017).
Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th
, 2017 119 / 119

More Related Content

PPTX
Needs analysis part 2
PPTX
ESP-APPROACH NOT PRODUCT
PPTX
Course planning and syllabus design
PPTX
Syllabus desing
PPT
Language Learning Strategies
PPTX
Chapter no. 4
PPTX
Mixture varieties
PPTX
Task based syllabus
Needs analysis part 2
ESP-APPROACH NOT PRODUCT
Course planning and syllabus design
Syllabus desing
Language Learning Strategies
Chapter no. 4
Mixture varieties
Task based syllabus

What's hot (20)

PPTX
Genre based approach
PPT
The grammar translation method
PDF
Applied linguisticss
PPTX
Forensic linguistics
PPTX
Corpus annotation for corpus linguistics (nov2009)
PPTX
Structural syllabusppw
PPTX
Lecture 1 Materials Development and Adaptation
PPTX
Syllabus format
PPTX
Larry selinker's interlanguage
PPTX
Computational linguistics
PPTX
Content Based Instruction
PPTX
Evaluation in ESP
PDF
DEVELOPING LISTENING MATERIAL
PPTX
Needs analysis
PPTX
Language descriptions
PPTX
Discourse Analysis
PPTX
Corpus linguistics
PPT
Transformational grammar
PPTX
CALL and SLA
Genre based approach
The grammar translation method
Applied linguisticss
Forensic linguistics
Corpus annotation for corpus linguistics (nov2009)
Structural syllabusppw
Lecture 1 Materials Development and Adaptation
Syllabus format
Larry selinker's interlanguage
Computational linguistics
Content Based Instruction
Evaluation in ESP
DEVELOPING LISTENING MATERIAL
Needs analysis
Language descriptions
Discourse Analysis
Corpus linguistics
Transformational grammar
CALL and SLA
Ad

Similar to Deep Learning for Semantic Composition (20)

PDF
Prep Teachers SSP Handbook 2015
PPTX
Su 2012 ss syntax(1)
PPTX
PPTX
Introduction to linguistics syntax
PPT
Generative grammar
PDF
Vocabulary by atheer
DOC
CHAPTER I.doc
PPT
Analyzing language complexity of Chinese and African Learners
PPTX
Lexical Teaching
DOC
Ngu phap
PPT
Presentation alex
PPT
6 july learning to read reading to learn
PPT
Analyzing language complexity of Chinese and African Learners
PDF
Compound Adjectives in English
PPTX
Using Corpus Linguistics to Teach ESL Pronunication
PDF
Nlp ambiguity presentation
PDF
SYNTAX - PDF
PDF
Saying more with less: 4 ways grammatical metaphor improvesacademic writing
PDF
Journal of Education and Practice
PDF
To dig into_english_forms_issues_group_551019_20
Prep Teachers SSP Handbook 2015
Su 2012 ss syntax(1)
Introduction to linguistics syntax
Generative grammar
Vocabulary by atheer
CHAPTER I.doc
Analyzing language complexity of Chinese and African Learners
Lexical Teaching
Ngu phap
Presentation alex
6 july learning to read reading to learn
Analyzing language complexity of Chinese and African Learners
Compound Adjectives in English
Using Corpus Linguistics to Teach ESL Pronunication
Nlp ambiguity presentation
SYNTAX - PDF
Saying more with less: 4 ways grammatical metaphor improvesacademic writing
Journal of Education and Practice
To dig into_english_forms_issues_group_551019_20
Ad

More from MLReview (13)

PDF
Bayesian Non-parametric Models for Data Science using PyMC
PDF
Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...
PDF
Tutorial on Deep Generative Models
PDF
PixelGAN Autoencoders
PDF
Representing and comparing probabilities: Part 2
PDF
Representing and comparing probabilities
PDF
OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
PDF
Theoretical Neuroscience and Deep Learning Theory
PDF
2017 Tutorial - Deep Learning for Dialogue Systems
PDF
Near human performance in question answering?
PDF
Tutorial on Theory and Application of Generative Adversarial Networks
PDF
Real-time Edge-aware Image Processing with the Bilateral Grid
PDF
Yoav Goldberg: Word Embeddings What, How and Whither
Bayesian Non-parametric Models for Data Science using PyMC
Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...
Tutorial on Deep Generative Models
PixelGAN Autoencoders
Representing and comparing probabilities: Part 2
Representing and comparing probabilities
OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
Theoretical Neuroscience and Deep Learning Theory
2017 Tutorial - Deep Learning for Dialogue Systems
Near human performance in question answering?
Tutorial on Theory and Application of Generative Adversarial Networks
Real-time Edge-aware Image Processing with the Bilateral Grid
Yoav Goldberg: Word Embeddings What, How and Whither

Recently uploaded (20)

PPTX
Introduction to Fisheries Biotechnology_Lesson 1.pptx
PPTX
microscope-Lecturecjchchchchcuvuvhc.pptx
PDF
An interstellar mission to test astrophysical black holes
PPTX
Cell Membrane: Structure, Composition & Functions
PPTX
2. Earth - The Living Planet earth and life
PDF
VARICELLA VACCINATION: A POTENTIAL STRATEGY FOR PREVENTING MULTIPLE SCLEROSIS
PPTX
famous lake in india and its disturibution and importance
PPTX
The KM-GBF monitoring framework – status & key messages.pptx
PDF
Formation of Supersonic Turbulence in the Primordial Star-forming Cloud
PDF
AlphaEarth Foundations and the Satellite Embedding dataset
PPTX
2Systematics of Living Organisms t-.pptx
PPT
The World of Physical Science, • Labs: Safety Simulation, Measurement Practice
PPTX
INTRODUCTION TO EVS | Concept of sustainability
PPTX
GEN. BIO 1 - CELL TYPES & CELL MODIFICATIONS
PPTX
TOTAL hIP ARTHROPLASTY Presentation.pptx
PDF
Placing the Near-Earth Object Impact Probability in Context
PPTX
cpcsea ppt.pptxssssssssssssssjjdjdndndddd
PDF
The scientific heritage No 166 (166) (2025)
PPTX
Protein & Amino Acid Structures Levels of protein structure (primary, seconda...
PPTX
7. General Toxicologyfor clinical phrmacy.pptx
Introduction to Fisheries Biotechnology_Lesson 1.pptx
microscope-Lecturecjchchchchcuvuvhc.pptx
An interstellar mission to test astrophysical black holes
Cell Membrane: Structure, Composition & Functions
2. Earth - The Living Planet earth and life
VARICELLA VACCINATION: A POTENTIAL STRATEGY FOR PREVENTING MULTIPLE SCLEROSIS
famous lake in india and its disturibution and importance
The KM-GBF monitoring framework – status & key messages.pptx
Formation of Supersonic Turbulence in the Primordial Star-forming Cloud
AlphaEarth Foundations and the Satellite Embedding dataset
2Systematics of Living Organisms t-.pptx
The World of Physical Science, • Labs: Safety Simulation, Measurement Practice
INTRODUCTION TO EVS | Concept of sustainability
GEN. BIO 1 - CELL TYPES & CELL MODIFICATIONS
TOTAL hIP ARTHROPLASTY Presentation.pptx
Placing the Near-Earth Object Impact Probability in Context
cpcsea ppt.pptxssssssssssssssjjdjdndndddd
The scientific heritage No 166 (166) (2025)
Protein & Amino Acid Structures Levels of protein structure (primary, seconda...
7. General Toxicologyfor clinical phrmacy.pptx

Deep Learning for Semantic Composition

  • 1. Deep Learning for Semantic Composition Xiaodan Zhu∗ & Edward Grefenstette† ∗National Research Council Canada Queen’s University zhu2048@gmail.com †DeepMind etg@google.com July 30th, 2017 Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 1 / 119
  • 2. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 2 / 119
  • 3. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 3 / 119
  • 4. Principle of Compositionality Principle of compositionality: The meaning of a whole is a function of the meaning of the parts. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 4 / 119
  • 5. Principle of Compositionality Principle of compositionality: The meaning of a whole is a function of the meaning of the parts. While we focus on natural language, compositionality exists not just in language. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 4 / 119
  • 6. Principle of Compositionality Principle of compositionality: The meaning of a whole is a function of the meaning of the parts. While we focus on natural language, compositionality exists not just in language. Sound/music Music notes are composed with some regularity but not randomly arranged to form a song. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 4 / 119
  • 7. Principle of Compositionality Principle of compositionality: The meaning of a whole is a function of the meaning of the parts. While we focus on natural language, compositionality exists not just in language. Sound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision Natural scenes are composed of meaningful components. Artificial visual art pieces often convey certain meaning with regularity from their parts. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 4 / 119
  • 8. Principle of Compositionality Compositionality is regarded by many as a fundamental component of intelligence in addition to language understanding (Miller et al., 1976; Fodor et al., 1988; Bienenstock et al., 1996; Lake et al., 2016). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 5 / 119
  • 9. Principle of Compositionality Compositionality is regarded by many as a fundamental component of intelligence in addition to language understanding (Miller et al., 1976; Fodor et al., 1988; Bienenstock et al., 1996; Lake et al., 2016). For example, Lake et al. (2016) emphasize several essential ingredients for building machines that “learn and think like people”: Compositionality Intuitive physics/psychology Learning-to-learn Causality models Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 5 / 119
  • 10. Principle of Compositionality Compositionality is regarded by many as a fundamental component of intelligence in addition to language understanding (Miller et al., 1976; Fodor et al., 1988; Bienenstock et al., 1996; Lake et al., 2016). For example, Lake et al. (2016) emphasize several essential ingredients for building machines that “learn and think like people”: Compositionality Intuitive physics/psychology Learning-to-learn Causality models Note that many of these challenges present in natural language understanding. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 5 / 119
  • 11. Principle of Compositionality Compositionality is regarded by many as a fundamental component of intelligence in addition to language understanding (Miller et al., 1976; Fodor et al., 1988; Bienenstock et al., 1996; Lake et al., 2016). For example, Lake et al. (2016) emphasize several essential ingredients for building machines that “learn and think like people”: Compositionality Intuitive physics/psychology Learning-to-learn Causality models Note that many of these challenges present in natural language understanding. They are reflected in the sparseness in training a NLP model. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 5 / 119
  • 12. Principle of Compositionality Compositionality is regarded by many as a fundamental component of intelligence in addition to language understanding (Miller et al., 1976; Fodor et al., 1988; Bienenstock et al., 1996; Lake et al., 2016). For example, Lake et al. (2016) emphasize several essential ingredients for building machines that “learn and think like people”: Compositionality Intuitive physics/psychology Learning-to-learn Causality models Note that many of these challenges present in natural language understanding. They are reflected in the sparseness in training a NLP model. Note also that compositionality may be entangled with the other “ingredients” listed above. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 5 / 119
  • 13. Semantic Composition in Natural Language good → very good → not very good → ... Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 6 / 119
  • 14. Semantic Composition in Natural Language Figure: Results from (Zhu et al., 2014). A dot in the figure corresponds to a negated phrase (e.g., not very good) in Stanford Sentiment Treebank (Socher et al., 2013). The y-axis is its sentiment value and x-axis the sentiment of its argument. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 7 / 119
  • 15. Semantic Composition in Natural Language Figure: Results from (Zhu et al., 2014). A dot in the figure corresponds to a negated phrase (e.g., not very good) in Stanford Sentiment Treebank (Socher et al., 2013). The y-axis is its sentiment value and x-axis the sentiment of its argument. Even a one-layer composition, over one dimension of meaning (e.g., semantic orientation (Osgood et al., 1957)), could be a complicated mapping. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 7 / 119
  • 16. Semantic Composition in Natural Language good → very good → not very good → ... senator → former senator → ... basketball player → short basketball player → ... giant → small giant → ... empty/full → half empty/full → almost half empty/full → ...1 1 See more examples in (Partee, 1995). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 8 / 119
  • 17. Semantic Composition in Natural Language Semantic composition in natural language: the task of modelling the meaning of a larger piece of text by composing the meaning of its constituents. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 9 / 119
  • 18. Semantic Composition in Natural Language Semantic composition in natural language: the task of modelling the meaning of a larger piece of text by composing the meaning of its constituents. modelling: learning a representation The compositionality in language is very challenging as discussed above. Compositionality can entangle with other challenges such as those emphasized in (Lake et al., 2016). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 9 / 119
  • 19. Semantic Composition in Natural Language Semantic composition in natural language: the task of modelling the meaning of a larger piece of text by composing the meaning of its constituents. modelling: learning a representation The compositionality in language is very challenging as discussed above. Compositionality can entangle with other challenges such as those emphasized in (Lake et al., 2016). a larger piece of text: a phrase, sentence, or document. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 9 / 119
  • 20. Semantic Composition in Natural Language Semantic composition in natural language: the task of modelling the meaning of a larger piece of text by composing the meaning of its constituents. modelling: learning a representation The compositionality in language is very challenging as discussed above. Compositionality can entangle with other challenges such as those emphasized in (Lake et al., 2016). a larger piece of text: a phrase, sentence, or document. constituents: subword components, words, phrases. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 9 / 119
  • 21. Semantic Composition in Natural Language Semantic composition in natural language: the task of modelling the meaning of a larger piece of text by composing the meaning of its constituents. modelling: learning a representation The compositionality in language is very challenging as discussed above. Compositionality can entangle with other challenges such as those emphasized in (Lake et al., 2016). a larger piece of text: a phrase, sentence, or document. constituents: subword components, words, phrases. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 9 / 119
  • 22. Introduction Two key problems: How to represent meaning? How to learn such a representation? Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 10 / 119
  • 23. Representation Let’s first very briefly revisit the representation we assume in this tutorial ... and leave the learning problem to the entire tutorial that follows. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 11 / 119
  • 24. Representation Let’s first very briefly revisit the representation we assume in this tutorial ... and leave the learning problem to the entire tutorial that follows. Love Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 11 / 119
  • 25. Representation Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 12 / 119
  • 26. Representation love, admiration, satisfaction ... anger, fear, hunger ... Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 12 / 119
  • 27. Representation A viewpoint from The Emotion Machine (Minsky, 2006) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 13 / 119
  • 28. Representation A viewpoint from The Emotion Machine (Minsky, 2006) Each variable responds to different concepts and each concept is represented by different variables. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 13 / 119
  • 29. Representation A viewpoint from The Emotion Machine (Minsky, 2006) Each variable responds to different concepts and each concept is represented by different variables. This is exactly a distributed representation. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 13 / 119
  • 30. Representation Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 14 / 119
  • 31. Modelling Composition Functions How do we model the composition functions? Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 15 / 119
  • 32. Representation Deep Learning for Semantic Composition Deep learning: We focus on deep learning models in this tutorial. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 16 / 119
  • 33. Representation Deep Learning for Semantic Composition Deep learning: We focus on deep learning models in this tutorial. “Wait a minute, deep learning again?” “DL people, leave language along ...” Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 16 / 119
  • 34. Representation Deep Learning for Semantic Composition Deep learning: We focus on deep learning models in this tutorial. “Wait a minute, deep learning again?” “DL people, leave language along ...” Asking some questions may be helpful: Are deep learning models providing nice function or density approximation, the problems that many specific NLP tasks essentially seek to solve? X→Y Are continuous vector representations of meaning effective for (as least some) NLP tasks? Are DL models convenient for computing such continuous representations? Do DL models naturally bridge language with other modalities in terms of both representation and learning? (this could be important.) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 16 / 119
  • 35. Introduction More questions: What NLP problems (e.g., semantic problems here) can be better handled with DL and what cannot? Can NLP benefit from combining DL and other approaches (e.g., symbolic approaches)? In general, has the effectiveness of DL models for semantics already been well understood? Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 17 / 119
  • 36. Introduction Deep Learning for Semantic Composition Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 18 / 119
  • 37. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 19 / 119
  • 38. Formal Semantics Montague Semantics (1970–1973): Treat natural language like a formal language via an interpretation function [[. . .]], and a mapping from CFG rules to function application order. Interpretation of a sentence reduces to logical form via β-reduction. High Level Idea Syntax guides composition, types determine their semantics, predicate logic does the rest. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 20 / 119
  • 39. Formal Semantics Syntactic Analysis Semantic Interpretation S ⇒ NP VP [[VP]]([[NP]]) NP ⇒ cats, milk, etc. [[cats]], [[milk]], . . . VP ⇒ Vt NP [[Vt]]([[NP]]) Vt ⇒ like, hug, etc. λyx.[[like]](x, y), . . . [[like]]([[cats]], [[milk]]) [[cats]] λx.[[like]](x, [[milk]]) λyx.[[like]](x, y) [[milk]] Cats like milk. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 21 / 119
  • 40. Formal Semantics Pros: Intuitive and interpretable(?) representations. Leverage the power of predicate logic to model semantics. Evaluate the truth of statements, derive conclusions, etc. Cons: Brittle, requires robust parsers. Extensive logical model required for evaluation of clauses. Extensive set of rules required to do anything useful. Overall, an intractable (or unappealing) learning problem. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 22 / 119
  • 41. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 23 / 119
  • 42. Simple Parametric Models Basic models with pre-defined function form (Mitchell et al., 2008): General form : p = f (u, v, R, K) Add : p = u + v WeightAdd : p = αT u + βT v Multiplicative : p = u ⊗ v Combined : p = αT u + βT v + γT (u ⊗ v) We will see later in this tutorial that the above models could be seen as special cases of more complicated composition models. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 24 / 119
  • 43. Results Reference (R): The color ran. High-similarity landmark (H): The color dissolved. Low-similarity landmark (L): The color galloped. A good composition model should give the above R-H pair a similarity score higher than that given to the R-L pair. Also, a good model should assign such similarity scores with a high correlation (ρ) to what human assigned. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 25 / 119
  • 44. Results Reference (R): The color ran. High-similarity landmark (H): The color dissolved. Low-similarity landmark (L): The color galloped. A good composition model should give the above R-H pair a similarity score higher than that given to the R-L pair. Also, a good model should assign such similarity scores with a high correlation (ρ) to what human assigned. Models R-H similarity R-L similarity ρ NonComp 0.27 0.26 0.08** Add 0.59 0.59 0.04* WeightAdd 0.35 0.34 0.09** Kintsch 0.47 0.45 0.09** Multiply 0.42 0.28 0.17** Combined 0.38 0.28 0.19** UpperBound 4.94 3.25 0.40** Table: Mean cosine similarities for the R-H pairs and R-L pairs as well as the correlation coefficients (ρ) with human judgments (*: p < 0.05, **: p < 0.01). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 25 / 119
  • 45. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 26 / 119
  • 46. Parameterizing Composition Functions To move beyond simple algebraic or parametric models we need function approximators which, ideally: Can approximate any arbitrary function (e.g. ANNs). Can cope with variable size sequences. Can capture long range or unbounded dependencies. Can implicitly or explicitly model structure. Can be trained against a supervised or unsupervised objective (or both — semi-supervised training). Can be trained chiefly or primarily through backpropagation. A Neural Network Model Zoo This section presents a selection of models satisfying some (if not all) of these criteria. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 27 / 119
  • 47. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 28 / 119
  • 48. Recurrent Neural Networks Bounded Methods Many methods impose explicit or implicit length limits on conditioning information. For example: order-n Markov assumption in NLM/LBL fully-connected layers and dynamic pooling in conv-nets wj f(w1:j) hj-1 hj Recurrent Neural Networks introduce a repeatedly composable unit, the recurrent cell, which both models an unbounded sequence prefix and express a function over it. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 29 / 119
  • 49. The Mathematics of Recurrence wj f(w1:j) hj-1 hj previous state next state inputs outputs Building Blocks An input vector wj ∈ R|w| A previous state hj−1 ∈ R|h| A next state hj ∈ R|h| An output yj ∈ R|y| fy : R|w| × R|h| → R|y| fh : R|w| × R|h| → R|h| Putting it together hj = fh(wj , hj−1) yj = fy (wj , hj ) So yj = fy (wj , fh(wj−1, hj−1)) = fy (wj , fh(wj−1, fh(wj−2, hj−2))) = . . . Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 30 / 119
  • 50. RNNs for Language Modelling Language modelling We want to model the joint probability of tokens t1, . . . tn in a sequence: P(t1, . . . tn) = P(t1) n i=2 P(ti |t1, . . . ti−1) Adapting a recurrence for basic LM For vocab V, define an embedding matrix E ∈ R|V |×|w| and a logit projection matrix WV ∈ R|y|×|V |. Then: wj = embed(tj , E) yj = fy (wj , hj ) hj = fh(wj , hj−1) pj = softmax(yj WV ) P(tj+1|t1, . . . , tj ) = Categorical(tj+1; pj ) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 31 / 119
  • 51. Aside: The Vanishing Gradient Problem and LSTM RNNs RNN is deep “by time”, so it could seriously suffer from the vanishing gradient issue. LSTM configures memory cells and multiple “gates” to control information flow. If properly learned, LSTM can keep pretty long-distance (hundreds of time steps) information in memory. Memory-cell details: it = σ(Wxi xt + Whi ht−1 + Wci ct−1) ft = σ(Wxf xt + Whf ht−1 + Wcf ct−1) ct = σ(ftct−1 + ittanh(Wxc xt + Whc ht−1)) ot = σ(Wxoxt + Whoht−1 + Wcoct) ht = σ(ottanh(ct)) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 32 / 119
  • 52. Conditional Language Models Conditional Language Modelling A strength of RNNs is that hj can model not only the history of the generated/observed sequence t1, . . . , tj , but any conditioning information β, e.g. by setting h0 = β. w1 w2 w3 w1 w2 w3 Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 33 / 119
  • 53. Encoder-Decoder Models with RNNs Les chiens aiment les os ||| Dogs love bones Dogs love bones </s> Source sequence Target sequence cf. Kalchbrenner et al., 2013; Sutskever et al., 2014 Model p(t1, . . . , tn|s1, . . . , sm) he i = RNNencoder (si , he i−1) hd i = RNNdecoder (ti , hd i−1) hd 0 = he m ti+1 ∼ Categorical(t; fV (hi )) The encoder RNN as a composition module All information needed to transduce the source into the target sequence using RNNdecoder needs to be present in the start state hd 0 . This start state is produced by RNNencoder , which will learn to compose. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 34 / 119
  • 54. RNNs as Sentence Encoders This idea of RNNs as sentence encoder works for classification as well: Data is labelled sequences (s1, . . . , s|s|; ˆy). RNN is run over s to produce final state h|s| = RNN(s). A differentiable function of h|s| classifies: y = fθ(h|s|) h|s| can be taken to be the composed meaning of s, with regard to the task at hand. An aside: Bi-directional RNN encoders For both sequence classification and generation, sometimes a Bi-directional RNN is used to encode: h← i = RNN← (si , h← i+1) h→ i = RNN→ (si , h→ i−1) h|s| = concat(h← 1 , h→ |s|) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 35 / 119
  • 55. A Transduction Bottleneck Les chiens aiment les os ||| Dogs love bones Dogs love bones </s> Source sequence Target sequence Single vector representation of sentences causes problems: Training focusses on learning marginal language model of target language first. Longer input sequences cause compressive loss. Encoder gets significantly diminished gradient. In the words of Ray Mooney. . . “You can’t cram the meaning of a whole %&!$ing sentence into a single $&!*ing vector!” Yes, the censored-out swearing is copied verbatim. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 36 / 119
  • 56. Attention Les chiens aiment les os ||| Dogs love bones Dogs love bones </s> Source sequence Target sequence cf. Bahdanau et al., 2014 We want to use he 1, . . . , he m when predicting ti by conditioning on words that might relate to ti : 1 Compute hd i (RNN update) 2 eij = fatt(hd i , he j ) 3 aij = softmax(ei )j 4 hatt i = m j=1 aij he j 5 ˆhi = concat(hd i , hatt i ) 6 ti+1 ∼ Categorical(t; fV (ˆhi )) The many faces of attention Many variants on the above process: early attention (based on hd i−1 and ti , used to update hd i ), different attentive functions fatt (e.g. based on projected inner products, or MLPs), and so on. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 37 / 119
  • 57. Attention and Composition We refer to the set of source activation vectors he 1, . . . , he m in the previous slides as an attention matrix. Is it a suitable sentence representation? Pros: Locally compositional: vectors contain information about other words (especially with bi-directional RNN as encoder). Variable size sentence representation: longer sentences yield larger representation with more capacity. Cons: Single vector representation of sentences is convenient (many decoders, classifiers, etc. expect fixed-width feature vectors as input) Locally compositional, but are long range dependencies resolved in the attention matrix? Does it truly express the sentence’s meaning as a semantic unit (or is it just good for sequence transduction)? Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 38 / 119
  • 58. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 39 / 119
  • 59. Recursive Neural Networks Recursive networks: a generalization of (chain) recurrent networks with a computational graph, often a tree (Pollack, 1990; Francesconi et al., 1997; Socher et al., 2011a,b,c, 2013; Zhu et al., 2015b) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 40 / 119
  • 60. Recursive Neural Networks Successfully applied to consider input data structures. Natural language processing (Socher et al., 2011a,c; Le et al., 2015; Tai et al., 2015; Zhu et al., 2015b) Computer vision (Socher et al., 2011b) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 41 / 119
  • 61. Recursive Neural Networks Successfully applied to consider input data structures. Natural language processing (Socher et al., 2011a,c; Le et al., 2015; Tai et al., 2015; Zhu et al., 2015b) Computer vision (Socher et al., 2011b) How to determine the structures. Encode given “external” knowledge about the structure of the input data, e.g., syntactic structures; modelling sentential semantics and syntax is one of the most interesting problems in language. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 41 / 119
  • 62. Recursive Neural Networks Successfully applied to consider input data structures. Natural language processing (Socher et al., 2011a,c; Le et al., 2015; Tai et al., 2015; Zhu et al., 2015b) Computer vision (Socher et al., 2011b) How to determine the structures. Encode given “external” knowledge about the structure of the input data, e.g., syntactic structures; modelling sentential semantics and syntax is one of the most interesting problems in language. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 41 / 119
  • 63. Recursive Neural Networks Successfully applied to consider input data structures. Natural language processing (Socher et al., 2011a,c; Le et al., 2015; Tai et al., 2015; Zhu et al., 2015b) Computer vision (Socher et al., 2011b) How to determine the structures. Encode given “external” knowledge about the structure of the input data, e.g., syntactic structures; modelling sentential semantics and syntax is one of the most interesting problems in language. Encode simply a complete tree. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 41 / 119
  • 64. Integrating Syntactic Parses in Composition Recursive Neural Tensor Network (Socher et al., 2012): The structure is given (here by a constituency parser.) Each node here is implemented as a regular feed-forward layer plus a 3rd -order tensor. The tensor captures 2nd -degree (quadratic) polynomial interaction of children, e.g., b2 i , bi cj , and c2 j . Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 42 / 119
  • 65. Results The models have been successfully applied to a number of tasks such as sentiment analysis (Socher et al., 2013). Table: Accuracy for fine grained (5-class) and binary predictions at the sentence level (root) and for all nodes. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 43 / 119
  • 66. Tree-LSTM Tree-structured LSTM (Le, *SEM-15; Tai, ACL-15; Zhu, ICML-15): It is an extension of chain LSTM to tree structures. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 44 / 119
  • 67. Tree-LSTM Tree-structured LSTM (Le, *SEM-15; Tai, ACL-15; Zhu, ICML-15): It is an extension of chain LSTM to tree structures. If your have a non-binary tree, a simple solution is to binarize it. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 44 / 119
  • 68. Tree-LSTM Application: Sentiment Analysis Sentiment composed over a constituency parse tree: Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 45 / 119
  • 69. Tree-LSTM Application: Sentiment Analysis Results on Stanford Sentiment Treebank (Zhu et al., 2015b): Models roots phrases NB 41.0 67.2 SVM 40.7 64.3 RvNN 43.2 79.0 RNTN 45.7 80.7 Tree-LSTM 48.9 81.9 Table: Performances (accuracy) of models on Stanford Sentiment Treebank, at the sentence level (roots) and the phrase level. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 46 / 119
  • 70. Tree-LSTM Application: Natural Language Inference Applied to Natural Language Inference (NLI): Determine if a sentence entails another, if they contradict, or have no relation (Chen et al., 2017). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 47 / 119
  • 71. Tree-LSTM Application: Natural Language Inference Accuracy on Stanford Natural Language Inference (SNLI) dataset: (Chen et al., 2017) * Welcome to the poster at 6:00-9:30pm on July 31. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 48 / 119
  • 72. Learning Representation for Natural Language Inference RepEval-2017 Shared Task (Williams et al., 2017): Learn sentence representation as a fixed-length vector. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 49 / 119
  • 73. Tree-LSTM without Syntactic Parses How if we simply apply recursive networks over trees that are not generated from syntactic parses, e.g., a complete binary trees? Multiple efforts on SNLI (Munkhdalai et al., 2016; Chen et al., 2017) have observed that the models outperform sequential (chain) LSTM. This could be related to the discussion that recursive nets may capture long-distance dependency (Goodfellow et al., 2016). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 50 / 119
  • 74. SPINN: Doing Away with Test-Time Trees buffer stack t = 0 down sat cat the shift t = 1 down sat cat the shift t = 2 down sat cat the reduce t = 3 down sat the cat shift t = 4 down sat the cat shift t = 5 down sat the cat reduce t = 6 sat down the cat reduce t = 7 = T (the cat) (sat down) output to model for semantic task Image credit: Sam Bowman and co-authors. cf. Bowman et al., 2016 Shift-Reduce Parsers: Exploit isomorphism between binary branching trees with T leaves and sequences of 2T − 1 binary shift/reduce actions. Shift unattached leaves from a buffer onto a processing stack. Reduce the top two child nodes on the stack to a single parent node. SPINN: Jointly train a TreeRNN and a vector-based shift-reduce parser. Training time trees offer supervision for shift-reduce parser. No need for test time trees! Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 51 / 119
  • 75. SPINN:Doing Away with Test-Time Trees buffer down sat stack cat the composition tracking transition reduce down sat the cat composition tracking transition shift down sat the cat tracking Image credit: Sam Bowman and co-authors. Word vectors start on buffer b (top: first word in sentence). Shift moves word vectors from buffer to stack s. Reduce pops top two vectors off the stack, applies f R : Rd × Rd → Rd , and pushes the result back to the stack (i.e. TreeRNN composition). Tracker LSTM tracks parser/composer state across operations, decides shift-reduce operations a, is supervised by both observed shift-reduce operations and end-task: ht = LSTM(f C (bt−1[0], st−1[0], st−1[1]), ht−1) at ∼ f A (ht) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 52 / 119
  • 76. A Quick Introduction to Reinforce What if some part of our process is not differentiable (e.g. samples from the shift-reduce module in SPINN) but we want to learn with no labels. . . x y x y z p(y|x) = Epθ(z|x) [fφ(z, x)] s.t. y ∼ fφ(z, x) or y = fφ(z, x) φp(y|x) = z pθ(z|x) φfφ(z, x) = Epθ(z|x) [ φfφ(z, x)] θp(y|x) = z fφ(z, x) θpθ(z|x) = ??? Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 53 / 119
  • 77. A Quick Introduction to Reinforce The Reinforce Trick (R. J. Williams, 1992) θ log pθ(z|x) = θpθ(z|x) pθ(z|x) ⇒ θpθ(z|x) = pθ(z|x) θ log pθ(z|x) θp(y|x) = z fφ(z, x) θpθ(z|x) = z fφ(z, x)pθ(z|x) θ log pθ(z|x) = Epθ(z|x) [fφ(z, x) θ log pθ(z|x)] This naturally extends to cases where p(z|x) = p(z1, . . . , zn|x). RL vocab: samples of such sequences of of discrete actions are referred to as “traces”. We often refer to pθ(z|x) as a policy πθ(z; x). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 54 / 119
  • 78. SPINN+RL: Doing Away with Training-Time Trees “Drop in” extension to SPINN (Yogatama et al., 2016): Treat at ∼ f A(ht) as policy πA θ (at; ht), trained via Reinforce. Reward is negated loss of the end task, e.g. log-likelihood of the correct label. Everything else is trained by backpropagation against the end task: tracker LSTM, representations, etc. receive gradient both from the supervised objective, and from Reinforce via the shift-reduce policy. a wo man wea ring sun glas ses is frow ning . a boy drag s his sled s thro ugh the sno w . Model recovers linguistic-like structures (e.g. noun phrases, auxiliary verb-verb pairing, etc.). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 55 / 119
  • 79. SPINN+RL: Doing Away with Training-Time Trees Does RL-SPINN work? According to Yogatama et al. (2016): Better than LSTM baselines: model captures and exploits structure. Better than SPINN benchmarks: model is not biased by what linguists think trees should be like, only has a loose inductive biase towards tree structures. But some parses do not reflect order of composition (see below). Semi-supervised setup may be sensible. two men are playi ng frisb ee in the park . fami ly me mbe rs stan ding outs ide a hom e . Some “bad” parses, but not necessarily worse results. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 56 / 119
  • 80. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 57 / 119
  • 81. Convolution Neural Networks Visual Inspiration: How do we learn to recognise pictures? Will a fully connected neural network do the trick? 8 Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 58 / 119
  • 82. ConvNets for pictures Problem: lots of variance that shouldn’t matter (position, rotation, skew, difference in font/handwriting). 8 8 8 8 8 8Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 59 / 119
  • 83. ConvNets for pictures Solution: Accept that features are local. Search for local features with a window. 8 Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 60 / 119
  • 84. ConvNets for pictures Convolutional window acts as a classifer for local features. ⇒ Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 61 / 119
  • 85. ConvNets for pictures Different convolutional maps can be trained to recognise different features (e.g. edges, curves, serifs). ... Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 62 / 119
  • 86. ConvNets for pictures Stacked convolutional layers learn higher-level features. Fully Connected Layer Convolutional Layer 8 8Raw Image First Order Local Features Higher Order Features Prediction One or more fully-connected layers learn classification function over highest level of representation. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 63 / 119
  • 87. ConvNets for language Convolutional neural networks fit natural language well. Deep ConvNets capture: Positional invariances Local features Hierarchical structure Language has: Some positional invariance Local features (e.g. POS) Hierarchical structure (phrases, dependencies) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 64 / 119
  • 88. ConvNets for language How do we go from images to sentences? Sentence matrices! w1 w2 w3 w4 w5 Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 65 / 119
  • 89. ConvNets for language Does a convolutional window make sense for language? w1 w2 w3 w4 w5 Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 66 / 119
  • 90. ConvNets for language A better solution: feature-specific windows. w1 w2 w3 w4 w5 Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 67 / 119
  • 91. Word Level Sentence Vectors with ConvNets K-Max pooling (k=3) Fully connected layer Folding Wide convolution (m=2) Dynamic k-max pooling (k= f(s) =5) Projected sentence matrix (s=7) Wide convolution (m=3) game's the same, just got more fierce cf. Kalchbrenner et al., 2014 Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 68 / 119
  • 92. Character Level Sentence Vectors with ConvNets Image credit: Yoon Kim and co-authors. cf. Kim et al., 2016 Naively, we could just represent everything at character level. Convolutions seem to work well for low-level patterns (e.g. morphology) One interpretation: multiple filters can capture the low-level idiosyncrasies of natural language (e.g. arbitrary spelling) whereas language is more compositional at a higher level. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 69 / 119
  • 93. ConvNet-like Architectures for Composition t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t11 t12 t13 t14 t15 t16t10 s0 s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 s16 t11 t12 t13 t14 t15 t16 t17t10t9t8t7t6t5t4t3t2t1 Image credit: Nal Kalchbrenner and co-authors. cf. Kalchbrenner et al., 2016 Many other CNN-like architectures (e.g. ByteNet from Kalchbrenner et al. (2016)) Common recipe components: dilated convolutions and ResNet blocks. These model sequences well in domains like speech, and are beginning to find applications in NLP, so worth reading up on. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 70 / 119
  • 94. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 71 / 119
  • 95. Unsupervised Composition Models Why care about unsupervised learning? Much more unlabelled linguistic data than labelled data. Learn general purpose representations and composition functions. Suitable pre-training for supervised models, semi-supervised, or multi-task objectives. In the (paraphrased) words of Yann LeCun: unsupervised learning is a cake, supervised learning is frosting, and RL is the cherry on top! Plot twist: it’s possibly a cherry cake. Yes, that’s nice. . . But what are we doing, concretely? Good question! Usually, just modelling—directly or indirectly—some aspect of the probability of the observed data. Further suggestions on a postcard, please! Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 72 / 119
  • 96. Autoencoders Autoencoders provide an unsupervised method for representation learning: We minimise an objective function over inputs xi , i ∈ N and their reconstructions xi : J = 1 2 N i xi − xi 2 Warning: degenerate solution if xi can be updated (∀i.xi = 0). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 73 / 119
  • 97. Recursive Autoencoders cf. Socher et al., 2011a To auto-encode variable length sequences, we can chain autoencoders to create a recursive structure. Objective Function Minimizing the reconstruction error will learn a compression function over the inputs: Erec(i, θ) = 1 2 xi − xi 2 A “modern” alternative: use sequence to sequence model, and log-likelihood objective. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 74 / 119
  • 98. What’s wrong with auto-encoders? Empirically, narrow auto-encoders produce sharp latent codes, and unregularised wide auto-encoders learn identity functions. The reconstruction objective says nothing about distance preservation in latent space: no guarantee that dist(a, b) ≤ dist(a, c) → dist(encode(a), encode(b)) ≤ dist(encode(a), encode(c)). Conversely, there is little incentive for similar latent codes to generate radically different (but semantically equivalent) observations. Ultimately, compression ≠ meaning. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 75 / 119
  • 99. Skip-Thought Image credit: Jamie Kiros and co-authors. cf. Kiros et al., 2015 Similar to the auto-encoding objective: encode a sentence, but decode the neighbouring sentences. A pair of LSTM-based seq2seq models with a shared encoder, but alternative formulations are possible. Conceptually similar to distributional semantics: a unit’s representation is a function of its neighbouring units, except the units are sentences instead of words. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 76 / 119
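A hedged sketch of the Skip-Thought setup as described above: one encoder is shared by two decoders that predict the previous and the next sentence. The exact conditioning, vocabulary handling, and dimensions below are simplified assumptions, not the configuration of Kiros et al. (2015):

```python
import torch
import torch.nn as nn

class SkipThought(nn.Module):
    """Shared sentence encoder; two decoders reconstruct the previous and next sentences."""
    def __init__(self, vocab=10000, d_emb=300, d_hid=600):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_emb)
        self.encoder = nn.LSTM(d_emb, d_hid, batch_first=True)
        self.dec_prev = nn.LSTM(d_emb, d_hid, batch_first=True)
        self.dec_next = nn.LSTM(d_emb, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, vocab)

    def forward(self, cur, prev, nxt):
        _, state = self.encoder(self.emb(cur))            # state = (h, c); h is the sentence vector
        out_prev, _ = self.dec_prev(self.emb(prev), state)
        out_next, _ = self.dec_next(self.emb(nxt), state)
        return self.out(out_prev), self.out(out_next), state[0].squeeze(0)

model = SkipThought()
cur, prev, nxt = (torch.randint(0, 10000, (8, 20)) for _ in range(3))
logits_prev, logits_next, sent_vec = model(cur, prev, nxt)
# Train with cross-entropy (teacher forcing) on logits_prev and logits_next; keep sent_vec as the representation.
```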
  • 100. Variational Auto-Encoders Semantically Weak Codes Generally, auto-encoders sparsely encode or densely compress information. No pressure to ensure a similarity continuum amongst codes. Factorized Generative Picture $p(x) = \int p(x, z)\,dz = \int p(x|z)\,p(z)\,dz = \mathbb{E}_{p(z)}[p(x|z)]$, where $z \sim \mathcal{N}(0, I)$ generates $x$. The prior on z enforces a semantic continuum (e.g. no arbitrarily unrelated codes for similar data), but the expectation is typically intractable to compute exactly, and a Monte Carlo estimate of the gradients will have high variance. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 77 / 119
  • 101. Variational Auto-Encoders Goal Estimate, by maximising p(x): The parameters θ of a function modelling part of the generative process $p_\theta(x|z)$ given samples from a fixed prior $z \sim p(z)$. The parameters φ of a distribution $q_\phi(z|x)$ approximating the true posterior $p(z|x)$. How do we do it? We maximise p(x) via a variational lower bound (VLB): $\log p(x) \geq \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,p(z))$ Equivalently we can minimise NLL(x): $\mathrm{NLL}(x) \leq \mathbb{E}_{q_\phi(z|x)}[\mathrm{NLL}_\theta(x|z)] + D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,p(z))$ Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 78 / 119
  • 102. Variational Auto-Encoders Let’s derive the VLB: $\log p(x) = \log \int p_\theta(x|z)\,p(z)\,dz = \log \int \frac{q_\phi(z|x)}{q_\phi(z|x)}\,p_\theta(x|z)\,p(z)\,dz = \log \mathbb{E}_{q_\phi(z|x)}\left[\frac{p(z)}{q_\phi(z|x)}\,p_\theta(x|z)\right] \geq \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p(z)}{q_\phi(z|x)} + \log p_\theta(x|z)\right] = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,p(z))$ (the inequality is Jensen’s). For the right choice of $q_\phi(z|x)$ and $p(z)$ (e.g. Gaussians) there is a closed-form expression for $D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,p(z))$. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 79 / 119
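For a diagonal Gaussian posterior $q_\phi(z|x) = \mathcal{N}(z; \mu, \operatorname{diag}(\sigma^2))$ and a standard Gaussian prior $p(z) = \mathcal{N}(0, I)$, the closed form referred to above is the standard expression

$$D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p(z)\big) = \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right).$$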
  • 103. Variational Auto-Encoders The problem of stochastic gradients Estimating $\frac{\partial}{\partial \phi}\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ requires backpropagating through samples $z \sim q_\phi(z|x)$. For some choices of q, such as Gaussians, there are reparameterization tricks (cf. Kingma et al., 2013). Reparameterizing Gaussians (Kingma et al., 2013) $z \sim \mathcal{N}(z; \mu, \sigma^2)$ is equivalent to $z = \mu + \sigma \odot \epsilon$ where $\epsilon \sim \mathcal{N}(\epsilon; 0, I)$. Trivially: $\frac{\partial z}{\partial \mu} = 1$, $\frac{\partial z}{\partial \sigma} = \epsilon$. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 80 / 119
  • 104. Variational Auto-Encoders for Sentences 1 Observe a sentence w1, . . . , wn. Encode it, e.g. with an LSTM: $h_e = \mathrm{LSTM}_e(w_1, \dots, w_n)$ 2 Predict $\mu = f^\mu(h_e)$ and $\sigma^2 = f^\sigma(h_e)$ (in practice we operate in log space for $\sigma^2$ by predicting $\log \sigma$). 3 Sample $z \sim q(z|x) = \mathcal{N}(z; \mu, \sigma^2)$. 4 Use a conditional RNN to decode and measure $\log p(x|z)$. Use the closed-form formula for the KL divergence between two Gaussians to calculate $-D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,p(z))$. Add both to obtain the maximisation objective. 5 Backpropagate the gradient through the decoder normally based on the log-likelihood component of the objective, and use the reparameterisation trick to backpropagate through the sampling operation back to the encoder. 6 The gradient of the KL divergence component of the loss with regard to the encoder parameters is straightforward backpropagation. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 81 / 119
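A compact PyTorch sketch of steps 1-5 (single-layer LSTMs, the layer sizes, and conditioning the decoder by mapping z to its initial hidden state are our simplifying assumptions; the class name SentenceVAE is hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceVAE(nn.Module):
    def __init__(self, vocab=10000, d_emb=256, d_hid=512, d_z=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_emb)
        self.encoder = nn.LSTM(d_emb, d_hid, batch_first=True)
        self.to_mu = nn.Linear(d_hid, d_z)
        self.to_logvar = nn.Linear(d_hid, d_z)            # predict log sigma^2 for stability
        self.z_to_h = nn.Linear(d_z, d_hid)               # condition the decoder on z
        self.decoder = nn.LSTM(d_emb, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, vocab)

    def forward(self, words):
        _, (h_e, _) = self.encoder(self.emb(words))                    # step 1: encode the sentence
        mu, logvar = self.to_mu(h_e[-1]), self.to_logvar(h_e[-1])      # step 2: predict mu and log sigma^2
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)        # step 3: reparameterised sample
        h0 = torch.tanh(self.z_to_h(z)).unsqueeze(0)                   # step 4: decode conditioned on z
        dec_out, _ = self.decoder(self.emb(words[:, :-1]), (h0, torch.zeros_like(h0)))
        nll = F.cross_entropy(self.out(dec_out).transpose(1, 2),
                              words[:, 1:], reduction='none').sum(1)   # -log p(x|z)
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(1)      # closed-form KL to N(0, I)
        return (nll + kl).mean()                                       # negative variational lower bound

model = SentenceVAE()
batch = torch.randint(0, 10000, (16, 21))      # token ids; position 0 plays the role of <bos>
loss = model(batch)
loss.backward()                                # step 5: gradients flow through z via the reparameterisation
```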
  • 105. Variational Auto-Encoders and Autoregressivity The problem of powerful auto-regressive decoders We want to minimise $\mathrm{NLL}(x) \leq \mathbb{E}_{q(z|x)}[\mathrm{NLL}(x|z)] + D_{\mathrm{KL}}(q(z|x)\,\|\,p(z))$. What if the decoder is powerful enough to model x without using z? A degenerate solution: If z can be ignored when minimising the reconstruction loss of x given z, the model can safely let q(z|x) collapse to the prior p(z) to minimise $D_{\mathrm{KL}}(q(z|x)\,\|\,p(z))$. Since q need not depend on x (e.g. the encoder can just ignore x and predict the mean and variance of the prior), z bears no relation to x. Result: useless encoder, useless latent variable. Is this really a problem? If your decoder is not auto-regressive (e.g. MLPs expressing the probability of pixels which are conditionally independent given z), then no. If your decoder is an RNN and the domain has systematic patterns, then yes. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 82 / 119
  • 106. Variational Auto-Encoders and Autoregressivity What are some solutions to this problem? Pick a non-autoregressive decoder. If you care more about the latent code than having a good generative model (e.g. document modelling), this isn’t a bad idea, but it is frustrating if this is the only solution. KL Annealing: set $\mathbb{E}_{q(z|x)}[\mathrm{NLL}(x|z)] + \alpha D_{\mathrm{KL}}(q(z|x)\,\|\,p(z))$ as the objective. Start with α = 0 (a basic seq2seq model). Increase α to 1 over time during training. Works somewhat, but it is an unprincipled change of the objective function. Set as objective $\mathbb{E}_{q(z|x)}[\mathrm{NLL}(x|z)] + \max(\lambda, D_{\mathrm{KL}}(q(z|x)\,\|\,p(z)))$ where λ ≥ 0 is a scalar or vector hyperparameter. Once the KL dips below λ, there is no benefit, so the model must rely on z to some extent. This objective is still a valid upper bound on NLL(x) (albeit a looser one). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 83 / 119
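The last two fixes amount to a one-line change in the loss. A hedged sketch (whether λ is applied to the summed KL, as here, or per latent dimension varies across implementations):

```python
import torch

def vae_objective(nll, kl, step, anneal_steps=10000, free_bits=0.0):
    """nll, kl: per-example tensors of shape (batch,). Combines KL annealing with the max(lambda, KL) floor."""
    alpha = min(1.0, step / anneal_steps)           # KL annealing: alpha goes from 0 to 1 during training
    kl_floored = torch.clamp(kl, min=free_bits)     # max(lambda, KL): no reward for pushing KL below lambda
    return (nll + alpha * kl_floored).mean()

nll, kl = torch.rand(16) * 50.0, torch.rand(16) * 5.0
loss = vae_objective(nll, kl, step=2000, anneal_steps=10000, free_bits=1.0)
```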
  • 107. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 84 / 119
  • 108. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 85 / 119
  • 109. Compositional or Non-compositional Representation Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 86 / 119
  • 110. Compositional or Non-compositional Representation Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 87 / 119
  • 111. Compositional or Non-compositional Representation Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 88 / 119
  • 112. Compositional or Non-compositional Representation Such “hard” or “soft” non-compositionality exists at different granularities of text. We will discuss some models on how to handle this at the word-phrase level. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 88 / 119
  • 113. Compositional and Non-compositional Semantics Compositionality/non-compositionality is a common phenomenon in language. A framework that is able to consider both compositionality and non-compositionality is of interest. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 89 / 119
  • 114. Compositional and Non-compositional Semantics Compositionality/non-compositionality is a common phenomenon in language. A framework that is able to consider both compositionality and non-compositionality is of interest. A pragmatic viewpoint: If one is able to obtain holistically the representation of an n-gram or a phrase in text, it would be desirable that a composition model has the ability to decide the sources of knowledge it will use. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 89 / 119
  • 115. Compositional and Non-compositional Semantics Compositionality/non-compositionality is a common phenomenon in language. A framework that is able to consider both compositionality and non-compositionality is of interest. A pragmatic viewpoint: If one is able to obtain holistically the representation of an n-gram or a phrase in text, it would be desirable that a composition model has the ability to decide the sources of knowledge it will use. In addition to composition, considering non-compositionality may avoid unnecessarily back-propagating errors that confuse the word embeddings. Think about the “kick the bucket” example. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 89 / 119
  • 116. Integrating Compositional and Non-compositional Semantics Integrating non-compositionality in recursive networks (Zhu et al., 2015a): Basic idea: Enabling individual composition operations to choose information from different sources, compositional or non-compositional (e.g., holistically learned). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 90 / 119
  • 117. Integrating Compositional and Non-compositional Semantics Model 1: Regular bilinear merge (Zhu et al., 2015a): Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 91 / 119
  • 118. Integrating Compositional and Non-compositional Semantics Model 2: Tensor-based merging (Zhu et al., 2015a) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 92 / 119
  • 119. Integrating Compositional and Non-compositional Semantics Model 3: Explicitly gated merging (Zhu et al., 2015a): Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 93 / 119
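A rough reading of the explicitly gated variant (Model 3): a learned gate decides, dimension by dimension, how much to take from the compositionally built phrase vector and how much from the holistically learned (non-compositional) embedding of the same phrase. The sketch below is our minimal illustration of that idea, not the exact parameterisation of Zhu et al. (2015a):

```python
import torch
import torch.nn as nn

class GatedMerge(nn.Module):
    """Hypothetical gate mixing a composed phrase vector with a holistically learned phrase embedding."""
    def __init__(self, d=100):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)

    def forward(self, h_comp, h_holistic):
        g = torch.sigmoid(self.gate(torch.cat([h_comp, h_holistic], dim=-1)))
        return g * h_comp + (1.0 - g) * h_holistic    # per-dimension choice of knowledge source

merge = GatedMerge(d=100)
h_comp = torch.randn(4, 100)        # composed from the parts, e.g. "kick" + "the bucket"
h_holistic = torch.randn(4, 100)    # holistically learned embedding of "kick the bucket"
print(merge(h_comp, h_holistic).shape)   # torch.Size([4, 100])
```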
  • 120. Experiment Set-Up Task: sentiment analysis Data: Stanford Sentiment Treebank Non-compositional sentiment Sentiment of n-grams automatically learned from tweets (Mohammad et al., 2013). Polled the Twitter API every four hours from April to December 2012 in search of tweets with either a positive word hashtag or a negative word hashtag. 78 seed hashtags (32 positive and 36 negative) such as #good, #excellent, and #terrible were used to annotate sentiment. 775,000 tweets that contain at least one positive or negative hashtag were used as the learning corpus. Point-wise mutual information (PMI) is calculated for each bigram and trigram. Each sentiment score is converted to a one-hot vector; e.g. a bigram with a score of -1.5 will be assigned a 5-dimensional vector [0, 1, 0, 0, 0] (i.e., the second basis vector). The human annotations that come with the Stanford Sentiment Treebank are also used for bigrams and trigrams. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 94 / 119
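The conversion of a real-valued PMI sentiment score into the 5-dimensional one-hot vector can be sketched as follows; the bucket boundaries are illustrative assumptions, and only the example from the slide (score -1.5 maps to [0, 1, 0, 0, 0]) is taken from the source:

```python
def score_to_onehot(score, boundaries=(-2.0, -0.5, 0.5, 2.0)):
    """Map a real-valued sentiment score to a 5-dimensional one-hot bucket vector.

    The boundaries are made up for illustration; they split the score range into 5 bins,
    from very negative (bucket 0) to very positive (bucket 4).
    """
    bucket = sum(score > b for b in boundaries)
    vec = [0] * 5
    vec[bucket] = 1
    return vec

print(score_to_onehot(-1.5))   # [0, 1, 0, 0, 0], matching the slide's example
```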
  • 121. Results
Models | sentence-level (roots) | all phrases (all nodes)
(1) RNTN | 42.44 | 79.95
(2) Regular-bilinear (auto) | 42.37 | 79.97
(3) Regular-bilinear (manu) | 42.98 | 80.14
(4) Explicitly-gated (auto) | 42.58 | 80.06
(5) Explicitly-gated (manu) | 43.21 | 80.21
(6) Confined-tensor (auto) | 42.99 | 80.49
(7) Confined-tensor (manu) | 43.75† | 80.66†
Table: Model performances (accuracy) on predicting 5-category sentiment at the sentence (root) level and the phrase level. The results are based on version 3.3.0 of the Stanford CoreNLP. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 95 / 119
  • 122. Integrating Compositional and Non-compositional Semantics We have discussed integrating non-compositionality in recursive networks. What if no prior input structures are available? Remember we have discussed models that capture hidden structures. What if a syntactic parse tree is not very reliable, e.g., for data like social media text or speech transcripts? In these situations, how can we still consider non-compositionality in the composition process? Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 96 / 119
  • 123. Integrating Compositional and Non-compositional Semantics Integrating non-compositionality in chain recurrent networks (Zhu et al., 2016) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 97 / 119
  • 124. Integrating Compositional and Non-compositional Semantics Non-compositional nodes: Form the non-compositional paths (e.g., 3-8-9 or 4-5-9). Allow the embedding spaces of a non-compositional node to be different from those of a compositional node. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 98 / 119
  • 125. Integrating Compositional and Non-compositional Semantics Fork nodes: Summarizing history so far to support both compositional and non-compositional paths. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 99 / 119
  • 126. Integrating Compositional and Non-compositional Semantics Merging nodes: Combining information from compositional and non-compositional paths. Binarization Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 100 / 119
  • 127. Integrating Compositional and Non-compositional Semantics Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 101 / 119
  • 128. Integrating Compositional and Non-compositional Semantics Binarization: Binarizing the composition of in-bound paths (we do not worry too much about the order of merging). Now we do not need to design different nodes for different fan-ins; instead, parameters are shared across the whole network. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 101 / 119
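One way to picture the binarised merge: at a merging node, the two incoming LSTM states (one from a compositional path, one from a non-compositional path) are combined by a two-child composition, much like a binary Tree-LSTM cell. This is a hedged sketch of that idea, not the exact DAG-LSTM equations of Zhu et al. (2016):

```python
import torch
import torch.nn as nn

class BinaryMerge(nn.Module):
    """Merge two incoming (h, c) LSTM states Tree-LSTM style (illustrative, not the paper's exact cell)."""
    def __init__(self, d=100):
        super().__init__()
        self.W = nn.Linear(2 * d, 5 * d)   # input gate, output gate, two forget gates, candidate

    def forward(self, h1, c1, h2, c2):
        i, o, f1, f2, g = self.W(torch.cat([h1, h2], dim=-1)).chunk(5, dim=-1)
        i, o, f1, f2 = map(torch.sigmoid, (i, o, f1, f2))
        c = f1 * c1 + f2 * c2 + i * torch.tanh(g)      # keep memory from both incoming paths
        return o * torch.tanh(c), c

merge = BinaryMerge(d=100)
h_comp, c_comp = torch.randn(1, 100), torch.randn(1, 100)   # state arriving via a compositional path
h_holo, c_holo = torch.randn(1, 100), torch.randn(1, 100)   # state arriving via a non-compositional path
h, c = merge(h_comp, c_comp, h_holo, c_holo)
```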
  • 129. Results
Method | SemEval-13 | SemEval-14
Majority baseline | 29.19 | 34.46
Unigram (SVM) | 56.95 | 58.58
3rd best model | 64.86 | 69.95
2nd best model | 65.27 | 70.14
The best model | 69.02 | 70.96
DAG-LSTM | 70.88 | 71.97
Table: Performances of different models in the official evaluation metric (macro F-scores) on the test sets of SemEval-2013 and SemEval-2014 Sentiment Analysis in Twitter in predicting the sentiment of the tweet messages. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 102 / 119
  • 130. Results
Method | SemEval-13 | SemEval-14
DAG-LSTM: Full paths | 70.88 | 71.97
DAG-LSTM: Full – {autoPaths} | 69.36 | 69.27
DAG-LSTM: Full – {triPaths} | 70.16 | 70.77
DAG-LSTM: Full – {triPaths, biPaths} | 69.55 | 69.93
DAG-LSTM: Full – {manuPaths} | 69.88 | 70.58
LSTM without DAG: Full – {autoPaths, manuPaths} | 64.00 | 66.40
Table: Ablation performances (macro-averaged F-scores) of DAG-LSTM with different types of paths being removed. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 103 / 119
  • 131. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 104 / 119
  • 132. Subword Composition Composition can also be performed to learn representations for words from subword components (Botha et al., 2014; Ling et al., 2015; Luong et al., 2015; Kim et al., 2016; Sennrich et al., 2016). Rich morphology: some languages have larger vocabularies than others. Informal text: very coooooool! Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 105 / 119
  • 133. Subword Composition Composition can also be performed to learn representations for words from subword components (Botha et al., 2014; Ling et al., 2015; Luong et al., 2015; Kim et al., 2016; Sennrich et al., 2016). Rich morphology: some languages have larger vocabularies than others. Informal text: very coooooool! Basically, alleviate sparseness! Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 105 / 119
  • 134. Subword Composition Composition can also be performed to learn representations for words from subword components (Botha et al., 2014; Ling et al., 2015; Luong et al., 2015; Kim et al., 2016; Sennrich et al., 2016). Rich morphology: some languages have larger vocabularies than others. Informal text: very coooooool! Basically, alleviate sparseness! One perspective for viewing subword models: Morpheme-based composition: deriving word representations from morphemes. Character-based composition: deriving word representations from characters (pretty effective as well, even when used by itself!) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 105 / 119
  • 135. Subword Composition Composition can also be performed to learn representations for words from subword components (Botha et al., 2014; Ling et al., 2015; Luong et al., 2015; Kim et al., 2016; Sennrich et al., 2016). Rich morphology: some languages have larger vocabularies than others. Informal text: very coooooool! Basically, alleviate sparseness! One perspective for viewing subword models: Morpheme-based composition: deriving word representations from morphemes. Character-based composition: deriving word representations from characters (pretty effective as well, even when used by itself!) Another perspective (by model architectures): Recursive models Convolutional models Recurrent models Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 105 / 119
  • 136. Subword Composition Composition can also be performed to learn representations for words from subword components (Botha et al., 2014; Ling et al., 2015; Luong et al., 2015; Kim et al., 2016; Sennrich et al., 2016). Rich morphology: some languages have larger vocabularies than others. Informal text: very coooooool! Basically, alleviate sparseness! One perspective for viewing subword models: Morpheme-based composition: deriving word representations from morphemes. Character-based composition: deriving word representations from characters (pretty effective as well, even when used by itself!) Another perspective (by model architectures): Recursive models Convolutional models Recurrent models We will discuss several typical methods here only briefly. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 105 / 119
  • 137. Subword Composition: Recursive Networks Morphological Recursive Neural Networks (Luong et al., 2013): Extending recursive neural networks (Socher et al., 2011b) to learn word representation through composition over morphemes. Assume the availability of morphemic analyses. Each tree node combines a stem vector and an affix vector. Figure. Context insensitive (left) and sensitive (right) Morphological Recursive Neural Networks. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 106 / 119
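The context-insensitive variant can be written in a few lines: each morpheme has a vector, and a parent vector is built from a stem vector and an affix vector with one shared composition layer. A hedged sketch (vocabulary size, dimensions, and the example segmentation are assumptions; Luong et al. (2013) additionally propose the context-sensitive variant):

```python
import torch
import torch.nn as nn

class MorphRNN(nn.Module):
    """Compose a word vector from morpheme vectors with one shared recursive layer."""
    def __init__(self, d=50):
        super().__init__()
        self.W = nn.Linear(2 * d, d)

    def compose(self, parent_so_far, affix_vec):
        return torch.tanh(self.W(torch.cat([parent_so_far, affix_vec], dim=-1)))

morph_emb = nn.Embedding(5000, 50)      # morpheme vocabulary; the size is an assumption
model = MorphRNN(d=50)
un, fortunate, ly = (morph_emb(torch.tensor([i])) for i in (1, 2, 3))
# "unfortunately" composed recursively over an assumed segmentation un + fortunate + ly:
word_vec = model.compose(model.compose(un, fortunate), ly)
```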
  • 138. Subword Composition: Recurrent Networks Bi-directional LSTM for subword composition (Ling et al., 2015). Figure. Character RNN for sub-word composition. Some more details ... Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 107 / 119
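A minimal sketch of the character BiLSTM word model: run a bidirectional LSTM over the characters of a word and combine the two final states into the word representation. Dimensions and the final linear combination are assumptions in the spirit of Ling et al. (2015), not their exact configuration:

```python
import torch
import torch.nn as nn

class CharBiLSTMWord(nn.Module):
    """Build a word vector from its characters with a bidirectional LSTM."""
    def __init__(self, n_chars=128, d_char=32, d_word=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_char)
        self.bilstm = nn.LSTM(d_char, d_word, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * d_word, d_word)

    def forward(self, char_ids):                      # char_ids: (num_words, max_word_length)
        _, (h, _) = self.bilstm(self.char_emb(char_ids))
        fwd, bwd = h[0], h[1]                         # final forward and backward states
        return self.proj(torch.cat([fwd, bwd], dim=-1))

model = CharBiLSTMWord()
word_vecs = model(torch.randint(0, 128, (4, 12)))     # 4 words of up to 12 characters -> (4, 100)
```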
  • 139. Subword Composition: Convolutional Networks Convolutional neural networks for subword composition (Zhang et al., 2015) Figure. Character CNN for sub-word composition. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 108 / 119
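A hedged sketch of the character-CNN idea: embed characters, apply convolutions of a few widths, and max-pool over time. Filter widths and sizes are assumptions; Zhang et al. (2015) operate on much longer character sequences with several stacked convolution and pooling layers:

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Embed characters, convolve with several filter widths, and max-pool over time."""
    def __init__(self, n_chars=128, d_char=16, n_filters=64, widths=(3, 4, 5)):
        super().__init__()
        self.emb = nn.Embedding(n_chars, d_char)
        self.convs = nn.ModuleList(nn.Conv1d(d_char, n_filters, w) for w in widths)

    def forward(self, char_ids):                        # char_ids: (batch, num_characters)
        x = self.emb(char_ids).transpose(1, 2)          # (batch, d_char, num_characters)
        feats = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return torch.cat(feats, dim=-1)                 # (batch, n_filters * len(widths))

model = CharCNN()
rep = model(torch.randint(0, 128, (2, 60)))             # two character sequences -> (2, 192)
```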
  • 140. Subword Composition: Convolutional Networks Convolutional neural networks for subword composition (Zhang et al., 2015) Figure. Character CNN for sub-word composition. In general, subword models have been successfully used in a wide variety of problems such as translation, sentiment analysis, question answering, etc. You should seriously consider them in situations where the OOV rate is high or the word distribution has a long tail. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 108 / 119
  • 141. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 109 / 119
  • 142. Summary The tutorial discusses semantic composition with distributed representations learned with neural networks. Neural networks are able to learn powerful representations and complicated composition functions. The models can achieve state-of-the-art performance on a wide range of NLP tasks. We expect further studies will continue to deepen our understanding of such approaches: Unsupervised models Compositionality with other “ingredients” of intelligence Compositionality in multiple modalities Interpretability of models Distributed vs./and symbolic composition models ... ... Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 110 / 119
  • 143. References I C. E. Osgood, G. J. Suci, and P. H. Tannenbaum. The Measurement of Meaning. University of Illinois Press, 1957. Richard Montague. “English as a Formal Language”. In: Linguaggi nella societa e nella tecnica. Ed. by Bruno Visentini. Edizioni di Communita, 1970, pp. 188–221. G. A. Miller and P. N. Johnson-Laird. Language and perception. Cambridge, MA: Belknap Press, 1976. J. A. Fodor and Z. W. Pylyshyn. “Connectionism and cognitive architecture: A critical analysis”. In: Cognition 28 (1988), pp. 3–71. Jordan B. Pollack. “Recursive Distributed Representations”. In: Artif. Intell. 46.1-2 (1990), pp. 77–105. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 111 / 119
  • 144. References II Ronald J. Williams. “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning”. In: Machine Learning 8 (1992), pp. 229–256. Barbara Partee. “Lexical semantics and compositionality”. In: Invitation to Cognitive Science 1 (1995), pp. 311–360. Elie Bienenstock, Stuart Geman, and Daniel Potter. “Compositionality, MDL Priors, and Object Recognition”. In: NIPS. 1996. Enrico Francesconi et al. “Logo Recognition by Recursive Neural Networks”. In: GREC. 1997. Jeff Mitchell and Mirella Lapata. “Vector-based Models of Semantic Composition”. In: ACL. 2008, pp. 236–244. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 112 / 119
  • 145. References III Richard Socher et al. “Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection”. In: NIPS. 2011, pp. 801–809. Richard Socher et al. “Parsing Natural Scenes and Natural Language with Recursive Neural Networks”. In: ICML. 2011, pp. 129–136. Richard Socher et al. “Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions”. In: EMNLP. 2011. Richard Socher et al. “Semantic Compositionality through Recursive Matrix-Vector Spaces”. In: EMNLP-CoNLL. 2012, pp. 1201–1211. Nal Kalchbrenner and Phil Blunsom. “Recurrent Continuous Translation Models.”. In: EMNLP. Vol. 3. 39. 2013, p. 413. Diederik P. Kingma and Max Welling. “Auto-Encoding Variational Bayes”. In: CoRR abs/1312.6114 (2013). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 113 / 119
  • 146. References IV Thang Luong, Richard Socher, and Christopher D. Manning. “Better Word Representations with Recursive Neural Networks for Morphology”. In: CoNLL. 2013. Saif Mohammad, Svetlana Kiritchenko, and Xiao-Dan Zhu. “NRC-Canada: Building the State-of-the-Art in Sentiment Analysis of Tweets”. In: SemEval@NAACL-HLT. 2013. Richard Socher et al. “Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank”. In: EMNLP. 2013, pp. 1631–1642. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate”. In: arXiv preprint arXiv:1409.0473 (2014). Jan A. Botha and Phil Blunsom. “Compositional Morphology for Word Representations and Language Modelling”. In: ICML. 2014. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 114 / 119
  • 147. References V Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. “A convolutional neural network for modelling sentences”. In: arXiv preprint arXiv:1404.2188 (2014). Ilya Sutskever, Oriol Vinyals, and Quoc V Le. “Sequence to sequence learning with neural networks”. In: Advances in neural information processing systems. 2014, pp. 3104–3112. Xiaodan Zhu et al. “An Empirical Study on the Effect of Negation Words on Sentiment”. In: ACL. 2014. Ryan Kiros et al. “Skip-thought vectors”. In: Advances in neural information processing systems. 2015, pp. 3294–3302. Phong Le and Willem Zuidema. “Compositional Distributional Semantics with Long Short Term Memory”. In: *SEM@NAACL-HLT. 2015. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 115 / 119
  • 148. References VI Wang Ling et al. “Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation”. In: EMNLP. 2015. Thang Luong et al. “Addressing the Rare Word Problem in Neural Machine Translation”. In: ACL. 2015. Kai Sheng Tai, Richard Socher, and Christopher D. Manning. “Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks”. In: ACL. 2015, pp. 1556–1566. Xiang Zhang and Yann LeCun. “Text Understanding from Scratch”. In: CoRR abs/1502.01710 (2015). Xiaodan Zhu, Hongyu Guo, and Parinaz Sobhani. “Neural Networks for Integrating Compositional and Non-compositional Sentiment in Sentiment Composition”. In: *SEM@NAACL-HLT. 2015. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 116 / 119
  • 149. References VII Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. “Long Short-Term Memory Over Recursive Structures”. In: ICML. 2015, pp. 1604–1612. Samuel R Bowman et al. “A fast unified model for parsing and sentence understanding”. In: arXiv preprint arXiv:1603.06021 (2016). Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Nal Kalchbrenner et al. “Neural Machine Translation in Linear Time”. In: CoRR abs/1610.10099 (2016). Yoon Kim et al. “Character-Aware Neural Language Models”. In: AAAI. 2016. B. M. Lake et al. “Building Machines that Learn and Think Like People”. In: Behavioral and Brain Sciences. (in press). (2016). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 117 / 119
  • 150. References VIII Tsendsuren Munkhdalai and Hong Yu. “Neural Tree Indexers for Text Understanding”. In: CoRR abs/1607.04492 (2016). Rico Sennrich, Barry Haddow, and Alexandra Birch. “Neural Machine Translation of Rare Words with Subword Units”. In: ACL. 2016. Dani Yogatama et al. “Learning to Compose Words into Sentences with Reinforcement Learning”. In: CoRR abs/1611.09100 (2016). Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. “DAG-Structured Long Short-Term Memory for Semantic Compositionality”. In: NAACL. 2016. Qian Chen et al. “Enhanced LSTM for Natural Language Inference”. In: ACL. 2017. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 118 / 119
  • 151. References IX Adina Williams, Nikita Nangia, and Samuel R. Bowman. “A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference”. In: CoRR abs/1704.05426 (2017). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 119 / 119