Deep Learning for Semantic Composition
Xiaodan Zhu∗ & Edward Grefenstette†
∗National Research Council Canada
Queen’s University
zhu2048@gmail.com
†DeepMind
etg@google.com
July 30th, 2017
Outline
1 Introduction
Semantic composition
Formal methods
Simple parametric models
2 Parameterizing Composition Functions
Recurrent composition models
Recursive composition models
Convolutional composition models
Unsupervised models
3 Selected Topics
Compositionality and non-compositionality
Subword composition methods
4 Summary
Principle of Compositionality
Principle of compositionality: The meaning of a whole is a function of
the meaning of its parts.
While we focus on natural language, compositionality exists not just
in language.
Sound/music
Musical notes are composed with some regularity, not arranged randomly,
to form a song.
Vision
Natural scenes are composed of meaningful components.
Artificial visual art pieces often convey meaning with regularity
arising from their parts.
Principle of Compositionality
Compositionality is regarded by many as a fundamental component of
intelligence, in addition to language understanding (Miller et al., 1976;
Fodor et al., 1988; Bienenstock et al., 1996; Lake et al., 2016).
For example, Lake et al. (2016) emphasize several essential
ingredients for building machines that "learn and think like people":
Compositionality
Intuitive physics/psychology
Learning-to-learn
Causality models
Note that many of these challenges are present in natural language
understanding.
They are reflected in the sparseness encountered when training an NLP model.
Note also that compositionality may be entangled with the other
"ingredients" listed above.
Semantic Composition in Natural Language
good → very good → not very good → ...
Semantic Composition in Natural Language
Figure: Results from (Zhu et al., 2014). A dot in the figure corresponds to
a negated phrase (e.g., not very good) in the Stanford Sentiment Treebank
(Socher et al., 2013). The y-axis is its sentiment value and the x-axis the
sentiment of its argument.
Even a one-layer composition, over one dimension of meaning (e.g.,
semantic orientation (Osgood et al., 1957)), can be a complicated mapping.
Semantic Composition in Natural Language
good → very good → not very good → ...
senator → former senator → ...
basketball player → short basketball player → ...
giant → small giant → ...
empty/full → half empty/full → almost half empty/full → ...
(See more examples in (Partee, 1995).)
Semantic Composition in Natural Language
Semantic composition in natural language: the task of modelling the
meaning of a larger piece of text by composing the meaning of its
constituents.
modelling: learning a representation
The compositionality of language is very challenging, as discussed above.
Compositionality can be entangled with other challenges, such as those
emphasized in (Lake et al., 2016).
a larger piece of text: a phrase, sentence, or document.
constituents: subword components, words, phrases.
Introduction
Two key problems:
How to represent meaning?
How to learn such a representation?
Representation
Let's first very briefly revisit the representation we assume in this tutorial
... and leave the learning problem to the rest of the tutorial.
Love
Representation
love, admiration, satisfaction ...
anger, fear, hunger ...
Representation
A viewpoint from The Emotion Machine (Minsky, 2006):
Each variable responds to different concepts, and each concept is
represented by different variables.
This is exactly a distributed representation.
Modelling Composition Functions
How do we model the composition functions?
Representation
Deep Learning for Semantic Composition
Deep learning: We focus on deep learning models in this tutorial.
"Wait a minute, deep learning again?"
"DL people, leave language alone ..."
Asking some questions may be helpful:
Do deep learning models provide good function or density
approximation, which is essentially what many specific NLP tasks
seek? X → Y
Are continuous vector representations of meaning effective for (at
least some) NLP tasks? Are DL models convenient for computing
such continuous representations?
Do DL models naturally bridge language with other modalities in
terms of both representation and learning? (This could be important.)
Introduction
More questions:
What NLP problems (e.g., semantic problems here) can be better
handled with DL and what cannot?
Can NLP benefit from combining DL and other approaches (e.g.,
symbolic approaches)?
In general, is the effectiveness of DL models for semantics already
well understood?
Introduction
Deep Learning for Semantic Composition
Formal Semantics
Montague Semantics (1970–1973):
Treat natural language like a formal language via
an interpretation function [[. . .]], and
a mapping from CFG rules to function application order.
Interpretation of a sentence reduces to logical form via β-reduction.
High Level Idea
Syntax guides composition, types determine their semantics, predicate
logic does the rest.
Formal Semantics
Syntactic Analysis              Semantic Interpretation
S  ⇒ NP VP                      [[VP]]([[NP]])
NP ⇒ cats, milk, etc.           [[cats]], [[milk]], . . .
VP ⇒ Vt NP                      [[Vt]]([[NP]])
Vt ⇒ like, hug, etc.            λyx.[[like]](x, y), . . .

Derivation of "Cats like milk." (root to leaves):
[[like]]([[cats]], [[milk]])
   [[cats]]        λx.[[like]](x, [[milk]])
                   λyx.[[like]](x, y)        [[milk]]
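To make the function-application view concrete, here is a minimal Python sketch (not from the tutorial) of the toy grammar above, with ordinary functions standing in for the interpretations [[cats]], [[milk]], and [[like]]; the lexicon, the tuple-based logical forms, and the toy model WORLD are illustrative assumptions.

```python
# A minimal sketch of Montague-style composition: syntax guides function
# application, and a curried transitive verb builds a logical form.
# The lexicon entries and the toy "world" below are illustrative assumptions.

CATS, MILK = "cats", "milk"          # [[cats]], [[milk]]: constants
WORLD = {("like", "cats", "milk")}   # facts that hold in our toy model

def like(y):                         # [[like]] = lambda y: lambda x: like(x, y)
    return lambda x: ("like", x, y)

def interpret_vp(vt, np):            # VP => Vt NP  gives  [[Vt]]([[NP]])
    return vt(np)

def interpret_s(np, vp):             # S  => NP VP  gives  [[VP]]([[NP]])
    return vp(np)

logical_form = interpret_s(CATS, interpret_vp(like, MILK))
print(logical_form)                  # ('like', 'cats', 'milk')
print(logical_form in WORLD)         # evaluate truth in the toy model: True
```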
Formal Semantics
Pros:
Intuitive and interpretable(?) representations.
Leverage the power of predicate logic to model semantics.
Evaluate the truth of statements, derive conclusions, etc.
Cons:
Brittle, requires robust parsers.
Extensive logical model required for evaluation of clauses.
Extensive set of rules required to do anything useful.
Overall, an intractable (or unappealing) learning problem.
Simple Parametric Models
Basic models with pre-defined function form (Mitchell et al., 2008):
General form:    p = f(u, v, R, K)
Add:             p = u + v
WeightAdd:       p = αᵀu + βᵀv
Multiplicative:  p = u ⊗ v
Combined:        p = αᵀu + βᵀv + γᵀ(u ⊗ v)
We will see later in this tutorial that the above models could be seen as
special cases of more complicated composition models.
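As a rough NumPy sketch of the models above (not code from Mitchell & Lapata), with ⊗ read as element-wise multiplication and the weight vectors α, β, γ applied element-wise, which is one possible reading of the notation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
u, v = rng.normal(size=d), rng.normal(size=d)     # word vectors
alpha, beta, gamma = rng.normal(size=(3, d))      # composition parameters

def add(u, v):                # Add: p = u + v
    return u + v

def weight_add(u, v):         # WeightAdd: element-wise weighted sum
    return alpha * u + beta * v

def multiplicative(u, v):     # Multiplicative: element-wise product
    return u * v

def combined(u, v):           # Combined: weighted sum plus weighted product
    return alpha * u + beta * v + gamma * (u * v)

for f in (add, weight_add, multiplicative, combined):
    print(f.__name__, f(u, v))
```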
Results
Reference (R): The color ran.
High-similarity landmark (H): The color dissolved.
Low-similarity landmark (L): The color galloped.
A good composition model should give the R-H pair above a similarity score
higher than that given to the R-L pair. A good model should also assign
similarity scores that correlate highly (ρ) with human judgments.

Models      R-H similarity   R-L similarity   ρ
NonComp     0.27             0.26             0.08**
Add         0.59             0.59             0.04*
WeightAdd   0.35             0.34             0.09**
Kintsch     0.47             0.45             0.09**
Multiply    0.42             0.28             0.17**
Combined    0.38             0.28             0.19**
UpperBound  4.94             3.25             0.40**

Table: Mean cosine similarities for the R-H pairs and R-L pairs, as well as the
correlation coefficients (ρ) with human judgments (*: p < 0.05, **: p < 0.01).
Parameterizing Composition Functions
To move beyond simple algebraic or parametric models we need function
approximators which, ideally:
Can approximate any arbitrary function (e.g. ANNs).
Can cope with variable size sequences.
Can capture long range or unbounded dependencies.
Can implicitly or explicitly model structure.
Can be trained against a supervised or unsupervised objective (or
both — semi-supervised training).
Can be trained primarily through backpropagation.
A Neural Network Model Zoo
This section presents a selection of models satisfying some (if not all) of
these criteria.
Recurrent Neural Networks
Bounded Methods
Many methods impose explicit or implicit length limits on conditioning
information. For example:
order-n Markov assumption in NLM/LBL
fully-connected layers and dynamic pooling in conv-nets
Figure: A recurrent cell takes input wj and previous state hj−1, and produces
next state hj and output f(w1:j).
Recurrent Neural Networks introduce a repeatedly
composable unit, the recurrent cell, which both
models an unbounded sequence prefix and expresses a
function over it.
The Mathematics of Recurrence
Figure: A recurrent cell mapping inputs and the previous state to outputs
and the next state.
Building Blocks
An input vector wj ∈ R|w|
A previous state hj−1 ∈ R|h|
A next state hj ∈ R|h|
An output yj ∈ R|y|
fy : R|w| × R|h| → R|y|
fh : R|w| × R|h| → R|h|
Putting it together
hj = fh(wj , hj−1)
yj = fy (wj , hj )
So yj = fy (wj , fh(wj , hj−1)) = fy (wj , fh(wj , fh(wj−1, hj−2))) = . . .
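A minimal NumPy sketch of the recurrence above, with a tanh cell standing in for f_h and a linear read-out standing in for f_y; all dimensions and weight names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_w, d_h, d_y = 3, 5, 2                     # |w|, |h|, |y|
W_wh = rng.normal(scale=0.1, size=(d_h, d_w))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
W_hy = rng.normal(scale=0.1, size=(d_y, d_h + d_w))

def f_h(w_j, h_prev):                       # h_j = f_h(w_j, h_{j-1})
    return np.tanh(W_wh @ w_j + W_hh @ h_prev)

def f_y(w_j, h_j):                          # y_j = f_y(w_j, h_j)
    return W_hy @ np.concatenate([w_j, h_j])

h = np.zeros(d_h)                           # h_0
for w_j in rng.normal(size=(6, d_w)):       # a sequence of 6 input vectors
    h = f_h(w_j, h)
    y = f_y(w_j, h)
print("final state:", h)
print("final output:", y)
```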
RNNs for Language Modelling
Language modelling
We want to model the joint probability of tokens t1, . . . tn in a sequence:
P(t1, . . . , tn) = P(t1) ∏_{i=2}^{n} P(ti | t1, . . . , ti−1)
Adapting a recurrence for basic LM
For vocab V, define an embedding matrix E ∈ R|V |×|w| and a logit
projection matrix WV ∈ R|y|×|V |. Then:
wj = embed(tj , E)
yj = fy (wj , hj ) hj = fh(wj , hj−1)
pj = softmax(yj WV )
P(tj+1|t1, . . . , tj ) = Categorical(tj+1; pj )
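A self-contained sketch of one language-model step along these lines; for brevity the output y_j is taken to be the hidden state h_j itself, and the vocabulary, sizes, and weight names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d_w, d_h = 8, 4, 6                              # toy vocab and sizes
E = rng.normal(scale=0.1, size=(V, d_w))           # embedding matrix
W_V = rng.normal(scale=0.1, size=(d_h, V))         # logit projection
W_in = rng.normal(scale=0.1, size=(d_h, d_w + d_h))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

h = np.zeros(d_h)
tokens = [3, 1, 5]                                 # token ids t_1, t_2, t_3
for t in tokens:
    w = E[t]                                       # w_j = embed(t_j, E)
    h = np.tanh(W_in @ np.concatenate([w, h]))     # h_j = f_h(w_j, h_{j-1})
    p = softmax(h @ W_V)                           # p_j = softmax(y_j W_V)
print("P(t_4 | t_1..t_3):", p)                     # categorical over next token
```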
Aside: The Vanishing Gradient Problem and LSTM RNNs
An RNN is deep "in time", so it can
seriously suffer from the vanishing
gradient problem.
An LSTM adds memory cells and
multiple "gates" to control
information flow. If properly learned,
an LSTM can keep fairly long-distance
(hundreds of time steps) information
in memory.
Memory-cell details:
it = σ(Wxi xt + Whi ht−1 + Wci ct−1)
ft = σ(Wxf xt + Whf ht−1 + Wcf ct−1)
ct = ft ⊙ ct−1 + it ⊙ tanh(Wxc xt + Whc ht−1)
ot = σ(Wxo xt + Who ht−1 + Wco ct)
ht = ot ⊙ tanh(ct)
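A NumPy sketch of one step of the (peephole-style) memory cell above; biases are omitted to stay close to the slide, and all sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 3, 4
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, split into input / hidden / peephole parts.
W = {k: rng.normal(scale=0.1, size=(d_h, d_x)) for k in ("xi", "xf", "xc", "xo")}
U = {k: rng.normal(scale=0.1, size=(d_h, d_h)) for k in ("hi", "hf", "hc", "ho")}
p = {k: rng.normal(scale=0.1, size=d_h) for k in ("ci", "cf", "co")}

def lstm_step(x_t, h_prev, c_prev):
    i_t = sigmoid(W["xi"] @ x_t + U["hi"] @ h_prev + p["ci"] * c_prev)  # input gate
    f_t = sigmoid(W["xf"] @ x_t + U["hf"] @ h_prev + p["cf"] * c_prev)  # forget gate
    c_t = f_t * c_prev + i_t * np.tanh(W["xc"] @ x_t + U["hc"] @ h_prev)
    o_t = sigmoid(W["xo"] @ x_t + U["ho"] @ h_prev + p["co"] * c_t)     # output gate
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_x)):      # run over a short input sequence
    h, c = lstm_step(x_t, h, c)
print(h)
```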
Conditional Language Models
Conditional Language Modelling
A strength of RNNs is that hj can model not only the history of the
generated/observed sequence t1, . . . , tj , but also any conditioning
information β, e.g. by setting h0 = β.
Encoder-Decoder Models with RNNs
Figure: Source sequence "Les chiens aiment les os" transduced into the
target sequence "Dogs love bones </s>".
cf. Kalchbrenner et al., 2013; Sutskever et al., 2014
Model p(t1, . . . , tn | s1, . . . , sm):
h^e_i = RNN_encoder(si, h^e_{i−1})
h^d_i = RNN_decoder(ti, h^d_{i−1})
h^d_0 = h^e_m
t_{i+1} ∼ Categorical(t; fV(h^d_i))
The encoder RNN as a composition module
All information needed to transduce the source into the target sequence
using RNN_decoder needs to be present in the start state h^d_0.
This start state is produced by RNN_encoder, which will learn to compose.
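A compact sketch of the encoder's composition role: the encoder's final state initialises the decoder. The plain tanh recurrences and the greedy argmax decoding are illustrative stand-ins for RNN_encoder, RNN_decoder, and sampling from the Categorical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5                                            # shared state/embedding size
V_tgt = 7                                        # toy target vocabulary size
W_enc = rng.normal(scale=0.1, size=(d, 2 * d))
W_dec = rng.normal(scale=0.1, size=(d, 2 * d))
W_out = rng.normal(scale=0.1, size=(V_tgt, d))
E_tgt = rng.normal(scale=0.1, size=(V_tgt, d))

def step(W, x, h):                               # one tanh recurrence step
    return np.tanh(W @ np.concatenate([x, h]))

# Encoder: compose the source sequence into h^e_m.
source = rng.normal(size=(4, d))                 # 4 source word vectors
h_e = np.zeros(d)
for s_i in source:
    h_e = step(W_enc, s_i, h_e)

# Decoder: h^d_0 = h^e_m, then generate greedily.
h_d, t = h_e, 0                                  # token 0 plays the role of <s>
for _ in range(3):
    h_d = step(W_dec, E_tgt[t], h_d)
    t = int(np.argmax(W_out @ h_d))              # t_{i+1} from f_V(h^d_i)
    print("generated token id:", t)
```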
RNNs as Sentence Encoders
This idea of RNNs as sentence encoders works for classification as well:
Data is labelled sequences (s1, . . . , s|s|; ŷ).
An RNN is run over s to produce the final state h|s| = RNN(s).
A differentiable function of h|s| classifies: y = fθ(h|s|)
h|s| can be taken to be the composed meaning of s, with regard to
the task at hand.
An aside: Bi-directional RNN encoders
For both sequence classification and generation, sometimes a
Bi-directional RNN is used to encode:
h^←_i = RNN^←(si, h^←_{i+1})      h^→_i = RNN^→(si, h^→_{i−1})
h|s| = concat(h^←_1, h^→_|s|)
A Transduction Bottleneck
Single vector representation of
sentences causes problems:
Training focusses on learning a
marginal language model of the
target language first.
Longer input sequences cause
compressive loss.
Encoder gets significantly
diminished gradient.
In the words of Ray Mooney. . .
“You can’t cram the meaning of a whole %&!$ing sentence into a single
$&!*ing vector!” Yes, the censored-out swearing is copied verbatim.
Attention
Figure: Source sequence "Les chiens aiment les os", target sequence
"Dogs love bones </s>".
cf. Bahdanau et al., 2014
We want to use h^e_1, . . . , h^e_m when predicting ti, by conditioning on
words that might relate to ti:
1 Compute h^d_i (RNN update)
2 eij = f_att(h^d_i, h^e_j)
3 aij = softmax(e_i)_j
4 h^att_i = Σ_{j=1}^{m} aij h^e_j
5 ĥi = concat(h^d_i, h^att_i)
6 t_{i+1} ∼ Categorical(t; fV(ĥi))
The many faces of attention
Many variants on the above process: early attention (based on h^d_{i−1} and ti,
used to update h^d_i), different attentive functions f_att (e.g. based on
projected inner products, or MLPs), and so on.
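The six steps above in NumPy, with a plain dot product as one illustrative choice of f_att:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 4                                   # source length, state size
H_e = rng.normal(size=(m, d))                 # encoder states h^e_1 .. h^e_m
h_d = rng.normal(size=d)                      # current decoder state h^d_i

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

e_i = H_e @ h_d                               # (2) e_ij = f_att(h^d_i, h^e_j), dot product
a_i = softmax(e_i)                            # (3) a_ij = softmax(e_i)_j
h_att = a_i @ H_e                             # (4) h^att_i = sum_j a_ij h^e_j
h_hat = np.concatenate([h_d, h_att])          # (5) concat(h^d_i, h^att_i)
print("attention weights:", a_i)              # (6) feed h_hat to f_V for t_{i+1}
```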
Attention and Composition
We refer to the set of source activation vectors h^e_1, . . . , h^e_m in the previous
slides as an attention matrix. Is it a suitable sentence representation?
Pros:
Locally compositional: vectors contain information about other words
(especially with bi-directional RNN as encoder).
Variable size sentence representation: longer sentences yield larger
representation with more capacity.
Cons:
Single vector representation of sentences is convenient (many
decoders, classifiers, etc. expect fixed-width feature vectors as input)
Locally compositional, but are long range dependencies resolved in
the attention matrix? Does it truly express the sentence’s meaning as
a semantic unit (or is it just good for sequence transduction)?
Recursive Neural Networks
Recursive networks: a generalization of (chain) recurrent networks to a
computational graph that is often a tree (Pollack, 1990; Francesconi et al.,
1997; Socher et al., 2011a,b,c, 2013; Zhu et al., 2015b).
Recursive Neural Networks
Successfully applied to model structured input data:
Natural language processing (Socher et al., 2011a,c; Le et al., 2015;
Tai et al., 2015; Zhu et al., 2015b)
Computer vision (Socher et al., 2011b)
How to determine the structures?
Encode given "external" knowledge about the structure of the input
data, e.g., syntactic structures; modelling sentential semantics and
syntax is one of the most interesting problems in language.
Encode simply a complete tree.
Integrating Syntactic Parses in Composition
Recursive Neural Tensor Network (Socher et al., 2012):
The structure is given (here by a constituency parser).
Each node is implemented as a regular feed-forward layer plus a
3rd-order tensor.
The tensor captures 2nd-degree (quadratic) polynomial interactions of the
children, e.g., b_i^2, b_i c_j, and c_j^2.
Results
The models have been successfully applied to a number of tasks such as
sentiment analysis (Socher et al., 2013).
Table: Accuracy for fine grained (5-class) and binary predictions at the
sentence level (root) and for all nodes.
Tree-LSTM
Tree-structured LSTM (Le, *SEM-15; Tai, ACL-15; Zhu, ICML-15): an
extension of the chain LSTM to tree structures.
If you have a non-binary tree, a simple solution is to binarize it.
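A sketch of recursive composition over a binarized tree, with a plain tanh TreeRNN composer standing in for the gated Tree-LSTM cell; the toy vocabulary and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
E = {w: rng.normal(scale=0.5, size=d) for w in ["not", "very", "good"]}
W = rng.normal(scale=0.5, size=(d, 2 * d))

def compose(left, right):
    # A plain TreeRNN composer; a Tree-LSTM would add gates and a memory
    # cell per node, but the recursive wiring is the same.
    return np.tanh(W @ np.concatenate([left, right]))

def encode(tree):
    # A tree is either a word (leaf) or a (left, right) pair of subtrees.
    if isinstance(tree, str):
        return E[tree]
    left, right = tree
    return compose(encode(left), encode(right))

# Binarized parse of "not (very good)".
root = encode(("not", ("very", "good")))
print(root)
```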
Tree-LSTM Application: Sentiment Analysis
Sentiment composed over a constituency parse tree:
Tree-LSTM Application: Sentiment Analysis
Results on Stanford Sentiment Treebank (Zhu et al., 2015b):
Models roots phrases
NB 41.0 67.2
SVM 40.7 64.3
RvNN 43.2 79.0
RNTN 45.7 80.7
Tree-LSTM 48.9 81.9
Table: Performances (accuracy) of models on Stanford Sentiment Treebank, at
the sentence level (roots) and the phrase level.
Tree-LSTM Application: Natural Language Inference
Applied to Natural Language Inference (NLI): Determine if a sentence
entails another, if they contradict, or have no relation (Chen et al., 2017).
Tree-LSTM Application: Natural Language Inference
Accuracy on Stanford Natural Language Inference (SNLI) dataset:
(Chen et al., 2017)
* Welcome to the poster at 6:00-9:30pm on July 31.
Learning Representation for Natural Language Inference
RepEval-2017 Shared Task (Williams et al., 2017): Learn sentence
representation as a fixed-length vector.
Tree-LSTM without Syntactic Parses
What if we simply apply recursive networks over trees that are not
generated from syntactic parses, e.g., complete binary trees?
Multiple efforts on SNLI (Munkhdalai
et al., 2016; Chen et al., 2017) have
observed that such models outperform
sequential (chain) LSTMs.
This could be related to the discussion
that recursive nets may capture
long-distance dependencies (Goodfellow
et al., 2016).
SPINN: Doing Away with Test-Time Trees
Figure: Shift-reduce processing of "the cat sat down" over time steps
t = 0, . . . , T = 7: shift actions move words from the buffer to the stack,
reduce actions merge the top two stack entries, yielding (the cat) (sat down),
whose representation is output to the model for the semantic task.
Image credit: Sam Bowman and co-authors.
cf. Bowman et al., 2016
Shift-Reduce Parsers:
Exploit isomorphism between binary branching trees with T leaves
and sequences of 2T − 1 binary shift/reduce actions.
Shift unattached leaves from a buffer onto a processing stack.
Reduce the top two child nodes on the stack to a single parent node.
SPINN: Jointly train a TreeRNN and a vector-based shift-reduce parser.
Training time trees offer supervision for shift-reduce parser.
No need for test time trees!
SPINN: Doing Away with Test-Time Trees
Figure: The SPINN buffer, stack, composition, tracking, and transition
components during a reduce and a shift step.
Image credit: Sam Bowman and co-authors.
Word vectors start on buffer b (top: first word in sentence).
Shift moves word vectors from buffer to stack s.
Reduce pops the top two vectors off the stack, applies
f^R : R^d × R^d → R^d, and pushes the result back onto the stack
(i.e. TreeRNN composition).
The tracker LSTM tracks parser/composer state across operations,
decides shift-reduce operations a, and is supervised by both observed
shift-reduce operations and the end task:
h_t = LSTM(f^C(b_{t−1}[0], s_{t−1}[0], s_{t−1}[1]), h_{t−1})      a_t ∼ f^A(h_t)
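A sketch of composing a sentence by following a given shift/reduce action sequence, with a tanh reducer standing in for f^R; the tracker LSTM and the learned transition policy of SPINN are omitted, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
E = {w: rng.normal(scale=0.5, size=d) for w in ["the", "cat", "sat", "down"]}
W_r = rng.normal(scale=0.5, size=(d, 2 * d))

def reduce_fn(left, right):                    # f^R: R^d x R^d -> R^d
    return np.tanh(W_r @ np.concatenate([left, right]))

def compose(words, actions):
    buffer = [E[w] for w in words]             # leaves, first word on top
    stack = []
    for a in actions:
        if a == "shift":                       # move the next leaf onto the stack
            stack.append(buffer.pop(0))
        else:                                  # reduce the top two nodes
            right, left = stack.pop(), stack.pop()
            stack.append(reduce_fn(left, right))
    return stack[-1]                           # root representation

# 2T - 1 = 7 actions for T = 4 leaves: ((the cat) (sat down))
actions = ["shift", "shift", "reduce", "shift", "shift", "reduce", "reduce"]
print(compose(["the", "cat", "sat", "down"], actions))
```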
A Quick Introduction to Reinforce
What if some part of our process is not differentiable (e.g. samples from
the shift-reduce module in SPINN) but we want to learn with no labels. . .
p(y|x) = E_{pθ(z|x)}[fφ(z, x)]   s.t.  y ∼ fφ(z, x)  or  y = fφ(z, x)
∇φ p(y|x) = Σ_z pθ(z|x) ∇φ fφ(z, x) = E_{pθ(z|x)}[∇φ fφ(z, x)]
∇θ p(y|x) = Σ_z fφ(z, x) ∇θ pθ(z|x) = ???
A Quick Introduction to Reinforce
The Reinforce Trick (R. J. Williams, 1992)
∇θ log pθ(z|x) = ∇θ pθ(z|x) / pθ(z|x)   ⇒   ∇θ pθ(z|x) = pθ(z|x) ∇θ log pθ(z|x)

∇θ p(y|x) = Σ_z fφ(z, x) ∇θ pθ(z|x)
          = Σ_z fφ(z, x) pθ(z|x) ∇θ log pθ(z|x)
          = E_{pθ(z|x)}[fφ(z, x) ∇θ log pθ(z|x)]

This naturally extends to cases where p(z|x) = p(z1, . . . , zn|x).
RL vocab: samples of such sequences of discrete actions are referred to
as "traces". We often refer to pθ(z|x) as a policy πθ(z; x).
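A toy sketch of the resulting score-function estimator for a Bernoulli policy, compared against the exact gradient; the policy, reward, and sample size are illustrative and unrelated to SPINN.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                                   # p_theta(z=1) = theta (Bernoulli policy)
f = lambda z: 4.0 if z == 1 else 1.0          # reward f(z)

# Exact gradient of E[f(z)] = theta*4 + (1-theta)*1  ->  d/dtheta = 3
exact = 3.0

# Reinforce estimate: E_p[ f(z) * d/dtheta log p_theta(z) ]
samples = rng.random(100_000) < theta         # z ~ Bernoulli(theta)
score = np.where(samples, 1.0 / theta, -1.0 / (1.0 - theta))  # d log p / d theta
estimate = np.mean(np.where(samples, f(1), f(0)) * score)

print("exact:", exact, "reinforce estimate:", round(float(estimate), 3))
```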
SPINN+RL: Doing Away with Training-Time Trees
“Drop in” extension to SPINN (Yogatama et al., 2016):
Treat a_t ∼ f^A(h_t) as a policy π^A_θ(a_t; h_t), trained via Reinforce.
Reward is the negated loss of the end task, e.g. log-likelihood of the
correct label.
Everything else is trained by backpropagation against the end task:
tracker LSTM, representations, etc. receive gradient both from the
supervised objective, and from Reinforce via the shift-reduce policy.
Figure: Induced tree structures for "a woman wearing sunglasses is
frowning ." and "a boy drags his sleds through the snow ."
The model recovers linguistic-like structures (e.g. noun phrases, auxiliary
verb-verb pairing, etc.).
SPINN+RL: Doing Away with Training-Time Trees
Does RL-SPINN work? According to Yogatama et al. (2016):
Better than LSTM baselines: the model captures and exploits structure.
Better than SPINN benchmarks: the model is not biased by what
linguists think trees should be like; it only has a loose inductive bias
towards tree structures.
But some parses do not reflect order of composition (see below).
A semi-supervised setup may be sensible.
Figure: Induced trees for "two men are playing frisbee in the park ." and
"family members standing outside a home ."
Some "bad" parses, but not necessarily worse results.
Convolutional Neural Networks
Visual Inspiration: How do we learn to recognise pictures?
Will a fully connected neural network do the trick?
ConvNets for pictures
Problem: lots of variance that shouldn’t matter (position, rotation, skew,
difference in font/handwriting).
ConvNets for pictures
Solution: Accept that features are local. Search for local features with a
window.
ConvNets for pictures
The convolutional window acts as a classifier for local features.
ConvNets for pictures
Different convolutional maps can be trained to recognise different features
(e.g. edges, curves, serifs).
ConvNets for pictures
Stacked convolutional layers learn higher-level features.
Figure: Raw image → first-order local features → higher-order features →
prediction, via stacked convolutional layers and a fully connected layer.
One or more fully-connected layers learn a classification function over the
highest level of representation.
ConvNets for language
Convolutional neural networks fit natural language well.
Deep ConvNets capture:
Positional invariances
Local features
Hierarchical structure
Language has:
Some positional invariance
Local features (e.g. POS)
Hierarchical structure (phrases,
dependencies)
ConvNets for language
How do we go from images to sentences? Sentence matrices!
ConvNets for language
Does a convolutional window make sense for language?
ConvNets for language
A better solution: feature-specific windows.
Word Level Sentence Vectors with ConvNets
Figure: The Dynamic Convolutional Neural Network applied to the sentence
"game's the same, just got more fierce": a projected sentence matrix (s = 7),
wide convolution (m = 3), dynamic k-max pooling (k = f(s) = 5), wide
convolution (m = 2), folding, k-max pooling (k = 3), and a fully connected
layer.
cf. Kalchbrenner et al., 2014
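A sketch of the two core operations in the figure, a wide (row-wise) convolution over a sentence matrix followed by k-max pooling; folding and the dynamic choice of k are omitted, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, m, k = 4, 7, 3, 3                       # embedding dim, sentence length, filter width, k
S = rng.normal(size=(d, s))                   # sentence matrix (one column per word)
F = rng.normal(scale=0.5, size=(d, m))        # one 1-D filter per embedding row

def wide_conv(S, F):
    # Wide convolution: zero-pad so every filter position that overlaps the
    # sentence is kept, giving s + m - 1 output columns per row.
    pad = np.zeros((S.shape[0], F.shape[1] - 1))
    Sp = np.hstack([pad, S, pad])
    out = np.empty((S.shape[0], S.shape[1] + F.shape[1] - 1))
    for j in range(out.shape[1]):
        out[:, j] = np.sum(Sp[:, j:j + F.shape[1]] * F, axis=1)
    return out

def k_max_pool(C, k):
    # Keep, per row, the k largest values in their original order.
    idx = np.sort(np.argsort(C, axis=1)[:, -k:], axis=1)
    return np.take_along_axis(C, idx, axis=1)

C = np.tanh(wide_conv(S, F))                  # feature map, shape (d, s + m - 1)
P = k_max_pool(C, k)                          # pooled map, shape (d, k)
print(P.shape)
```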
Character Level Sentence Vectors with ConvNets
Image credit: Yoon Kim and co-authors.
cf. Kim et al., 2016
Naively, we could just represent
everything at character level.
Convolutions seem to work well
for low-level patterns
(e.g. morphology)
One interpretation: multiple
filters can capture the low-level
idiosyncrasies of natural
language (e.g. arbitrary spelling)
whereas language is more
compositional at a higher level.
ConvNet-like Architectures for Composition
Image credit: Nal Kalchbrenner and co-authors.
cf. Kalchbrenner et al., 2016
Many other CNN-like
architectures (e.g. ByteNet from
Kalchbrenner et al. (2016))
Common recipe components:
dilated convolutions and ResNet
blocks.
These model sequences well in
domains like speech, and are
beginning to find applications in
NLP, so worth reading up on.
Unsupervised Composition Models
Why care about unsupervised learning?
Much more unlabelled linguistic data than labelled data.
Learn general purpose representations and composition functions.
Suitable pre-training for supervised models, semi-supervised, or
multi-task objectives.
In the (paraphrased) words of Yann LeCun: unsupervised learning is a
cake, supervised learning is frosting, and RL is the cherry on top!
Plot twist: it’s possibly a cherry cake.
Yes, that’s nice. . . But what are we doing, concretely?
Good question! Usually, just modelling—directly or indirectly—some
aspect of the probability of the observed data.
Further suggestions on a postcard, please!
Autoencoders
Autoencoders provide an unsupervised method for representation learning:
We minimise an objective function over inputs xi, i ∈ N, and their
reconstructions x̂i:
    J = (1/2) Σ_{i=1}^{N} ||xi − x̂i||²
Warning: degenerate solution if the xi can be updated (∀i. xi = 0).
Recursive Autoencoders
cf. Socher et al., 2011a
To auto-encode variable length
sequences, we can chain
autoencoders to create a recursive
structure.
Objective Function
Minimizing the reconstruction error
will learn a compression function over
the inputs:
    E_rec(i, θ) = (1/2) ||xi − x̂i||²
A "modern" alternative: use a sequence-to-sequence model and a
log-likelihood objective.
What’s wrong with auto-encoders?
Empirically, narrow auto-encoders produce sharp latent codes, and
unregularised wide auto-encoders learn identity functions.
Reconstruction objective includes nothing about distance preservation
in latent space: no guarantee that
dist(a, b) ≤ dist(a, c)
→ dist(encode(a), encode(b)) ≤ dist(encode(a), encode(c))
Conversely, little incentive for similar latent codes to generate
radically different (but semantically equivalent) observations.
Ultimately, compression ≠ meaning.
Skip-Thought
Image credit: Jamie Kiros and co-authors.
cf. Kiros et al., 2015
Similar to the auto-encoding objective: encode a sentence, but decode the
neighbouring sentences.
Pair of LSTM-based seq2seq models with a shared encoder, but
alternative formulations are possible.
Conceptually similar to distributional semantics: a unit's
representation is a function of its neighbouring units, except the units
are sentences instead of words.
Variational Auto-Encoders
Semantically Weak Codes
Generally, auto-encoders sparsely encode or densely compress information.
No pressure to ensure similarity continuum amongst codes.
Factorized Generative Picture
p(x) = ∫ p(x, z) dz = ∫ p(x|z) p(z) dz = E_{p(z)}[p(x|z)]
(Graphical model: z ∼ N(0, I), z → x.)
Prior on z enforces semantic continuum (e.g. no arbitrarily unrelated codes
for similar data), but expectation is typically intractable to compute
exactly, and Monte Carlo estimate of gradients will be high variance.
Variational Auto-Encoders
Goal
Estimate, by maximising p(x):
The parameters θ of a function modelling part of the generative
process pθ(x|z) given samples from a fixed prior z ∼ p(z).
The parameters φ of a distribution qφ(z|x) approximating the true
posterior p(z|x).
How do we do it? We maximise p(x) via a variational lower bound (VLB):
    log p(x) ≥ E_{qφ(z|x)}[log pθ(x|z)] − D_KL(qφ(z|x) || p(z))
Equivalently we can minimise NLL(x):
    NLL(x) ≤ E_{qφ(z|x)}[NLLθ(x|z)] + D_KL(qφ(z|x) || p(z))
Variational Auto-Encoders
Let’s derive the VLB:
log p(x) = log ∫ 1 · pθ(x|z) p(z) dz
         = log ∫ (qφ(z|x) / qφ(z|x)) pθ(x|z) p(z) dz
         = log E_{qφ(z|x)}[(p(z) / qφ(z|x)) pθ(x|z)]
         ≥ E_{qφ(z|x)}[log (p(z) / qφ(z|x)) + log pθ(x|z)]     (Jensen's inequality)
         = E_{qφ(z|x)}[log pθ(x|z)] − D_KL(qφ(z|x) || p(z))
For the right qφ(z|x) and p(z) (e.g. Gaussians) there is a closed-form
expression for D_KL(qφ(z|x) || p(z)).
Variational Auto-Encoders
The problem of stochastic gradients
Estimating ∂/∂φ E_{qφ(z|x)}[log pθ(x|z)] requires backpropagating through
samples z ∼ qφ(z|x). For some choices of q, such as Gaussians, there are
reparameterization tricks (cf. Kingma et al., 2013).
Reparameterizing Gaussians (Kingma et al., 2013)
    z ∼ N(z; µ, σ²)   is equivalent to   z = µ + σ ⊙ ε,  where ε ∼ N(ε; 0, I)
Trivially:
    ∂z/∂µ = 1        ∂z/∂σ = ε
Variational Auto-Encoders for Sentences
1 Observe a sentence w1, . . . , wn. Encode it, e.g. with an LSTM:
h^e = LSTM_e(w1, . . . , wn)
2 Predict µ = f^µ(h^e) and σ² = f^σ(h^e) (in practice we operate in log
space for σ² by determining log σ).
3 Sample z ∼ q(z|x) = N(z; µ, σ²)
4 Use a conditional RNN to decode and measure log p(x|z). Use the
closed-form formula for the KL divergence of two Gaussians to calculate
−D_KL(qφ(z|x) || p(z)). Add both to obtain the maximisation objective.
5 Backpropagate the gradient through the decoder normally, based on the
log-likelihood component of the objective, and use the reparameterisation
trick to backpropagate through the sampling operation back to the encoder.
6 The gradient of the KL divergence component of the loss with regard to
the encoder parameters is straightforward backpropagation.
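The six steps in a heavily simplified NumPy sketch: a mean-pooling "encoder" and a unigram "decoder" stand in for the LSTMs so the focus stays on the reparameterised sample and the closed-form Gaussian KL; everything here is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_h, d_z = 10, 6, 3                            # vocab, hidden, latent sizes
E = rng.normal(scale=0.1, size=(V, d_h))
W_mu = rng.normal(scale=0.1, size=(d_z, d_h))
W_ls = rng.normal(scale=0.1, size=(d_z, d_h))
W_dec = rng.normal(scale=0.1, size=(V, d_z))

tokens = np.array([1, 4, 7, 2])                   # observed sentence w_1..w_n

h_e = E[tokens].mean(axis=0)                      # (1) "encode" (mean pool for brevity)
mu, log_sigma = W_mu @ h_e, W_ls @ h_e            # (2) predict mu and log sigma
eps = rng.normal(size=d_z)
z = mu + np.exp(log_sigma) * eps                  # (3) reparameterised sample

logits = W_dec @ z                                # (4) decode: here a unigram model
log_probs = logits - np.log(np.exp(logits).sum())
nll = -log_probs[tokens].sum()                    # reconstruction term

# Closed-form KL( N(mu, sigma^2) || N(0, I) )
kl = 0.5 * np.sum(np.exp(2 * log_sigma) + mu**2 - 1.0 - 2 * log_sigma)

print("NLL:", float(nll), " KL:", float(kl), " bound:", float(nll + kl))
```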
Variational Auto-Encoders and Autoregressivity
The problem of powerful auto-regressive decoders
We want to minimise NLL(x) ≤ E_{q(z|x)}[NLL(x|z)] + D_KL(q(z|x) || p(z)).
What if the decoder is powerful enough to model x without using z?
A degenerate solution:
If z can be ignored when minimising the reconstruction loss of x given
z, the model can safely let q(z|x) collapse to the prior p(z) to
minimise D_KL(q(z|x) || p(z)).
Since q need not depend on x (e.g. the encoder can just ignore x and
predict the mean and variance of the prior), z bears no relation to x.
Result: useless encoder, useless latent variable.
Is this really a problem?
If your decoder is not auto-regressive (e.g. MLPs expressing the probability
of pixels which are conditionally independent given z), then no.
If your decoder is an RNN and domain has systematic patterns, then yes.
Variational Auto-Encoders and Autoregressivity
What are some solutions to this problem?
Pick a non-autoregressive decoder. If you care more about the latent
code than having a good generative model (e.g. document modelling),
this isn’t a bad idea, but frustrating if this is the only solution.
KL Annealing: set E_{q(z|x)}[NLL(x|z)] + α D_KL(q(z|x) || p(z)) as the
objective. Start with α = 0 (basic seq2seq model). Increase α to 1
over time during training. Works somewhat, but is an unprincipled
change of the objective function.
Set as objective E_{q(z|x)}[NLL(x|z)] + max(λ, D_KL(q(z|x) || p(z))), where
λ ≥ 0 is a scalar or vector hyperparameter. Once the KL dips below λ,
there is no benefit, so the model must rely on z to some extent. This
objective is still a valid upper bound on NLL(x) (albeit a looser one).
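A short sketch of the last two remedies, given per-example reconstruction and KL terms; the numbers are made up for illustration.

```python
import numpy as np

nll = np.array([42.0, 37.5, 40.1])        # per-sentence reconstruction NLL (illustrative)
kl = np.array([0.4, 2.3, 0.1])            # per-sentence KL(q(z|x) || p(z))

def annealed(nll, kl, alpha):             # KL annealing: alpha goes 0 -> 1 during training
    return (nll + alpha * kl).mean()

def free_bits(nll, kl, lam):              # max(lambda, KL): no benefit below lambda
    return (nll + np.maximum(lam, kl)).mean()

print(annealed(nll, kl, alpha=0.1))
print(free_bits(nll, kl, lam=1.0))
```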
Compositional or Non-compositional Representation
Such "hard" or "soft" non-compositionality exists at different
granularities of text.
We will discuss some models on how to handle this at the
word-phrase level.
Compositional and Non-compositional Semantics
Compositionality and non-compositionality are both common phenomena
in language.
A framework that is able to consider both compositionality and
non-compositionality is of interest.
A pragmatic viewpoint: if one is able to obtain holistically the
representation of an n-gram or a phrase in text, it would be desirable
for a composition model to have the ability to decide which sources of
knowledge it will use.
In addition to composition, considering non-compositionality may
avoid back-propagating errors unnecessarily and confusing the word
embeddings.
Think about the "kick the bucket" example.
Integrating Compositional and Non-compositional
Semantics
Integrating non-compositionality in recursive networks (Zhu et al., 2015a):
Basic idea: enable individual composition operations to choose
information from different sources, compositional or
non-compositional (e.g., holistically learned).
Integrating Compositional and Non-compositional Semantics
Model 1: Regular bilinear merge (Zhu et al., 2015a)
Model 2: Tensor-based merging (Zhu et al., 2015a)
Model 3: Explicitly gated merging (Zhu et al., 2015a)
Experiment Set-Up
Task: sentiment analysis
Data: Stanford Sentiment Treebank
Non-compositional sentiment
Sentiment of ngrams automatically learned from tweets (Mohammad
et al., 2013).
Polled the Twitter API every four hours from April to December 2012
in search of tweets with either a positive word hashtag or a negative
word hashtag.
Using 78 seed hashtags (32 positive and 36 negative) such as #good,
#excellent, and #terrible to annotate sentiment.
775,000 tweets that contain at least a positive hashtag or a negative
hashtag were used as the learning corpus.
Point-wise mutual information (PMI) is calculated for each bigram
and trigram.
Each sentiment score is converted to a one-hot vector; e.g. a bigram
with a score of -1.5 will be assigned a 5-dimensional vector [0, 1, 0, 0,
0] (i.e., the e vector).
Also using the human annotation that comes with the Stanford Sentiment
Treebank for bigrams and trigrams.
Results
Models sentence-level (roots) all phrases (all nodes)
(1) RNTN 42.44 79.95
(2) Regular-bilinear (auto) 42.37 79.97
(3) Regular-bilinear (manu) 42.98 80.14
(4) Explicitly-gated (auto) 42.58 80.06
(5) Explicitly-gated (manu) 43.21 80.21
(6) Confined-tensor (auto) 42.99 80.49
(7) Confined-tensor (manu) 43.75† 80.66†
Table: Model performances (accuracy) on predicting 5-category sentiment at the
sentence (root) level and phrase level.
(The results are based on version 3.3.0 of Stanford CoreNLP.)
Integrating Compositional and Non-compositional
Semantics
We have discussed integrating non-compositionality in recursive
networks.
What if there are no prior input structures available?
Recall the models we discussed that capture hidden structures.
What if a syntactic parse tree is not very reliable?
e.g., for data like social media text or speech transcripts.
In these situations, how can we still consider non-compositionality in
the composition process?
Integrating Compositional and Non-compositional
Semantics
Integrating non-compositionality in chain recurrent networks (Zhu et al.,
2016)
Integrating Compositional and Non-compositional
Semantics
Non-compositional nodes:
Form the non-compositional paths (e.g., 3-8-9 or 4-5-9).
Allow the embedding spaces of a non-compositional node to be
different from those of a compositional node.
Integrating Compositional and Non-compositional
Semantics
Fork nodes:
Summarize the history so far to
support both compositional and
non-compositional paths.
Integrating Compositional and Non-compositional
Semantics
Merging nodes:
Combine information from compositional and non-compositional paths.
Binarization
Integrating Compositional and Non-compositional
Semantics
Binarization:
Binarize the composition of in-bound
paths (we do not worry too much about
the order of merging).
Now we do not need to design different
nodes for different fan-ins; instead,
parameters are shared across the whole network.
Results
Method SemEval-13 SemEval-14
Majority baseline 29.19 34.46
Unigram (SVM) 56.95 58.58
3rd best model 64.86 69.95
2nd best model 65.27 70.14
The best model 69.02 70.96
DAG-LSTM 70.88 71.97
Table: Performances of different models in official evaluation metric (macro
F-scores) on the test sets of SemEval-2013 and SemEval-2014 Sentiment Analysis
in Twitter in predicting the sentiment of the tweet messages.
Results
Method SemEval-13 SemEval-14
DAG-LSTM
Full paths 70.88 71.97
Full – {autoPaths} 69.36 69.27
Full – {triPaths} 70.16 70.77
Full – {triPaths, biPaths} 69.55 69.93
Full – {manuPaths} 69.88 70.58
LSTM without DAG
Full – {autoPaths,manuPaths} 64.00 66.40
Table: Ablation performances (macro-averaged F-scores) of DAG-LSTM with
different types of paths being removed.
Subword Composition
Composition can also be performed to learn representations for words
from subword components (Botha et al., 2014; Ling et al., 2015;
Luong et al., 2015; Kim et al., 2016; Sennrich et al., 2016).
Rich morphology: some languages have larger vocabularies than others.
Informal text: very coooooool!
The basic goal is to alleviate sparseness.
One way to categorize subword models is by the units being composed:
Morpheme-based composition: deriving word representations from
morphemes.
Character-based composition: deriving word representations from
characters (quite effective as well, even when used on its own).
Another way is by model architecture:
Recursive models
Convolutional models
Recurrent models
We only briefly discuss several representative methods here (a minimal
byte-pair-encoding sketch follows below).
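As a concrete example of obtaining subword units, byte-pair encoding (Sennrich et al., 2016) repeatedly merges the most frequent adjacent symbol pair; the sketch below shows the merge-learning loop over a toy word-frequency dictionary (the vocabulary and the number of merges are illustrative):

import collections
import re

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every free-standing occurrence of the pair into one symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Toy corpus: words are pre-split into characters plus an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(10):  # the number of merges is a hyperparameter
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print("merged:", best)

At application time, the learned merge operations are replayed in order on new words to segment them into subword units.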
Subword Composition: Recursive Networks
Morphological Recursive Neural Networks (Luong et al., 2013):
Extending recursive neural networks (Socher et al., 2011b) to learn word
representations through composition over morphemes.
Assumes the availability of morphemic analyses.
Each tree node combines a stem vector and an affix vector.
Figure: Context-insensitive (left) and context-sensitive (right) morphological
recursive neural networks.
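As a rough illustration only (not the exact formulation of Luong et al., 2013), the sketch below composes a word vector from morpheme vectors with a single shared recursive unit; the morpheme vocabulary, dimensionality, and tree shape are hypothetical:

import numpy as np

rng = np.random.default_rng(0)
dim = 4
# Hypothetical morpheme embeddings; real models learn these jointly.
morphemes = ["un", "fortunate", "ly"]
emb = {m: rng.normal(scale=0.1, size=dim) for m in morphemes}

# A single composition function shared across all tree nodes.
W = rng.normal(scale=0.1, size=(dim, 2 * dim))
b = np.zeros(dim)

def compose(left, right):
    """One recursive-network node: combine two child vectors into a parent."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# Compose "unfortunately" over its morphemic analysis, ((un fortunate) ly);
# in practice the tree shape comes from a morphological analyzer.
word_vec = compose(compose(emb["un"], emb["fortunate"]), emb["ly"])
print(word_vec.shape)  # (4,)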
Subword Composition: Recurrent Networks
Bi-directional LSTM for subword composition (Ling et al., 2015).
Figure: Character RNN for subword composition.
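A minimal sketch of the idea, assuming a bare single-layer LSTM (no biases or peephole connections); here the final forward and backward states are simply concatenated, whereas Ling et al. (2015) combine them through learned projections:

import numpy as np

rng = np.random.default_rng(0)
char_dim, hid = 8, 16
char_emb = {c: rng.normal(scale=0.1, size=char_dim)
            for c in "abcdefghijklmnopqrstuvwxyz"}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_params():
    # One weight matrix per gate, acting on [input; previous hidden state].
    return {g: rng.normal(scale=0.1, size=(hid, char_dim + hid)) for g in "ifoc"}

def lstm_final_state(params, xs):
    """Run a single-layer LSTM over a list of vectors and return the last h."""
    h, c = np.zeros(hid), np.zeros(hid)
    for x in xs:
        z = np.concatenate([x, h])
        i, f, o = (sigmoid(params[g] @ z) for g in "ifo")
        c = f * c + i * np.tanh(params["c"] @ z)
        h = o * np.tanh(c)
    return h

fwd, bwd = lstm_params(), lstm_params()

def word_vector(word):
    xs = [char_emb[ch] for ch in word]
    # Read the characters left-to-right and right-to-left, then concatenate.
    return np.concatenate([lstm_final_state(fwd, xs),
                           lstm_final_state(bwd, xs[::-1])])

print(word_vector("coooooool").shape)  # (32,)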
Subword Composition: Convolutional Networks
Convolutional neural networks for subword composition (Zhang et al.,
2015).
Figure: Character CNN for subword composition.
In general, subword models have been successfully used in a wide
variety of problems, such as machine translation, sentiment analysis, and
question answering.
You should seriously consider them in situations where the OOV rate is
high or the word distribution has a long tail.
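A minimal sketch of a character-CNN word encoder with a single filter width (real models use many filters of several widths followed by pooling, often with further layers on top); the dimensions and nonlinearity are illustrative:

import numpy as np

rng = np.random.default_rng(0)
char_dim, n_filters, width = 8, 5, 3
char_emb = {c: rng.normal(scale=0.1, size=char_dim)
            for c in "abcdefghijklmnopqrstuvwxyz"}

# Each filter looks at a window of `width` consecutive character embeddings.
filters = rng.normal(scale=0.1, size=(n_filters, width * char_dim))
bias = np.zeros(n_filters)

def word_vector(word):
    """Convolve the filters over the character sequence, then max-pool over time.
    Assumes len(word) >= width; shorter words would need padding."""
    X = np.stack([char_emb[ch] for ch in word])                 # (len, char_dim)
    windows = [X[i:i + width].ravel() for i in range(len(word) - width + 1)]
    feats = np.tanh(np.stack(windows) @ filters.T + bias)       # (n_windows, n_filters)
    return feats.max(axis=0)                                    # (n_filters,)

print(word_vector("coooooool"))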
Outline
1 Introduction
Semantic composition
Formal methods
Simple parametric models
2 Parameterizing Composition Functions
Recurrent composition models
Recursive composition models
Convolutional composition models
Unsupervised models
3 Selected Topics
Compositionality and non-compositionality
Subword composition methods
4 Summary
Summary
This tutorial has discussed semantic composition with distributed
representations learned with neural networks.
Neural networks are able to learn powerful representations and
complicated composition functions.
These models can achieve state-of-the-art performance on a wide
range of NLP tasks.
We expect further studies to continue to deepen our understanding
of such approaches:
Unsupervised models
Compositionality with other “ingredients” of intelligence
Compositionality across multiple modalities
Interpretability of models
Distributed vs./and symbolic composition models
...
References
C. E. Osgood, G. J. Suci, and P. H. Tannenbaum. The
Measurement of Meaning. University of Illinois Press, 1957.
Richard Montague. “English as a Formal Language”. In: Linguaggi
nella società e nella tecnica. Ed. by Bruno Visentini. Edizioni di
Comunità, 1970, pp. 188–221.
G. A. Miller and P. N. Johnson-Laird. Language and perception.
Cambridge, MA: Belknap Press, 1976.
J. A. Fodor and Z. W. Pylyshyn. “Connectionism and cognitive
architecture: A critical analysis”. In: Cognition 28 (1988), pp. 3–71.
Jordan B. Pollack. “Recursive Distributed Representations”. In:
Artif. Intell. 46.1-2 (1990), pp. 77–105.
Ronald J. Williams. “Simple Statistical Gradient-Following
Algorithms for Connectionist Reinforcement Learning”. In: Machine
Learning 8 (1992), pp. 229–256.
Barbara Partee. “Lexical semantics and compositionality”. In:
Invitation to Cognitive Science 1 (1995), pp. 311–360.
Elie Bienenstock, Stuart Geman, and Daniel Potter.
“Compositionality, MDL Priors, and Object Recognition”. In: NIPS.
1996.
Enrico Francesconi et al. “Logo Recognition by Recursive Neural
Networks”. In: GREC. 1997.
Jeff Mitchell and Mirella Lapata. “Vector-based Models of Semantic
Composition”. In: ACL. 2008, pp. 236–244.
Richard Socher et al. “Dynamic Pooling and Unfolding Recursive
Autoencoders for Paraphrase Detection”. In: NIPS. 2011,
pp. 801–809.
Richard Socher et al. “Parsing Natural Scenes and Natural Language
with Recursive Neural Networks”. In: ICML. 2011, pp. 129–136.
Richard Socher et al. “Semi-Supervised Recursive Autoencoders for
Predicting Sentiment Distributions”. In: EMNLP. 2011.
Richard Socher et al. “Semantic Compositionality through Recursive
Matrix-Vector Spaces”. In: EMNLP-CoNLL. 2012, pp. 1201–1211.
Nal Kalchbrenner and Phil Blunsom. “Recurrent Continuous
Translation Models”. In: EMNLP. 2013.
Diederik P. Kingma and Max Welling. “Auto-Encoding Variational
Bayes”. In: CoRR abs/1312.6114 (2013).
Thang Luong, Richard Socher, and Christopher D. Manning.
“Better Word Representations with Recursive Neural Networks for
Morphology”. In: CoNLL. 2013.
Saif Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu.
“NRC-Canada: Building the State-of-the-Art in Sentiment Analysis
of Tweets”. In: SemEval@NAACL-HLT. 2013.
Richard Socher et al. “Recursive Deep Models for Semantic
Compositionality Over a Sentiment Treebank”. In: EMNLP. 2013,
pp. 1631–1642.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural
machine translation by jointly learning to align and translate”. In:
arXiv preprint arXiv:1409.0473 (2014).
Jan A. Botha and Phil Blunsom. “Compositional Morphology for
Word Representations and Language Modelling”. In: ICML. 2014.
Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. “A
convolutional neural network for modelling sentences”. In: arXiv
preprint arXiv:1404.2188 (2014).
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. “Sequence to
sequence learning with neural networks”. In: Advances in neural
information processing systems. 2014, pp. 3104–3112.
Xiaodan Zhu et al. “An Empirical Study on the Effect of Negation
Words on Sentiment”. In: ACL. 2014.
Ryan Kiros et al. “Skip-thought vectors”. In: Advances in neural
information processing systems. 2015, pp. 3294–3302.
Phong Le and Willem Zuidema. “Compositional Distributional
Semantics with Long Short Term Memory”. In:
*SEM@NAACL-HLT. 2015.
Wang Ling et al. “Finding Function in Form: Compositional
Character Models for Open Vocabulary Word Representation”. In:
EMNLP. 2015.
Thang Luong et al. “Addressing the Rare Word Problem in Neural
Machine Translation”. In: ACL. 2015.
Kai Sheng Tai, Richard Socher, and Christopher D. Manning.
“Improved Semantic Representations From Tree-Structured Long
Short-Term Memory Networks”. In: ACL. 2015, pp. 1556–1566.
Xiang Zhang and Yann LeCun. “Text Understanding from Scratch”.
In: CoRR abs/1502.01710 (2015).
Xiaodan Zhu, Hongyu Guo, and Parinaz Sobhani. “Neural Networks
for Integrating Compositional and Non-compositional Sentiment in
Sentiment Composition”. In: *SEM@NAACL-HLT. 2015.
Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. “Long Short-Term
Memory Over Recursive Structures”. In: ICML. 2015,
pp. 1604–1612.
Samuel R Bowman et al. “A fast unified model for parsing and
sentence understanding”. In: arXiv preprint arXiv:1603.06021
(2016).
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning.
MIT Press, 2016.
Nal Kalchbrenner et al. “Neural Machine Translation in Linear
Time”. In: CoRR abs/1610.10099 (2016).
Yoon Kim et al. “Character-Aware Neural Language Models”. In:
AAAI. 2016.
B. M. Lake et al. “Building Machines that Learn and Think Like
People”. In: Behavioral and Brain Sciences (2016). In press.
Tsendsuren Munkhdalai and Hong Yu. “Neural Tree Indexers for
Text Understanding”. In: CoRR abs/1607.04492 (2016).
Rico Sennrich, Barry Haddow, and Alexandra Birch. “Neural
Machine Translation of Rare Words with Subword Units”. In: ACL.
2016.
Dani Yogatama et al. “Learning to Compose Words into Sentences
with Reinforcement Learning”. In: CoRR abs/1611.09100 (2016).
Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. “DAG-Structured
Long Short-Term Memory for Semantic Compositionality”. In:
NAACL. 2016.
Qian Chen et al. “Enhanced LSTM for Natural Language Inference”.
In: ACL. 2017.
Adina Williams, Nikita Nangia, and Samuel R. Bowman. “A
Broad-Coverage Challenge Corpus for Sentence Understanding
through Inference”. In: CoRR abs/1704.05426 (2017).
Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th
, 2017 119 / 119

More Related Content

PPTX
Needs analysis part 2
PPTX
ESP-APPROACH NOT PRODUCT
PPTX
Course planning and syllabus design
PPTX
Syllabus desing
PPT
Language Learning Strategies
PPTX
Chapter no. 4
PPTX
Mixture varieties
PPTX
Task based syllabus
Needs analysis part 2
ESP-APPROACH NOT PRODUCT
Course planning and syllabus design
Syllabus desing
Language Learning Strategies
Chapter no. 4
Mixture varieties
Task based syllabus

What's hot (20)

PPTX
Genre based approach
PPT
The grammar translation method
PDF
Applied linguisticss
PPTX
Forensic linguistics
PPTX
Corpus annotation for corpus linguistics (nov2009)
PPTX
Structural syllabusppw
PPTX
Lecture 1 Materials Development and Adaptation
PPTX
Syllabus format
PPTX
Larry selinker's interlanguage
PPTX
Computational linguistics
PPTX
Content Based Instruction
PPTX
Evaluation in ESP
PDF
DEVELOPING LISTENING MATERIAL
PPTX
Needs analysis
PPTX
Language descriptions
PPTX
Discourse Analysis
PPTX
Corpus linguistics
PPT
Transformational grammar
PPTX
CALL and SLA
Genre based approach
The grammar translation method
Applied linguisticss
Forensic linguistics
Corpus annotation for corpus linguistics (nov2009)
Structural syllabusppw
Lecture 1 Materials Development and Adaptation
Syllabus format
Larry selinker's interlanguage
Computational linguistics
Content Based Instruction
Evaluation in ESP
DEVELOPING LISTENING MATERIAL
Needs analysis
Language descriptions
Discourse Analysis
Corpus linguistics
Transformational grammar
CALL and SLA
Ad

Similar to Deep Learning for Semantic Composition (20)

PDF
Prep Teachers SSP Handbook 2015
PPTX
Su 2012 ss syntax(1)
PPTX
PPTX
Introduction to linguistics syntax
PPT
Generative grammar
PDF
Vocabulary by atheer
DOC
CHAPTER I.doc
PPT
Analyzing language complexity of Chinese and African Learners
PPTX
Lexical Teaching
DOC
Ngu phap
PPT
Presentation alex
PPT
6 july learning to read reading to learn
PPT
Analyzing language complexity of Chinese and African Learners
PDF
Compound Adjectives in English
PPTX
Using Corpus Linguistics to Teach ESL Pronunication
PDF
Nlp ambiguity presentation
PDF
SYNTAX - PDF
PDF
Saying more with less: 4 ways grammatical metaphor improvesacademic writing
PDF
Journal of Education and Practice
PDF
To dig into_english_forms_issues_group_551019_20
Prep Teachers SSP Handbook 2015
Su 2012 ss syntax(1)
Introduction to linguistics syntax
Generative grammar
Vocabulary by atheer
CHAPTER I.doc
Analyzing language complexity of Chinese and African Learners
Lexical Teaching
Ngu phap
Presentation alex
6 july learning to read reading to learn
Analyzing language complexity of Chinese and African Learners
Compound Adjectives in English
Using Corpus Linguistics to Teach ESL Pronunication
Nlp ambiguity presentation
SYNTAX - PDF
Saying more with less: 4 ways grammatical metaphor improvesacademic writing
Journal of Education and Practice
To dig into_english_forms_issues_group_551019_20
Ad

More from MLReview (13)

PDF
Bayesian Non-parametric Models for Data Science using PyMC
PDF
Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...
PDF
Tutorial on Deep Generative Models
PDF
PixelGAN Autoencoders
PDF
Representing and comparing probabilities: Part 2
PDF
Representing and comparing probabilities
PDF
OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
PDF
Theoretical Neuroscience and Deep Learning Theory
PDF
2017 Tutorial - Deep Learning for Dialogue Systems
PDF
Near human performance in question answering?
PDF
Tutorial on Theory and Application of Generative Adversarial Networks
PDF
Real-time Edge-aware Image Processing with the Bilateral Grid
PDF
Yoav Goldberg: Word Embeddings What, How and Whither
Bayesian Non-parametric Models for Data Science using PyMC
Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...
Tutorial on Deep Generative Models
PixelGAN Autoencoders
Representing and comparing probabilities: Part 2
Representing and comparing probabilities
OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
Theoretical Neuroscience and Deep Learning Theory
2017 Tutorial - Deep Learning for Dialogue Systems
Near human performance in question answering?
Tutorial on Theory and Application of Generative Adversarial Networks
Real-time Edge-aware Image Processing with the Bilateral Grid
Yoav Goldberg: Word Embeddings What, How and Whither

Recently uploaded (20)

PPTX
Introduction to Fisheries Biotechnology_Lesson 1.pptx
PPTX
microscope-Lecturecjchchchchcuvuvhc.pptx
PDF
An interstellar mission to test astrophysical black holes
PPTX
Cell Membrane: Structure, Composition & Functions
PPTX
2. Earth - The Living Planet earth and life
PDF
VARICELLA VACCINATION: A POTENTIAL STRATEGY FOR PREVENTING MULTIPLE SCLEROSIS
PPTX
famous lake in india and its disturibution and importance
PPTX
The KM-GBF monitoring framework – status & key messages.pptx
PDF
Formation of Supersonic Turbulence in the Primordial Star-forming Cloud
PDF
AlphaEarth Foundations and the Satellite Embedding dataset
PPTX
2Systematics of Living Organisms t-.pptx
PPT
The World of Physical Science, • Labs: Safety Simulation, Measurement Practice
PPTX
INTRODUCTION TO EVS | Concept of sustainability
PPTX
GEN. BIO 1 - CELL TYPES & CELL MODIFICATIONS
PPTX
TOTAL hIP ARTHROPLASTY Presentation.pptx
PDF
Placing the Near-Earth Object Impact Probability in Context
PPTX
cpcsea ppt.pptxssssssssssssssjjdjdndndddd
PDF
The scientific heritage No 166 (166) (2025)
PPTX
Protein & Amino Acid Structures Levels of protein structure (primary, seconda...
PPTX
7. General Toxicologyfor clinical phrmacy.pptx
Introduction to Fisheries Biotechnology_Lesson 1.pptx
microscope-Lecturecjchchchchcuvuvhc.pptx
An interstellar mission to test astrophysical black holes
Cell Membrane: Structure, Composition & Functions
2. Earth - The Living Planet earth and life
VARICELLA VACCINATION: A POTENTIAL STRATEGY FOR PREVENTING MULTIPLE SCLEROSIS
famous lake in india and its disturibution and importance
The KM-GBF monitoring framework – status & key messages.pptx
Formation of Supersonic Turbulence in the Primordial Star-forming Cloud
AlphaEarth Foundations and the Satellite Embedding dataset
2Systematics of Living Organisms t-.pptx
The World of Physical Science, • Labs: Safety Simulation, Measurement Practice
INTRODUCTION TO EVS | Concept of sustainability
GEN. BIO 1 - CELL TYPES & CELL MODIFICATIONS
TOTAL hIP ARTHROPLASTY Presentation.pptx
Placing the Near-Earth Object Impact Probability in Context
cpcsea ppt.pptxssssssssssssssjjdjdndndddd
The scientific heritage No 166 (166) (2025)
Protein & Amino Acid Structures Levels of protein structure (primary, seconda...
7. General Toxicologyfor clinical phrmacy.pptx

Deep Learning for Semantic Composition

  • 1. Deep Learning for Semantic Composition Xiaodan Zhu∗ & Edward Grefenstette† ∗National Research Council Canada Queen’s University zhu2048@gmail.com †DeepMind etg@google.com July 30th, 2017 Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 1 / 119
  • 2. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 2 / 119
  • 3. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 3 / 119
  • 4. Principle of Compositionality Principle of compositionality: The meaning of a whole is a function of the meaning of the parts. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 4 / 119
  • 5. Principle of Compositionality Principle of compositionality: The meaning of a whole is a function of the meaning of the parts. While we focus on natural language, compositionality exists not just in language. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 4 / 119
  • 6. Principle of Compositionality Principle of compositionality: The meaning of a whole is a function of the meaning of the parts. While we focus on natural language, compositionality exists not just in language. Sound/music Music notes are composed with some regularity but not randomly arranged to form a song. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 4 / 119
  • 7. Principle of Compositionality Principle of compositionality: The meaning of a whole is a function of the meaning of the parts. While we focus on natural language, compositionality exists not just in language. Sound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision Natural scenes are composed of meaningful components. Artificial visual art pieces often convey certain meaning with regularity from their parts. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 4 / 119
  • 8. Principle of Compositionality Compositionality is regarded by many as a fundamental component of intelligence in addition to language understanding (Miller et al., 1976; Fodor et al., 1988; Bienenstock et al., 1996; Lake et al., 2016). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 5 / 119
  • 9. Principle of Compositionality Compositionality is regarded by many as a fundamental component of intelligence in addition to language understanding (Miller et al., 1976; Fodor et al., 1988; Bienenstock et al., 1996; Lake et al., 2016). For example, Lake et al. (2016) emphasize several essential ingredients for building machines that “learn and think like people”: Compositionality Intuitive physics/psychology Learning-to-learn Causality models Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 5 / 119
  • 10. Principle of Compositionality Compositionality is regarded by many as a fundamental component of intelligence in addition to language understanding (Miller et al., 1976; Fodor et al., 1988; Bienenstock et al., 1996; Lake et al., 2016). For example, Lake et al. (2016) emphasize several essential ingredients for building machines that “learn and think like people”: Compositionality Intuitive physics/psychology Learning-to-learn Causality models Note that many of these challenges present in natural language understanding. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 5 / 119
  • 11. Principle of Compositionality Compositionality is regarded by many as a fundamental component of intelligence in addition to language understanding (Miller et al., 1976; Fodor et al., 1988; Bienenstock et al., 1996; Lake et al., 2016). For example, Lake et al. (2016) emphasize several essential ingredients for building machines that “learn and think like people”: Compositionality Intuitive physics/psychology Learning-to-learn Causality models Note that many of these challenges present in natural language understanding. They are reflected in the sparseness in training a NLP model. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 5 / 119
  • 12. Principle of Compositionality Compositionality is regarded by many as a fundamental component of intelligence in addition to language understanding (Miller et al., 1976; Fodor et al., 1988; Bienenstock et al., 1996; Lake et al., 2016). For example, Lake et al. (2016) emphasize several essential ingredients for building machines that “learn and think like people”: Compositionality Intuitive physics/psychology Learning-to-learn Causality models Note that many of these challenges present in natural language understanding. They are reflected in the sparseness in training a NLP model. Note also that compositionality may be entangled with the other “ingredients” listed above. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 5 / 119
  • 13. Semantic Composition in Natural Language good → very good → not very good → ... Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 6 / 119
  • 14. Semantic Composition in Natural Language Figure: Results from (Zhu et al., 2014). A dot in the figure corresponds to a negated phrase (e.g., not very good) in Stanford Sentiment Treebank (Socher et al., 2013). The y-axis is its sentiment value and x-axis the sentiment of its argument. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 7 / 119
  • 15. Semantic Composition in Natural Language Figure: Results from (Zhu et al., 2014). A dot in the figure corresponds to a negated phrase (e.g., not very good) in Stanford Sentiment Treebank (Socher et al., 2013). The y-axis is its sentiment value and x-axis the sentiment of its argument. Even a one-layer composition, over one dimension of meaning (e.g., semantic orientation (Osgood et al., 1957)), could be a complicated mapping. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 7 / 119
  • 16. Semantic Composition in Natural Language good → very good → not very good → ... senator → former senator → ... basketball player → short basketball player → ... giant → small giant → ... empty/full → half empty/full → almost half empty/full → ...1 1 See more examples in (Partee, 1995). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 8 / 119
  • 17. Semantic Composition in Natural Language Semantic composition in natural language: the task of modelling the meaning of a larger piece of text by composing the meaning of its constituents. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 9 / 119
  • 18. Semantic Composition in Natural Language Semantic composition in natural language: the task of modelling the meaning of a larger piece of text by composing the meaning of its constituents. modelling: learning a representation The compositionality in language is very challenging as discussed above. Compositionality can entangle with other challenges such as those emphasized in (Lake et al., 2016). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 9 / 119
  • 19. Semantic Composition in Natural Language Semantic composition in natural language: the task of modelling the meaning of a larger piece of text by composing the meaning of its constituents. modelling: learning a representation The compositionality in language is very challenging as discussed above. Compositionality can entangle with other challenges such as those emphasized in (Lake et al., 2016). a larger piece of text: a phrase, sentence, or document. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 9 / 119
  • 20. Semantic Composition in Natural Language Semantic composition in natural language: the task of modelling the meaning of a larger piece of text by composing the meaning of its constituents. modelling: learning a representation The compositionality in language is very challenging as discussed above. Compositionality can entangle with other challenges such as those emphasized in (Lake et al., 2016). a larger piece of text: a phrase, sentence, or document. constituents: subword components, words, phrases. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 9 / 119
  • 21. Semantic Composition in Natural Language Semantic composition in natural language: the task of modelling the meaning of a larger piece of text by composing the meaning of its constituents. modelling: learning a representation The compositionality in language is very challenging as discussed above. Compositionality can entangle with other challenges such as those emphasized in (Lake et al., 2016). a larger piece of text: a phrase, sentence, or document. constituents: subword components, words, phrases. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 9 / 119
  • 22. Introduction Two key problems: How to represent meaning? How to learn such a representation? Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 10 / 119
  • 23. Representation Let’s first very briefly revisit the representation we assume in this tutorial ... and leave the learning problem to the entire tutorial that follows. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 11 / 119
  • 24. Representation Let’s first very briefly revisit the representation we assume in this tutorial ... and leave the learning problem to the entire tutorial that follows. Love Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 11 / 119
  • 25. Representation Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 12 / 119
  • 26. Representation love, admiration, satisfaction ... anger, fear, hunger ... Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 12 / 119
  • 27. Representation A viewpoint from The Emotion Machine (Minsky, 2006) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 13 / 119
  • 28. Representation A viewpoint from The Emotion Machine (Minsky, 2006) Each variable responds to different concepts and each concept is represented by different variables. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 13 / 119
  • 29. Representation A viewpoint from The Emotion Machine (Minsky, 2006) Each variable responds to different concepts and each concept is represented by different variables. This is exactly a distributed representation. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 13 / 119
  • 30. Representation Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 14 / 119
  • 31. Modelling Composition Functions How do we model the composition functions? Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 15 / 119
  • 32. Representation Deep Learning for Semantic Composition Deep learning: We focus on deep learning models in this tutorial. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 16 / 119
  • 33. Representation Deep Learning for Semantic Composition Deep learning: We focus on deep learning models in this tutorial. “Wait a minute, deep learning again?” “DL people, leave language along ...” Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 16 / 119
  • 34. Representation Deep Learning for Semantic Composition Deep learning: We focus on deep learning models in this tutorial. “Wait a minute, deep learning again?” “DL people, leave language along ...” Asking some questions may be helpful: Are deep learning models providing nice function or density approximation, the problems that many specific NLP tasks essentially seek to solve? X→Y Are continuous vector representations of meaning effective for (as least some) NLP tasks? Are DL models convenient for computing such continuous representations? Do DL models naturally bridge language with other modalities in terms of both representation and learning? (this could be important.) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 16 / 119
  • 35. Introduction More questions: What NLP problems (e.g., semantic problems here) can be better handled with DL and what cannot? Can NLP benefit from combining DL and other approaches (e.g., symbolic approaches)? In general, has the effectiveness of DL models for semantics already been well understood? Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 17 / 119
  • 36. Introduction Deep Learning for Semantic Composition Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 18 / 119
  • 37. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 19 / 119
  • 38. Formal Semantics Montague Semantics (1970–1973): Treat natural language like a formal language via an interpretation function [[. . .]], and a mapping from CFG rules to function application order. Interpretation of a sentence reduces to logical form via β-reduction. High Level Idea Syntax guides composition, types determine their semantics, predicate logic does the rest. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 20 / 119
  • 39. Formal Semantics Syntactic Analysis Semantic Interpretation S ⇒ NP VP [[VP]]([[NP]]) NP ⇒ cats, milk, etc. [[cats]], [[milk]], . . . VP ⇒ Vt NP [[Vt]]([[NP]]) Vt ⇒ like, hug, etc. λyx.[[like]](x, y), . . . [[like]]([[cats]], [[milk]]) [[cats]] λx.[[like]](x, [[milk]]) λyx.[[like]](x, y) [[milk]] Cats like milk. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 21 / 119
  • 40. Formal Semantics Pros: Intuitive and interpretable(?) representations. Leverage the power of predicate logic to model semantics. Evaluate the truth of statements, derive conclusions, etc. Cons: Brittle, requires robust parsers. Extensive logical model required for evaluation of clauses. Extensive set of rules required to do anything useful. Overall, an intractable (or unappealing) learning problem. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 22 / 119
  • 41. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 23 / 119
  • 42. Simple Parametric Models Basic models with pre-defined function form (Mitchell et al., 2008): General form : p = f (u, v, R, K) Add : p = u + v WeightAdd : p = αT u + βT v Multiplicative : p = u ⊗ v Combined : p = αT u + βT v + γT (u ⊗ v) We will see later in this tutorial that the above models could be seen as special cases of more complicated composition models. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 24 / 119
  • 43. Results Reference (R): The color ran. High-similarity landmark (H): The color dissolved. Low-similarity landmark (L): The color galloped. A good composition model should give the above R-H pair a similarity score higher than that given to the R-L pair. Also, a good model should assign such similarity scores with a high correlation (ρ) to what human assigned. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 25 / 119
  • 44. Results Reference (R): The color ran. High-similarity landmark (H): The color dissolved. Low-similarity landmark (L): The color galloped. A good composition model should give the above R-H pair a similarity score higher than that given to the R-L pair. Also, a good model should assign such similarity scores with a high correlation (ρ) to what human assigned. Models R-H similarity R-L similarity ρ NonComp 0.27 0.26 0.08** Add 0.59 0.59 0.04* WeightAdd 0.35 0.34 0.09** Kintsch 0.47 0.45 0.09** Multiply 0.42 0.28 0.17** Combined 0.38 0.28 0.19** UpperBound 4.94 3.25 0.40** Table: Mean cosine similarities for the R-H pairs and R-L pairs as well as the correlation coefficients (ρ) with human judgments (*: p < 0.05, **: p < 0.01). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 25 / 119
  • 45. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 26 / 119
  • 46. Parameterizing Composition Functions To move beyond simple algebraic or parametric models we need function approximators which, ideally: Can approximate any arbitrary function (e.g. ANNs). Can cope with variable size sequences. Can capture long range or unbounded dependencies. Can implicitly or explicitly model structure. Can be trained against a supervised or unsupervised objective (or both — semi-supervised training). Can be trained chiefly or primarily through backpropagation. A Neural Network Model Zoo This section presents a selection of models satisfying some (if not all) of these criteria. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 27 / 119
  • 47. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 28 / 119
  • 48. Recurrent Neural Networks Bounded Methods Many methods impose explicit or implicit length limits on conditioning information. For example: order-n Markov assumption in NLM/LBL fully-connected layers and dynamic pooling in conv-nets wj f(w1:j) hj-1 hj Recurrent Neural Networks introduce a repeatedly composable unit, the recurrent cell, which both models an unbounded sequence prefix and express a function over it. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 29 / 119
  • 49. The Mathematics of Recurrence wj f(w1:j) hj-1 hj previous state next state inputs outputs Building Blocks An input vector wj ∈ R|w| A previous state hj−1 ∈ R|h| A next state hj ∈ R|h| An output yj ∈ R|y| fy : R|w| × R|h| → R|y| fh : R|w| × R|h| → R|h| Putting it together hj = fh(wj , hj−1) yj = fy (wj , hj ) So yj = fy (wj , fh(wj−1, hj−1)) = fy (wj , fh(wj−1, fh(wj−2, hj−2))) = . . . Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 30 / 119
  • 50. RNNs for Language Modelling Language modelling We want to model the joint probability of tokens t1, . . . tn in a sequence: P(t1, . . . tn) = P(t1) n i=2 P(ti |t1, . . . ti−1) Adapting a recurrence for basic LM For vocab V, define an embedding matrix E ∈ R|V |×|w| and a logit projection matrix WV ∈ R|y|×|V |. Then: wj = embed(tj , E) yj = fy (wj , hj ) hj = fh(wj , hj−1) pj = softmax(yj WV ) P(tj+1|t1, . . . , tj ) = Categorical(tj+1; pj ) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 31 / 119
  • 51. Aside: The Vanishing Gradient Problem and LSTM RNNs RNN is deep “by time”, so it could seriously suffer from the vanishing gradient issue. LSTM configures memory cells and multiple “gates” to control information flow. If properly learned, LSTM can keep pretty long-distance (hundreds of time steps) information in memory. Memory-cell details: it = σ(Wxi xt + Whi ht−1 + Wci ct−1) ft = σ(Wxf xt + Whf ht−1 + Wcf ct−1) ct = σ(ftct−1 + ittanh(Wxc xt + Whc ht−1)) ot = σ(Wxoxt + Whoht−1 + Wcoct) ht = σ(ottanh(ct)) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 32 / 119
  • 52. Conditional Language Models Conditional Language Modelling A strength of RNNs is that hj can model not only the history of the generated/observed sequence t1, . . . , tj , but any conditioning information β, e.g. by setting h0 = β. w1 w2 w3 w1 w2 w3 Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 33 / 119
  • 53. Encoder-Decoder Models with RNNs Les chiens aiment les os ||| Dogs love bones Dogs love bones </s> Source sequence Target sequence cf. Kalchbrenner et al., 2013; Sutskever et al., 2014 Model p(t1, . . . , tn|s1, . . . , sm) he i = RNNencoder (si , he i−1) hd i = RNNdecoder (ti , hd i−1) hd 0 = he m ti+1 ∼ Categorical(t; fV (hi )) The encoder RNN as a composition module All information needed to transduce the source into the target sequence using RNNdecoder needs to be present in the start state hd 0 . This start state is produced by RNNencoder , which will learn to compose. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 34 / 119
  • 54. RNNs as Sentence Encoders This idea of RNNs as sentence encoder works for classification as well: Data is labelled sequences (s1, . . . , s|s|; ˆy). RNN is run over s to produce final state h|s| = RNN(s). A differentiable function of h|s| classifies: y = fθ(h|s|) h|s| can be taken to be the composed meaning of s, with regard to the task at hand. An aside: Bi-directional RNN encoders For both sequence classification and generation, sometimes a Bi-directional RNN is used to encode: h← i = RNN← (si , h← i+1) h→ i = RNN→ (si , h→ i−1) h|s| = concat(h← 1 , h→ |s|) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 35 / 119
  • 55. A Transduction Bottleneck Les chiens aiment les os ||| Dogs love bones Dogs love bones </s> Source sequence Target sequence Single vector representation of sentences causes problems: Training focusses on learning marginal language model of target language first. Longer input sequences cause compressive loss. Encoder gets significantly diminished gradient. In the words of Ray Mooney. . . “You can’t cram the meaning of a whole %&!$ing sentence into a single $&!*ing vector!” Yes, the censored-out swearing is copied verbatim. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 36 / 119
  • 56. Attention Les chiens aiment les os ||| Dogs love bones Dogs love bones </s> Source sequence Target sequence cf. Bahdanau et al., 2014 We want to use he 1, . . . , he m when predicting ti by conditioning on words that might relate to ti : 1 Compute hd i (RNN update) 2 eij = fatt(hd i , he j ) 3 aij = softmax(ei )j 4 hatt i = m j=1 aij he j 5 ˆhi = concat(hd i , hatt i ) 6 ti+1 ∼ Categorical(t; fV (ˆhi )) The many faces of attention Many variants on the above process: early attention (based on hd i−1 and ti , used to update hd i ), different attentive functions fatt (e.g. based on projected inner products, or MLPs), and so on. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 37 / 119
  • 57. Attention and Composition We refer to the set of source activation vectors he 1, . . . , he m in the previous slides as an attention matrix. Is it a suitable sentence representation? Pros: Locally compositional: vectors contain information about other words (especially with bi-directional RNN as encoder). Variable size sentence representation: longer sentences yield larger representation with more capacity. Cons: Single vector representation of sentences is convenient (many decoders, classifiers, etc. expect fixed-width feature vectors as input) Locally compositional, but are long range dependencies resolved in the attention matrix? Does it truly express the sentence’s meaning as a semantic unit (or is it just good for sequence transduction)? Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 38 / 119
  • 58. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 39 / 119
  • 59. Recursive Neural Networks Recursive networks: a generalization of (chain) recurrent networks with a computational graph, often a tree (Pollack, 1990; Francesconi et al., 1997; Socher et al., 2011a,b,c, 2013; Zhu et al., 2015b) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 40 / 119
  • 60. Recursive Neural Networks Successfully applied to consider input data structures. Natural language processing (Socher et al., 2011a,c; Le et al., 2015; Tai et al., 2015; Zhu et al., 2015b) Computer vision (Socher et al., 2011b) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 41 / 119
  • 61. Recursive Neural Networks Successfully applied to consider input data structures. Natural language processing (Socher et al., 2011a,c; Le et al., 2015; Tai et al., 2015; Zhu et al., 2015b) Computer vision (Socher et al., 2011b) How to determine the structures. Encode given “external” knowledge about the structure of the input data, e.g., syntactic structures; modelling sentential semantics and syntax is one of the most interesting problems in language. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 41 / 119
  • 62. Recursive Neural Networks Successfully applied to consider input data structures. Natural language processing (Socher et al., 2011a,c; Le et al., 2015; Tai et al., 2015; Zhu et al., 2015b) Computer vision (Socher et al., 2011b) How to determine the structures. Encode given “external” knowledge about the structure of the input data, e.g., syntactic structures; modelling sentential semantics and syntax is one of the most interesting problems in language. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 41 / 119
  • 63. Recursive Neural Networks Successfully applied to consider input data structures. Natural language processing (Socher et al., 2011a,c; Le et al., 2015; Tai et al., 2015; Zhu et al., 2015b) Computer vision (Socher et al., 2011b) How to determine the structures. Encode given “external” knowledge about the structure of the input data, e.g., syntactic structures; modelling sentential semantics and syntax is one of the most interesting problems in language. Encode simply a complete tree. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 41 / 119
  • 64. Integrating Syntactic Parses in Composition Recursive Neural Tensor Network (Socher et al., 2012): The structure is given (here by a constituency parser.) Each node here is implemented as a regular feed-forward layer plus a 3rd -order tensor. The tensor captures 2nd -degree (quadratic) polynomial interaction of children, e.g., b2 i , bi cj , and c2 j . Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 42 / 119
  • 65. Results The models have been successfully applied to a number of tasks such as sentiment analysis (Socher et al., 2013). Table: Accuracy for fine grained (5-class) and binary predictions at the sentence level (root) and for all nodes. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 43 / 119
  • 66. Tree-LSTM Tree-structured LSTM (Le, *SEM-15; Tai, ACL-15; Zhu, ICML-15): It is an extension of chain LSTM to tree structures. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 44 / 119
  • 67. Tree-LSTM Tree-structured LSTM (Le, *SEM-15; Tai, ACL-15; Zhu, ICML-15): It is an extension of chain LSTM to tree structures. If your have a non-binary tree, a simple solution is to binarize it. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 44 / 119
  • 68. Tree-LSTM Application: Sentiment Analysis Sentiment composed over a constituency parse tree: Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 45 / 119
  • 69. Tree-LSTM Application: Sentiment Analysis Results on Stanford Sentiment Treebank (Zhu et al., 2015b): Models roots phrases NB 41.0 67.2 SVM 40.7 64.3 RvNN 43.2 79.0 RNTN 45.7 80.7 Tree-LSTM 48.9 81.9 Table: Performances (accuracy) of models on Stanford Sentiment Treebank, at the sentence level (roots) and the phrase level. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 46 / 119
  • 70. Tree-LSTM Application: Natural Language Inference Applied to Natural Language Inference (NLI): Determine if a sentence entails another, if they contradict, or have no relation (Chen et al., 2017). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 47 / 119
  • 71. Tree-LSTM Application: Natural Language Inference Accuracy on Stanford Natural Language Inference (SNLI) dataset: (Chen et al., 2017) * Welcome to the poster at 6:00-9:30pm on July 31. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 48 / 119
  • 72. Learning Representation for Natural Language Inference RepEval-2017 Shared Task (Williams et al., 2017): Learn sentence representation as a fixed-length vector. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 49 / 119
  • 73. Tree-LSTM without Syntactic Parses How if we simply apply recursive networks over trees that are not generated from syntactic parses, e.g., a complete binary trees? Multiple efforts on SNLI (Munkhdalai et al., 2016; Chen et al., 2017) have observed that the models outperform sequential (chain) LSTM. This could be related to the discussion that recursive nets may capture long-distance dependency (Goodfellow et al., 2016). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 50 / 119
  • 74. SPINN: Doing Away with Test-Time Trees buffer stack t = 0 down sat cat the shift t = 1 down sat cat the shift t = 2 down sat cat the reduce t = 3 down sat the cat shift t = 4 down sat the cat shift t = 5 down sat the cat reduce t = 6 sat down the cat reduce t = 7 = T (the cat) (sat down) output to model for semantic task Image credit: Sam Bowman and co-authors. cf. Bowman et al., 2016 Shift-Reduce Parsers: Exploit isomorphism between binary branching trees with T leaves and sequences of 2T − 1 binary shift/reduce actions. Shift unattached leaves from a buffer onto a processing stack. Reduce the top two child nodes on the stack to a single parent node. SPINN: Jointly train a TreeRNN and a vector-based shift-reduce parser. Training time trees offer supervision for shift-reduce parser. No need for test time trees! Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 51 / 119
  • 75. SPINN:Doing Away with Test-Time Trees buffer down sat stack cat the composition tracking transition reduce down sat the cat composition tracking transition shift down sat the cat tracking Image credit: Sam Bowman and co-authors. Word vectors start on buffer b (top: first word in sentence). Shift moves word vectors from buffer to stack s. Reduce pops top two vectors off the stack, applies f R : Rd × Rd → Rd , and pushes the result back to the stack (i.e. TreeRNN composition). Tracker LSTM tracks parser/composer state across operations, decides shift-reduce operations a, is supervised by both observed shift-reduce operations and end-task: ht = LSTM(f C (bt−1[0], st−1[0], st−1[1]), ht−1) at ∼ f A (ht) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 52 / 119
  • 76. A Quick Introduction to Reinforce What if some part of our process is not differentiable (e.g. samples from the shift-reduce module in SPINN) but we want to learn with no labels. . . x y x y z p(y|x) = Epθ(z|x) [fφ(z, x)] s.t. y ∼ fφ(z, x) or y = fφ(z, x) φp(y|x) = z pθ(z|x) φfφ(z, x) = Epθ(z|x) [ φfφ(z, x)] θp(y|x) = z fφ(z, x) θpθ(z|x) = ??? Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 53 / 119
  • 77. A Quick Introduction to Reinforce The Reinforce Trick (R. J. Williams, 1992) θ log pθ(z|x) = θpθ(z|x) pθ(z|x) ⇒ θpθ(z|x) = pθ(z|x) θ log pθ(z|x) θp(y|x) = z fφ(z, x) θpθ(z|x) = z fφ(z, x)pθ(z|x) θ log pθ(z|x) = Epθ(z|x) [fφ(z, x) θ log pθ(z|x)] This naturally extends to cases where p(z|x) = p(z1, . . . , zn|x). RL vocab: samples of such sequences of of discrete actions are referred to as “traces”. We often refer to pθ(z|x) as a policy πθ(z; x). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 54 / 119
  • 78. SPINN+RL: Doing Away with Training-Time Trees “Drop in” extension to SPINN (Yogatama et al., 2016): Treat at ∼ f A(ht) as policy πA θ (at; ht), trained via Reinforce. Reward is negated loss of the end task, e.g. log-likelihood of the correct label. Everything else is trained by backpropagation against the end task: tracker LSTM, representations, etc. receive gradient both from the supervised objective, and from Reinforce via the shift-reduce policy. a wo man wea ring sun glas ses is frow ning . a boy drag s his sled s thro ugh the sno w . Model recovers linguistic-like structures (e.g. noun phrases, auxiliary verb-verb pairing, etc.). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 55 / 119
  • 79. SPINN+RL: Doing Away with Training-Time Trees Does RL-SPINN work? According to Yogatama et al. (2016): Better than LSTM baselines: model captures and exploits structure. Better than SPINN benchmarks: model is not biased by what linguists think trees should be like, only has a loose inductive biase towards tree structures. But some parses do not reflect order of composition (see below). Semi-supervised setup may be sensible. two men are playi ng frisb ee in the park . fami ly me mbe rs stan ding outs ide a hom e . Some “bad” parses, but not necessarily worse results. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 56 / 119
  • 80. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 57 / 119
  • 81. Convolution Neural Networks Visual Inspiration: How do we learn to recognise pictures? Will a fully connected neural network do the trick? 8 Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 58 / 119
  • 82. ConvNets for pictures Problem: lots of variance that shouldn’t matter (position, rotation, skew, difference in font/handwriting). 8 8 8 8 8 8Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 59 / 119
  • 83. ConvNets for pictures Solution: Accept that features are local. Search for local features with a window. 8 Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 60 / 119
  • 84. ConvNets for pictures Convolutional window acts as a classifer for local features. ⇒ Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 61 / 119
  • 85. ConvNets for pictures Different convolutional maps can be trained to recognise different features (e.g. edges, curves, serifs). ... Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 62 / 119
  • 86. ConvNets for pictures Stacked convolutional layers learn higher-level features. Fully Connected Layer Convolutional Layer 8 8Raw Image First Order Local Features Higher Order Features Prediction One or more fully-connected layers learn classification function over highest level of representation. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 63 / 119
  • 87. ConvNets for language Convolutional neural networks fit natural language well. Deep ConvNets capture: Positional invariances Local features Hierarchical structure Language has: Some positional invariance Local features (e.g. POS) Hierarchical structure (phrases, dependencies) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 64 / 119
  • 88. ConvNets for language How do we go from images to sentences? Sentence matrices! w1 w2 w3 w4 w5 Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 65 / 119
  • 89. ConvNets for language Does a convolutional window make sense for language? w1 w2 w3 w4 w5 Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 66 / 119
  • 90. ConvNets for language A better solution: feature-specific windows. w1 w2 w3 w4 w5 Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 67 / 119
  • 91. Word Level Sentence Vectors with ConvNets K-Max pooling (k=3) Fully connected layer Folding Wide convolution (m=2) Dynamic k-max pooling (k= f(s) =5) Projected sentence matrix (s=7) Wide convolution (m=3) game's the same, just got more fierce cf. Kalchbrenner et al., 2014 Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 68 / 119
  • 92. Character Level Sentence Vectors with ConvNets Image credit: Yoon Kim and co-authors. cf. Kim et al., 2016 Naively, we could just represent everything at character level. Convolutions seem to work well for low-level patterns (e.g. morphology) One interpretation: multiple filters can capture the low-level idiosyncrasies of natural language (e.g. arbitrary spelling) whereas language is more compositional at a higher level. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 69 / 119
  • 93. ConvNet-like Architectures for Composition t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t11 t12 t13 t14 t15 t16t10 s0 s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 s16 t11 t12 t13 t14 t15 t16 t17t10t9t8t7t6t5t4t3t2t1 Image credit: Nal Kalchbrenner and co-authors. cf. Kalchbrenner et al., 2016 Many other CNN-like architectures (e.g. ByteNet from Kalchbrenner et al. (2016)) Common recipe components: dilated convolutions and ResNet blocks. These model sequences well in domains like speech, and are beginning to find applications in NLP, so worth reading up on. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 70 / 119
  • 94. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 71 / 119
  • 95. Unsupervised Composition Models Why care about unsupervised learning? Much more unlabelled linguistic data than labelled data. Learn general purpose representations and composition functions. Suitable pre-training for supervised models, semi-supervised, or multi-task objectives. In the (paraphrased) words of Yann LeCun: unsupervised learning is a cake, supervised learning is frosting, and RL is the cherry on top! Plot twist: it’s possibly a cherry cake. Yes, that’s nice. . . But what are we doing, concretely? Good question! Usually, just modelling—directly or indirectly—some aspect of the probability of the observed data. Further suggestions on a postcard, please! Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 72 / 119
  • 96. Autoencoders Autoencoders provide an unsupervised method for representation learning: We minimise an objective function over inputs xi , i ∈ N and their reconstructions xi : J = 1 2 N i xi − xi 2 Warning: degenerate solution if xi can be updated (∀i.xi = 0). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 73 / 119
  • 97. Recursive Autoencoders cf. Socher et al., 2011a To auto-encode variable length sequences, we can chain autoencoders to create a recursive structure. Objective Function Minimizing the reconstruction error will learn a compression function over the inputs: Erec(i, θ) = 1 2 xi − xi 2 A “modern” alternative: use sequence to sequence model, and log-likelihood objective. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 74 / 119
  • 98. What’s wrong with auto-encoders? Empirically, narrow auto-encoders produce sharp latent codes, and unregularised wide auto-encoders learn identity functions. The reconstruction objective says nothing about distance preservation in latent space: no guarantee that dist(a, b) ≤ dist(a, c) → dist(encode(a), encode(b)) ≤ dist(encode(a), encode(c)). Conversely, there is little incentive for similar latent codes to generate radically different (but semantically equivalent) observations. Ultimately, compression ≠ meaning. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 75 / 119
  • 99. Skip-Thought Image credit: Jamie Kiros and co-authors. cf. Kiros et al., 2015 Similar to the auto-encoding objective: encode a sentence, but decode the neighbouring sentences. A pair of LSTM-based seq2seq models with a shared encoder, but alternative formulations are possible. Conceptually similar to distributional semantics: a unit’s representation is a function of its neighbouring units, except the units are sentences instead of words. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 76 / 119
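A hedged sketch of the Skip-Thought setup as described above: one encoder is shared by two decoders that predict the previous and the next sentence. The exact conditioning, vocabulary handling, and dimensions below are simplified assumptions, not the configuration of Kiros et al. (2015):

```python
import torch
import torch.nn as nn

class SkipThought(nn.Module):
    """Shared sentence encoder; two decoders reconstruct the previous and next sentences."""
    def __init__(self, vocab=10000, d_emb=300, d_hid=600):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_emb)
        self.encoder = nn.LSTM(d_emb, d_hid, batch_first=True)
        self.dec_prev = nn.LSTM(d_emb, d_hid, batch_first=True)
        self.dec_next = nn.LSTM(d_emb, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, vocab)

    def forward(self, cur, prev, nxt):
        _, state = self.encoder(self.emb(cur))            # state = (h, c); h is the sentence vector
        out_prev, _ = self.dec_prev(self.emb(prev), state)
        out_next, _ = self.dec_next(self.emb(nxt), state)
        return self.out(out_prev), self.out(out_next), state[0].squeeze(0)

model = SkipThought()
cur, prev, nxt = (torch.randint(0, 10000, (8, 20)) for _ in range(3))
logits_prev, logits_next, sent_vec = model(cur, prev, nxt)
# Train with cross-entropy (teacher forcing) on logits_prev and logits_next; keep sent_vec as the representation.
```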
  • 100. Variational Auto-Encoders Semantically Weak Codes Generally, auto-encoders sparsely encode or densely compress information. No pressure to ensure a similarity continuum amongst codes. Factorized Generative Picture $p(x) = \int p(x, z)\,dz = \int p(x|z)\,p(z)\,dz = \mathbb{E}_{p(z)}[p(x|z)]$, where $z \sim \mathcal{N}(0, I)$ generates $x$. The prior on z enforces a semantic continuum (e.g. no arbitrarily unrelated codes for similar data), but the expectation is typically intractable to compute exactly, and a Monte Carlo estimate of the gradients will have high variance. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 77 / 119
  • 101. Variational Auto-Encoders Goal Estimate, by maximising p(x): The parameters θ of a function modelling part of the generative process $p_\theta(x|z)$ given samples from a fixed prior $z \sim p(z)$. The parameters φ of a distribution $q_\phi(z|x)$ approximating the true posterior $p(z|x)$. How do we do it? We maximise p(x) via a variational lower bound (VLB): $\log p(x) \geq \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,p(z))$ Equivalently we can minimise NLL(x): $\mathrm{NLL}(x) \leq \mathbb{E}_{q_\phi(z|x)}[\mathrm{NLL}_\theta(x|z)] + D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,p(z))$ Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 78 / 119
  • 102. Variational Auto-Encoders Let’s derive the VLB: $\log p(x) = \log \int p_\theta(x|z)\,p(z)\,dz = \log \int \frac{q_\phi(z|x)}{q_\phi(z|x)}\,p_\theta(x|z)\,p(z)\,dz = \log \mathbb{E}_{q_\phi(z|x)}\left[\frac{p(z)}{q_\phi(z|x)}\,p_\theta(x|z)\right] \geq \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p(z)}{q_\phi(z|x)} + \log p_\theta(x|z)\right] = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,p(z))$ (the inequality is Jensen’s). For the right choice of $q_\phi(z|x)$ and $p(z)$ (e.g. Gaussians) there is a closed-form expression for $D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,p(z))$. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 79 / 119
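For a diagonal Gaussian posterior $q_\phi(z|x) = \mathcal{N}(z; \mu, \operatorname{diag}(\sigma^2))$ and a standard Gaussian prior $p(z) = \mathcal{N}(0, I)$, the closed form referred to above is the standard expression

$$D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p(z)\big) = \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right).$$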
  • 103. Variational Auto-Encoders The problem of stochastic gradients Estimating $\frac{\partial}{\partial \phi}\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ requires backpropagating through samples $z \sim q_\phi(z|x)$. For some choices of q, such as Gaussians, there are reparameterization tricks (cf. Kingma et al., 2013). Reparameterizing Gaussians (Kingma et al., 2013) $z \sim \mathcal{N}(z; \mu, \sigma^2)$ is equivalent to $z = \mu + \sigma \odot \epsilon$ where $\epsilon \sim \mathcal{N}(\epsilon; 0, I)$. Trivially: $\frac{\partial z}{\partial \mu} = 1$, $\frac{\partial z}{\partial \sigma} = \epsilon$. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 80 / 119
  • 104. Variational Auto-Encoders for Sentences 1 Observe a sentence w1, . . . , wn. Encode it, e.g. with an LSTM: $h_e = \mathrm{LSTM}_e(w_1, \dots, w_n)$ 2 Predict $\mu = f^\mu(h_e)$ and $\sigma^2 = f^\sigma(h_e)$ (in practice we operate in log space for $\sigma^2$ by predicting $\log \sigma$). 3 Sample $z \sim q(z|x) = \mathcal{N}(z; \mu, \sigma^2)$. 4 Use a conditional RNN to decode and measure $\log p(x|z)$. Use the closed-form formula for the KL divergence between two Gaussians to calculate $-D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,p(z))$. Add both to obtain the maximisation objective. 5 Backpropagate the gradient through the decoder normally based on the log-likelihood component of the objective, and use the reparameterisation trick to backpropagate through the sampling operation back to the encoder. 6 The gradient of the KL divergence component of the loss with regard to the encoder parameters is straightforward backpropagation. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 81 / 119
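A compact PyTorch sketch of steps 1-5 (single-layer LSTMs, the layer sizes, and conditioning the decoder by mapping z to its initial hidden state are our simplifying assumptions; the class name SentenceVAE is hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceVAE(nn.Module):
    def __init__(self, vocab=10000, d_emb=256, d_hid=512, d_z=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_emb)
        self.encoder = nn.LSTM(d_emb, d_hid, batch_first=True)
        self.to_mu = nn.Linear(d_hid, d_z)
        self.to_logvar = nn.Linear(d_hid, d_z)            # predict log sigma^2 for stability
        self.z_to_h = nn.Linear(d_z, d_hid)               # condition the decoder on z
        self.decoder = nn.LSTM(d_emb, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, vocab)

    def forward(self, words):
        _, (h_e, _) = self.encoder(self.emb(words))                    # step 1: encode the sentence
        mu, logvar = self.to_mu(h_e[-1]), self.to_logvar(h_e[-1])      # step 2: predict mu and log sigma^2
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)        # step 3: reparameterised sample
        h0 = torch.tanh(self.z_to_h(z)).unsqueeze(0)                   # step 4: decode conditioned on z
        dec_out, _ = self.decoder(self.emb(words[:, :-1]), (h0, torch.zeros_like(h0)))
        nll = F.cross_entropy(self.out(dec_out).transpose(1, 2),
                              words[:, 1:], reduction='none').sum(1)   # -log p(x|z)
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(1)      # closed-form KL to N(0, I)
        return (nll + kl).mean()                                       # negative variational lower bound

model = SentenceVAE()
batch = torch.randint(0, 10000, (16, 21))      # token ids; position 0 plays the role of <bos>
loss = model(batch)
loss.backward()                                # step 5: gradients flow through z via the reparameterisation
```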
  • 105. Variational Auto-Encoders and Autoregressivity The problem of powerful auto-regressive decoders We want to minimise $\mathrm{NLL}(x) \leq \mathbb{E}_{q(z|x)}[\mathrm{NLL}(x|z)] + D_{\mathrm{KL}}(q(z|x)\,\|\,p(z))$. What if the decoder is powerful enough to model x without using z? A degenerate solution: If z can be ignored when minimising the reconstruction loss of x given z, the model can safely let q(z|x) collapse to the prior p(z) to minimise $D_{\mathrm{KL}}(q(z|x)\,\|\,p(z))$. Since q need not depend on x (e.g. the encoder can just ignore x and predict the mean and variance of the prior), z bears no relation to x. Result: useless encoder, useless latent variable. Is this really a problem? If your decoder is not auto-regressive (e.g. MLPs expressing the probability of pixels which are conditionally independent given z), then no. If your decoder is an RNN and the domain has systematic patterns, then yes. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 82 / 119
  • 106. Variational Auto-Encoders and Autoregressivity What are some solutions to this problem? Pick a non-autoregressive decoder. If you care more about the latent code than having a good generative model (e.g. document modelling), this isn’t a bad idea, but it is frustrating if this is the only solution. KL Annealing: set $\mathbb{E}_{q(z|x)}[\mathrm{NLL}(x|z)] + \alpha D_{\mathrm{KL}}(q(z|x)\,\|\,p(z))$ as the objective. Start with α = 0 (a basic seq2seq model). Increase α to 1 over time during training. Works somewhat, but it is an unprincipled change of the objective function. Set as objective $\mathbb{E}_{q(z|x)}[\mathrm{NLL}(x|z)] + \max(\lambda, D_{\mathrm{KL}}(q(z|x)\,\|\,p(z)))$ where λ ≥ 0 is a scalar or vector hyperparameter. Once the KL dips below λ, there is no benefit, so the model must rely on z to some extent. This objective is still a valid upper bound on NLL(x) (albeit a looser one). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 83 / 119
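The last two fixes amount to a one-line change in the loss. A hedged sketch (whether λ is applied to the summed KL, as here, or per latent dimension varies across implementations):

```python
import torch

def vae_objective(nll, kl, step, anneal_steps=10000, free_bits=0.0):
    """nll, kl: per-example tensors of shape (batch,). Combines KL annealing with the max(lambda, KL) floor."""
    alpha = min(1.0, step / anneal_steps)           # KL annealing: alpha goes from 0 to 1 during training
    kl_floored = torch.clamp(kl, min=free_bits)     # max(lambda, KL): no reward for pushing KL below lambda
    return (nll + alpha * kl_floored).mean()

nll, kl = torch.rand(16) * 50.0, torch.rand(16) * 5.0
loss = vae_objective(nll, kl, step=2000, anneal_steps=10000, free_bits=1.0)
```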
  • 107. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 84 / 119
  • 108. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 85 / 119
  • 109. Compositional or Non-compositional Representation Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 86 / 119
  • 110. Compositional or Non-compositional Representation Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 87 / 119
  • 111. Compositional or Non-compositional Representation Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 88 / 119
  • 112. Compositional or Non-compositional Representation Such “hard” or “soft” non-compositionality exists at different granularities of text. We will discuss some models on how to handle this at the word-phrase level. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 88 / 119
  • 113. Compositional and Non-compositional Semantics Compositionality/non-compositionality is a common phenomenon in language. A framework that is able to consider both compositionality and non-compositionality is of interest. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 89 / 119
  • 114. Compositional and Non-compositional Semantics Compositionality/non-compositionality is a common phenomenon in language. A framework that is able to consider both compositionality and non-compositionality is of interest. A pragmatic viewpoint: If one is able to obtain holistically the representation of an n-gram or a phrase in text, it would be desirable that a composition model has the ability to decide the sources of knowledge it will use. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 89 / 119
  • 115. Compositional and Non-compositional Semantics Compositionality/non-compositionality is a common phenomenon in language. A framework that is able to consider both compositionality and non-compositionality is of interest. A pragmatic viewpoint: If one is able to obtain holistically the representation of an n-gram or a phrase in text, it would be desirable that a composition model has the ability to decide the sources of knowledge it will use. In addition to composition, considering non-compositionality may avoid unnecessarily back-propagating errors that confuse the word embeddings. Think about the “kick the bucket” example. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 89 / 119
  • 116. Integrating Compositional and Non-compositional Semantics Integrating non-compositionality in recursive networks (Zhu et al., 2015a): Basic idea: Enabling individual composition operations to choose information from different sources, compositional or non-compositional (e.g., holistically learned). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 90 / 119
  • 117. Integrating Compositional and Non-compositional Semantics Model 1: Regular bilinear merge (Zhu et al., 2015a): Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 91 / 119
  • 118. Integrating Compositional and Non-compositional Semantics Model 2: Tensor-based merging (Zhu et al., 2015a) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 92 / 119
  • 119. Integrating Compositional and Non-compositional Semantics Model 3: Explicitly gated merging (Zhu et al., 2015a): Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 93 / 119
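A rough reading of the explicitly gated variant (Model 3): a learned gate decides, dimension by dimension, how much to take from the compositionally built phrase vector and how much from the holistically learned (non-compositional) embedding of the same phrase. The sketch below is our minimal illustration of that idea, not the exact parameterisation of Zhu et al. (2015a):

```python
import torch
import torch.nn as nn

class GatedMerge(nn.Module):
    """Hypothetical gate mixing a composed phrase vector with a holistically learned phrase embedding."""
    def __init__(self, d=100):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)

    def forward(self, h_comp, h_holistic):
        g = torch.sigmoid(self.gate(torch.cat([h_comp, h_holistic], dim=-1)))
        return g * h_comp + (1.0 - g) * h_holistic    # per-dimension choice of knowledge source

merge = GatedMerge(d=100)
h_comp = torch.randn(4, 100)        # composed from the parts, e.g. "kick" + "the bucket"
h_holistic = torch.randn(4, 100)    # holistically learned embedding of "kick the bucket"
print(merge(h_comp, h_holistic).shape)   # torch.Size([4, 100])
```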
  • 120. Experiment Set-Up Task: sentiment analysis Data: Stanford Sentiment Treebank Non-compositional sentiment Sentiment of n-grams automatically learned from tweets (Mohammad et al., 2013). Polled the Twitter API every four hours from April to December 2012 in search of tweets with either a positive word hashtag or a negative word hashtag. 78 seed hashtags (32 positive and 36 negative) such as #good, #excellent, and #terrible were used to annotate sentiment. 775,000 tweets that contain at least one positive or negative hashtag were used as the learning corpus. Point-wise mutual information (PMI) is calculated for each bigram and trigram. Each sentiment score is converted to a one-hot vector; e.g. a bigram with a score of -1.5 will be assigned a 5-dimensional vector [0, 1, 0, 0, 0] (i.e., the second basis vector). The human annotations that come with the Stanford Sentiment Treebank are also used for bigrams and trigrams. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 94 / 119
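The conversion of a real-valued PMI sentiment score into the 5-dimensional one-hot vector can be sketched as follows; the bucket boundaries are illustrative assumptions, and only the example from the slide (score -1.5 maps to [0, 1, 0, 0, 0]) is taken from the source:

```python
def score_to_onehot(score, boundaries=(-2.0, -0.5, 0.5, 2.0)):
    """Map a real-valued sentiment score to a 5-dimensional one-hot bucket vector.

    The boundaries are made up for illustration; they split the score range into 5 bins,
    from very negative (bucket 0) to very positive (bucket 4).
    """
    bucket = sum(score > b for b in boundaries)
    vec = [0] * 5
    vec[bucket] = 1
    return vec

print(score_to_onehot(-1.5))   # [0, 1, 0, 0, 0], matching the slide's example
```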
  • 121. Results
Models | sentence-level (roots) | all phrases (all nodes)
(1) RNTN | 42.44 | 79.95
(2) Regular-bilinear (auto) | 42.37 | 79.97
(3) Regular-bilinear (manu) | 42.98 | 80.14
(4) Explicitly-gated (auto) | 42.58 | 80.06
(5) Explicitly-gated (manu) | 43.21 | 80.21
(6) Confined-tensor (auto) | 42.99 | 80.49
(7) Confined-tensor (manu) | 43.75† | 80.66†
Table: Model performances (accuracy) on predicting 5-category sentiment at the sentence (root) level and the phrase level. The results are based on version 3.3.0 of the Stanford CoreNLP. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 95 / 119
  • 122. Integrating Compositional and Non-compositional Semantics We have discussed integrating non-compositionality in recursive networks. What if no prior input structures are available? Remember we have discussed models that capture hidden structures. What if a syntactic parse tree is not very reliable, e.g., for data like social media text or speech transcripts? In these situations, how can we still consider non-compositionality in the composition process? Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 96 / 119
  • 123. Integrating Compositional and Non-compositional Semantics Integrating non-compositionality in chain recurrent networks (Zhu et al., 2016) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 97 / 119
  • 124. Integrating Compositional and Non-compositional Semantics Non-compositional nodes: Form the non-compositional paths (e.g., 3-8-9 or 4-5-9). Allow the embedding spaces of a non-compositional node to be different from those of a compositional node. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 98 / 119
  • 125. Integrating Compositional and Non-compositional Semantics Fork nodes: Summarizing history so far to support both compositional and non-compositional paths. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 99 / 119
  • 126. Integrating Compositional and Non-compositional Semantics Merging nodes: Combining information from compositional and non-compositional paths. Binarization Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 100 / 119
  • 127. Integrating Compositional and Non-compositional Semantics Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 101 / 119
  • 128. Integrating Compositional and Non-compositional Semantics Binarization: Binarizing the composition of in-bound paths (we do not worry too much about the order of merging). Now we do not need to design different nodes for different fan-ins; instead, parameters are shared across the whole network. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 101 / 119
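One way to picture the binarised merge: at a merging node, the two incoming LSTM states (one from a compositional path, one from a non-compositional path) are combined by a two-child composition, much like a binary Tree-LSTM cell. This is a hedged sketch of that idea, not the exact DAG-LSTM equations of Zhu et al. (2016):

```python
import torch
import torch.nn as nn

class BinaryMerge(nn.Module):
    """Merge two incoming (h, c) LSTM states Tree-LSTM style (illustrative, not the paper's exact cell)."""
    def __init__(self, d=100):
        super().__init__()
        self.W = nn.Linear(2 * d, 5 * d)   # input gate, output gate, two forget gates, candidate

    def forward(self, h1, c1, h2, c2):
        i, o, f1, f2, g = self.W(torch.cat([h1, h2], dim=-1)).chunk(5, dim=-1)
        i, o, f1, f2 = map(torch.sigmoid, (i, o, f1, f2))
        c = f1 * c1 + f2 * c2 + i * torch.tanh(g)      # keep memory from both incoming paths
        return o * torch.tanh(c), c

merge = BinaryMerge(d=100)
h_comp, c_comp = torch.randn(1, 100), torch.randn(1, 100)   # state arriving via a compositional path
h_holo, c_holo = torch.randn(1, 100), torch.randn(1, 100)   # state arriving via a non-compositional path
h, c = merge(h_comp, c_comp, h_holo, c_holo)
```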
  • 129. Results
Method | SemEval-13 | SemEval-14
Majority baseline | 29.19 | 34.46
Unigram (SVM) | 56.95 | 58.58
3rd best model | 64.86 | 69.95
2nd best model | 65.27 | 70.14
The best model | 69.02 | 70.96
DAG-LSTM | 70.88 | 71.97
Table: Performances of different models in the official evaluation metric (macro F-scores) on the test sets of SemEval-2013 and SemEval-2014 Sentiment Analysis in Twitter in predicting the sentiment of the tweet messages. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 102 / 119
  • 130. Results
Method | SemEval-13 | SemEval-14
DAG-LSTM: Full paths | 70.88 | 71.97
DAG-LSTM: Full – {autoPaths} | 69.36 | 69.27
DAG-LSTM: Full – {triPaths} | 70.16 | 70.77
DAG-LSTM: Full – {triPaths, biPaths} | 69.55 | 69.93
DAG-LSTM: Full – {manuPaths} | 69.88 | 70.58
LSTM without DAG: Full – {autoPaths, manuPaths} | 64.00 | 66.40
Table: Ablation performances (macro-averaged F-scores) of DAG-LSTM with different types of paths being removed. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 103 / 119
  • 131. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 104 / 119
  • 132. Subword Composition Composition can also be performed to learn representations for words from subword components (Botha et al., 2014; Ling et al., 2015; Luong et al., 2015; Kim et al., 2016; Sennrich et al., 2016). Rich morphology: some languages have larger vocabularies than others. Informal text: very coooooool! Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 105 / 119
  • 133. Subword Composition Composition can also be performed to learn representations for words from subword components (Botha et al., 2014; Ling et al., 2015; Luong et al., 2015; Kim et al., 2016; Sennrich et al., 2016). Rich morphology: some languages have larger vocabularies than others. Informal text: very coooooool! Basically, alleviate sparseness! Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 105 / 119
  • 134. Subword Composition Composition can also be performed to learn representations for words from subword components (Botha et al., 2014; Ling et al., 2015; Luong et al., 2015; Kim et al., 2016; Sennrich et al., 2016). Rich morphology: some languages have larger vocabularies than others. Informal text: very coooooool! Basically, alleviate sparseness! One perspective for viewing subword models: Morpheme-based composition: deriving word representations from morphemes. Character-based composition: deriving word representations from characters (pretty effective as well, even when used by itself!) Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 105 / 119
  • 135. Subword Composition Composition can also be performed to learn representations for words from subword components (Botha et al., 2014; Ling et al., 2015; Luong et al., 2015; Kim et al., 2016; Sennrich et al., 2016). Rich morphology: some languages have larger vocabularies than others. Informal text: very coooooool! Basically, alleviate sparseness! One perspective for viewing subword models: Morpheme-based composition: deriving word representations from morphemes. Character-based composition: deriving word representations from characters (pretty effective as well, even when used by itself!) Another perspective (by model architectures): Recursive models Convolutional models Recurrent models Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 105 / 119
  • 136. Subword Composition Composition can also be performed to learn representations for words from subword components (Botha et al., 2014; Ling et al., 2015; Luong et al., 2015; Kim et al., 2016; Sennrich et al., 2016). Rich morphology: some languages have larger vocabularies than others. Informal text: very coooooool! Basically, alleviate sparseness! One perspective for viewing subword models: Morpheme-based composition: deriving word representations from morphemes. Character-based composition: deriving word representations from characters (pretty effective as well, even when used by itself!) Another perspective (by model architectures): Recursive models Convolutional models Recurrent models We will discuss several typical methods here only briefly. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 105 / 119
  • 137. Subword Composition: Recursive Networks Morphological Recursive Neural Networks (Luong et al., 2013): Extending recursive neural networks (Socher et al., 2011b) to learn word representation through composition over morphemes. Assume the availability of morphemic analyses. Each tree node combines a stem vector and an affix vector. Figure. Context insensitive (left) and sensitive (right) Morphological Recursive Neural Networks. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 106 / 119
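The context-insensitive variant can be written in a few lines: each morpheme has a vector, and a parent vector is built from a stem vector and an affix vector with one shared composition layer. A hedged sketch (vocabulary size, dimensions, and the example segmentation are assumptions; Luong et al. (2013) additionally propose the context-sensitive variant):

```python
import torch
import torch.nn as nn

class MorphRNN(nn.Module):
    """Compose a word vector from morpheme vectors with one shared recursive layer."""
    def __init__(self, d=50):
        super().__init__()
        self.W = nn.Linear(2 * d, d)

    def compose(self, parent_so_far, affix_vec):
        return torch.tanh(self.W(torch.cat([parent_so_far, affix_vec], dim=-1)))

morph_emb = nn.Embedding(5000, 50)      # morpheme vocabulary; the size is an assumption
model = MorphRNN(d=50)
un, fortunate, ly = (morph_emb(torch.tensor([i])) for i in (1, 2, 3))
# "unfortunately" composed recursively over an assumed segmentation un + fortunate + ly:
word_vec = model.compose(model.compose(un, fortunate), ly)
```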
  • 138. Subword Composition: Recurrent Networks Bi-directional LSTM for subword composition (Ling et al., 2015). Figure. Character RNN for sub-word composition. Some more details ... Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 107 / 119
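A minimal sketch of the character BiLSTM word model: run a bidirectional LSTM over the characters of a word and combine the two final states into the word representation. Dimensions and the final linear combination are assumptions in the spirit of Ling et al. (2015), not their exact configuration:

```python
import torch
import torch.nn as nn

class CharBiLSTMWord(nn.Module):
    """Build a word vector from its characters with a bidirectional LSTM."""
    def __init__(self, n_chars=128, d_char=32, d_word=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_char)
        self.bilstm = nn.LSTM(d_char, d_word, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * d_word, d_word)

    def forward(self, char_ids):                      # char_ids: (num_words, max_word_length)
        _, (h, _) = self.bilstm(self.char_emb(char_ids))
        fwd, bwd = h[0], h[1]                         # final forward and backward states
        return self.proj(torch.cat([fwd, bwd], dim=-1))

model = CharBiLSTMWord()
word_vecs = model(torch.randint(0, 128, (4, 12)))     # 4 words of up to 12 characters -> (4, 100)
```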
  • 139. Subword Composition: Convolutional Networks Convolutional neural networks for subword composition (Zhang et al., 2015) Figure. Character CNN for sub-word composition. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 108 / 119
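A hedged sketch of the character-CNN idea: embed characters, apply convolutions of a few widths, and max-pool over time. Filter widths and sizes are assumptions; Zhang et al. (2015) operate on much longer character sequences with several stacked convolution and pooling layers:

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Embed characters, convolve with several filter widths, and max-pool over time."""
    def __init__(self, n_chars=128, d_char=16, n_filters=64, widths=(3, 4, 5)):
        super().__init__()
        self.emb = nn.Embedding(n_chars, d_char)
        self.convs = nn.ModuleList(nn.Conv1d(d_char, n_filters, w) for w in widths)

    def forward(self, char_ids):                        # char_ids: (batch, num_characters)
        x = self.emb(char_ids).transpose(1, 2)          # (batch, d_char, num_characters)
        feats = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return torch.cat(feats, dim=-1)                 # (batch, n_filters * len(widths))

model = CharCNN()
rep = model(torch.randint(0, 128, (2, 60)))             # two character sequences -> (2, 192)
```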
  • 140. Subword Composition: Convolutional Networks Convolutional neural networks for subword composition (Zhang et al., 2015) Figure. Character CNN for sub-word composition. In general, subword models have been successfully used in a wide variety of problems such as translation, sentiment analysis, question answering, etc. You should seriously consider them in situations where the OOV rate is high or the word distribution has a long tail. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 108 / 119
  • 141. Outline 1 Introduction Semantic composition Formal methods Simple parametric models 2 Parameterizing Composition Functions Recurrent composition models Recursive composition models Convolutional composition models Unsupervised models 3 Selected Topics Compositionality and non-compositionality Subword composition methods 4 Summary Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 109 / 119
  • 142. Summary The tutorial discusses semantic composition with distributed representations learned with neural networks. Neural networks are able to learn powerful representations and complicated composition functions. The models can achieve state-of-the-art performance on a wide range of NLP tasks. We expect further studies will continue to deepen our understanding of such approaches: Unsupervised models Compositionality with other “ingredients” of intelligence Compositionality in multiple modalities Interpretability of models Distributed vs./and symbolic composition models ... ... Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 110 / 119
  • 143. References I C. E. Osgood, G. J. Suci, and P. H. Tannenbaum. The Measurement of Meaning. University of Illinois Press, 1957. Richard Montague. “English as a Formal Language”. In: Linguaggi nella societa e nella tecnica. Ed. by Bruno Visentini. Edizioni di Communita, 1970, pp. 188–221. G. A. Miller and P. N. Johnson-Laird. Language and perception. Cambridge, MA: Belknap Press, 1976. J. A. Fodor and Z. W. Pylyshyn. “Connectionism and cognitive architecture: A critical analysis”. In: Cognition 28 (1988), pp. 3–71. Jordan B. Pollack. “Recursive Distributed Representations”. In: Artif. Intell. 46.1-2 (1990), pp. 77–105. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 111 / 119
  • 144. References II Ronald J. Williams. “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning”. In: Machine Learning 8 (1992), pp. 229–256. Barbara Partee. “Lexical semantics and compositionality”. In: Invitation to Cognitive Science 1 (1995), pp. 311–360. Elie Bienenstock, Stuart Geman, and Daniel Potter. “Compositionality, MDL Priors, and Object Recognition”. In: NIPS. 1996. Enrico Francesconi et al. “Logo Recognition by Recursive Neural Networks”. In: GREC. 1997. Jeff Mitchell and Mirella Lapata. “Vector-based Models of Semantic Composition”. In: ACL. 2008, pp. 236–244. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 112 / 119
  • 145. References III Richard Socher et al. “Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection”. In: NIPS. 2011, pp. 801–809. Richard Socher et al. “Parsing Natural Scenes and Natural Language with Recursive Neural Networks”. In: ICML. 2011, pp. 129–136. Richard Socher et al. “Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions”. In: EMNLP. 2011. Richard Socher et al. “Semantic Compositionality through Recursive Matrix-Vector Spaces”. In: EMNLP-CoNLL. 2012, pp. 1201–1211. Nal Kalchbrenner and Phil Blunsom. “Recurrent Continuous Translation Models.”. In: EMNLP. Vol. 3. 39. 2013, p. 413. Diederik P. Kingma and Max Welling. “Auto-Encoding Variational Bayes”. In: CoRR abs/1312.6114 (2013). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 113 / 119
  • 146. References IV Thang Luong, Richard Socher, and Christopher D. Manning. “Better Word Representations with Recursive Neural Networks for Morphology”. In: CoNLL. 2013. Saif Mohammad, Svetlana Kiritchenko, and Xiao-Dan Zhu. “NRC-Canada: Building the State-of-the-Art in Sentiment Analysis of Tweets”. In: SemEval@NAACL-HLT. 2013. Richard Socher et al. “Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank”. In: EMNLP. 2013, pp. 1631–1642. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate”. In: arXiv preprint arXiv:1409.0473 (2014). Jan A. Botha and Phil Blunsom. “Compositional Morphology for Word Representations and Language Modelling”. In: ICML. 2014. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 114 / 119
  • 147. References V Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. “A convolutional neural network for modelling sentences”. In: arXiv preprint arXiv:1404.2188 (2014). Ilya Sutskever, Oriol Vinyals, and Quoc V Le. “Sequence to sequence learning with neural networks”. In: Advances in neural information processing systems. 2014, pp. 3104–3112. Xiaodan Zhu et al. “An Empirical Study on the Effect of Negation Words on Sentiment”. In: ACL. 2014. Ryan Kiros et al. “Skip-thought vectors”. In: Advances in neural information processing systems. 2015, pp. 3294–3302. Phong Le and Willem Zuidema. “Compositional Distributional Semantics with Long Short Term Memory”. In: *SEM@NAACL-HLT. 2015. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 115 / 119
  • 148. References VI Wang Ling et al. “Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation”. In: EMNLP. 2015. Thang Luong et al. “Addressing the Rare Word Problem in Neural Machine Translation”. In: ACL. 2015. Kai Sheng Tai, Richard Socher, and Christopher D. Manning. “Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks”. In: ACL. 2015, pp. 1556–1566. Xiang Zhang and Yann LeCun. “Text Understanding from Scratch”. In: CoRR abs/1502.01710 (2015). Xiaodan Zhu, Hongyu Guo, and Parinaz Sobhani. “Neural Networks for Integrating Compositional and Non-compositional Sentiment in Sentiment Composition”. In: *SEM@NAACL-HLT. 2015. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 116 / 119
  • 149. References VII Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. “Long Short-Term Memory Over Recursive Structures”. In: ICML. 2015, pp. 1604–1612. Samuel R Bowman et al. “A fast unified model for parsing and sentence understanding”. In: arXiv preprint arXiv:1603.06021 (2016). Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Nal Kalchbrenner et al. “Neural Machine Translation in Linear Time”. In: CoRR abs/1610.10099 (2016). Yoon Kim et al. “Character-Aware Neural Language Models”. In: AAAI. 2016. B. M. Lake et al. “Building Machines that Learn and Think Like People”. In: Behavioral and Brain Sciences. (in press). (2016). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 117 / 119
  • 150. References VIII Tsendsuren Munkhdalai and Hong Yu. “Neural Tree Indexers for Text Understanding”. In: CoRR abs/1607.04492 (2016). Rico Sennrich, Barry Haddow, and Alexandra Birch. “Neural Machine Translation of Rare Words with Subword Units”. In: ACL. 2016. Dani Yogatama et al. “Learning to Compose Words into Sentences with Reinforcement Learning”. In: CoRR abs/1611.09100 (2016). Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. “DAG-Structured Long Short-Term Memory for Semantic Compositionality”. In: NAACL. 2016. Qian Chen et al. “Enhanced LSTM for Natural Language Inference”. In: ACL. 2017. Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 118 / 119
  • 151. References IX Adina Williams, Nikita Nangia, and Samuel R. Bowman. “A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference”. In: CoRR abs/1704.05426 (2017). Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th , 2017 119 / 119