Frontiers of Natural Language Processing

Frontiers of Natural Language Processing
Deep Learning Indaba 2018, Stellenbosch, South Africa
Sebastian Ruder, Herman Kamper, Panellists, Leaders in NLP, Everyone

Goals of session
1. What is NLP? What are the major developments in the last few
years?
2. What are the biggest open problems in NLP?
3. Get to know the local community and start thinking about
collaborations
1 / 68

What is NLP? What were the major advances?
A Review of the Recent History of NLP

What is NLP? What were the major advances?
A Review of the Recent History of NLP
Sebastian Ruder

Timeline
2001 • Neural language models
2008 • Multi-task learning
2013 • Word embeddings
2013 • Neural networks for NLP
2014 • Sequence-to-sequence models
2015 • Attention
2015 • Memory-based networks
2018 • Pretrained language models
3 / 68

Timeline
2015 • Attention
4 / 68

Neural language models
• Language modeling: predict next word given previous words
• Classic language models: n-grams with smoothing
• First neural language models: feed-forward neural networks that take
into account n previous words
• Initial look-up layer is commonly known as word embedding matrix as
each word corresponds to one vector
[Bengio et al., NIPS ’01; Bengio et al., JMLR ’03] 5 / 68

Neural language models
• Later language models: RNNs and LSTMs [Mikolov et al., Interspeech ’10]
• Many new models in recent years; classic LSTM is still a strong
baseline [Melis et al., ICLR ’18]
• Active research area: What information do language models capture?
• Language modelling: despite its simplicity, core to many later
advances
• Word embeddings: the objective of word2vec is a simpliﬁcation of
language modelling
• Sequence-to-sequence models: predict response word-by-word
• Pretrained language models: representations useful for transfer learning
6 / 68

Timeline
2015 • Attention
7 / 68

Multi-task learning
• Multi-task learning: sharing parameters between models trained on
multiple tasks
[Collobert & Weston, ICML ’08; Collobert et al., JMLR ’11]
8 / 68

Multi-task learning
• [Collobert & Weston, ICML ’08] won Test-of-time Award at ICML 2018
• Paper contained a lot of other inﬂuential ideas:
• Word embeddings
• CNNs for text
9 / 68

Multi-task learning
• Multi-task learning goes back a lot further
[Caruana, ICML ’93; Caruana, ICML ’96]
10 / 68

Multi-task learning
• “Joint learning” / “multi-task learning” used interchangeably
• Now used for many tasks in NLP, either using existing tasks or
“artiﬁcial” auxiliary tasks
• MT + dependency parsing / POS tagging / NER
• Joint multilingual training
• Video captioning + entailment + next-frame prediction [Pasunuru &
Bansal; ACL ’17]
• . . .
11 / 68

Multi-task learning
• Sharing of parameters is typically predeﬁned
• Can also be learned [Ruder et al., ’17]
[Yang et al., ICLR ’17]
12 / 68

Timeline
2015 • Attention
13 / 68

Word embeddings
• Main innovation: pretraining word embedding look-up matrix on a
large unlabelled corpus
• Popularized by word2vec, an eﬃcient approximation to language
modelling
• word2vec comes in two variants: skip-gram and CBOW
[Mikolov et al., ICLR ’13; Mikolov et al., NIPS ’13]
14 / 68

Word embeddings
• Word embeddings pretrained on an unlabelled corpus capture certain
relations between words
[Tensorﬂow tutorial]
15 / 68

Word embeddings
• Pretrained word embeddings have been shown to improve
performance on many downstream tasks [Kim, EMNLP ’14]
• Later methods show that word embeddings can also be learned via
matrix factorization [Pennington et al., EMNLP ’14; Levy et al., NIPS ’14]
• Nothing inherently special about word2vec; classic methods (PMI,
SVD) can also be used to learn good word embeddings from
unlabeled corpora [Levy et al., TACL ’15]
16 / 68

Word embeddings
• Lots of work on word embeddings, but word2vec is still widely used
• Skip-gram has been applied to learn representations in many other
settings, e.g. sentences [Le & Mikolov, ICML ’14; Kiros et al., NIPS ’15],
networks [Grover & Leskovec, KDD ’16], biological sequences [Asgari & Mofrad,
PLoS One ’15], etc.
17 / 68

Word embeddings
• Projecting word embeddings of diﬀerent languages into the same
space enables (zero-shot) cross-lingual transfer [Ruder et al., JAIR ’18]
[Luong et al., ’15]
18 / 68

Timeline
2015 • Attention
19 / 68

Neural networks for NLP
• Key challenge for neural networks: dealing with dynamic input
sequences
• Three main model types
• Recurrent neural networks
• Convolutional neural networks
• Recursive neural networks
20 / 68

Recurrent neural networks
• Vanilla RNNs [Elman, CogSci ’90] are typically not used as gradients
vanish or explode with longer inputs
• Long-short term memory networks [Hochreiter & Schmidhuber, NeuComp ’97]
are the model of choice
[Olah, ’15]
21 / 68

Convolutional neural networks
• 1D adaptation of convolutional neural networks for images
• Filter is moved along temporal dimension
[Kim, EMNLP ’14]
22 / 68

Convolutional neural networks
• More parallelizable than RNNs, focus on local features
• Can be extended with wider receptive ﬁelds (dilated convolutions) to
capture wider context [Kalchbrenner et al., ’17]
• CNNs and LSTMs can be combined and stacked [Wang et al., ACL ’16]
• Convolutions can be used to speed up an LSTM [Bradbury et al., ICLR ’17]
23 / 68

Recursive neural networks
• Natural language is inherently hierarchical
• Treat input as tree rather than as a sequence
• Can also be extended to LSTMs [Tai et al., ACL ’15]
[Socher et al., EMNLP ’13]
24 / 68

Other tree-based based neural networks
• Word embeddings based on dependencies [Levy and Goldberg, ACL ’14]
• Language models that generate words based on a syntactic stack [Dyer
et al., NAACL ’16]
• CNNs over a graph (trees), e.g. graph-convolutional neural networks
[Bastings et al., EMNLP ’17]
25 / 68

Timeline
2015 • Attention
26 / 68

Sequence-to-sequence models
• General framework for applying neural networks to tasks where output
is a sequence
• Killer application: Neural Machine Translation
• Encoder processes input word by word; decoder then predicts output
word by word
[Sutskever et al., NIPS ’14]
27 / 68

• Go-to framework for natural language generation tasks
• Output can not only be conditioned on a sequence, but on arbitrary
representations, e.g. an image for image captioning
[Vinyals et al., CVPR ’15]
28 / 68

• Even applicable to structured prediction tasks, e.g. constituency
parsing [Vinyals et al., NIPS ’15], named entity recognition [Gillick et al.,
NAACL ’16], etc. by linearizing the output
[Vinyals et al., NIPS ’15]
29 / 68

• Typically RNN-based, but other encoders and decoders can be used
• New architectures mainly coming out of work in Machine Translation
• Recent models: Deep LSTM [Wu et al., ’16], Convolutional encoders
[Kalchbrenner et al., arXiv ’16; Gehring et al., arXiv ’17], Transformer [Vaswani et al.,
NIPS ’17], Combination of LSTM and Transformer [Chen et al., ACL ’18]
30 / 68

Timeline
2015 • Attention
31 / 68

Attention
• One of the core innovations in Neural Machine Translation
• Weighted average of source sentence hidden states
• Mitigates bottleneck of compressing source sentence into a single
vector
[Bahdanau et al., ICLR ’15]
32 / 68

Attention
• Diﬀerent forms of attention available [Luong et al., EMNLP ’15]
• Widely applicable: constituency parsing [Vinyals et al., NIPS ’15], reading
comprehension [Hermann et al., NIPS ’15], one-shot learning [Vinyals et al.,
NIPS ’16], image captioning [Xu et al., ICML ’15]
[Xu et al., ICML ’15]
33 / 68

Attention
• Not only restricted to looking at an another sequence
• Can be used to obtain more contextually sensitive word
representations by attending to the same sequence → self-attention
• Used in Transformer [Vaswani et al., NIPS ’17], state-of-the-art architecture
for machine translation
34 / 68

Timeline
2015 • Attention
35 / 68

Memory-based neural networks
• Attention can be seen as fuzzy memory
• Models with more explicit memory have been proposed
• Different variants: Neural Turing Machine [Graves et al., arXiv ’14],
Memory Networks [Weston et al., ICLR ’15] and End-to-end Memory
Networks [Sukhbaatar et al., NIPS ’15], Dynamic Memory Networks [Kumar et
al., ICML ’16], Neural Differentiable Computer [Graves et al., Nature ’16],
Recurrent Entity Network [Henaff et al., ICLR ’17]
36 / 68

Memory-based neural networks
• Memory is typically accessed based on similarity to current state
similar to attention; can be written to and read from
• End-to-end Memory Networks [Sukhbaatar et al., NIPS ’15] process input
multiple times and update memory
• Neural Turing Machines also have a location-based addressing; can
learn simple computer programs like sorting
• Memory can be a knowledge base or populated based on input
37 / 68

Timeline
2015 • Attention
38 / 68

Pretrained language models
• Word embeddings are context-agnostic, only used to initialize ﬁrst
layer
• Use better representations for initialization or as features
• Language models pretrained on a large corpus capture a lot of
additional information
• Language model embeddings can be used as features in a target
model [Peters et al., NAACL ’18] or a language model can be ﬁne-tuned on
target task data [Howard & Ruder, ACL ’18]
39 / 68

• Adding language model embeddings gives a large improvement over
state-of-the-art across many diﬀerent tasks
[Peters et al., ’18]
40 / 68

• Enables learning models with signiﬁcantly less data
• Additional beneﬁt: Language models only require unlabelled data
• Enables application to low-resource languages where labelled data is
scarce
41 / 68

Other milestones
• Character-based representations
• Use a CNN/LSTM over characters to obtain a character-based word
representation
• First used for sequence labelling tasks [Lample et al., NAACL ’16; Plank et
al., ACL ’16]; now widely used
• Even fully character-based NMT [Lee et al., TACL ’17]
• Adversarial learning
• Adversarial examples are becoming widely used [Jia & Liang, EMNLP ’17]
• (Virtual) adversarial training [Miyato et al., ICLR ’17; Yasunaga et al., NAACL
’18] and domain-adversarial loss [Ganin et al., JMLR ’16; Kim et al., ACL ’17]
are useful forms of regularization
• GANs are used, but not yet too eﬀective for NLG [Semeniuta et al., ’18]
• Reinforcement learning
• Useful for tasks with a temporal dependency, e.g. selecting data [Fang &
Cohn, EMNLP ’17; Wu et al., NAACL ’18] and dialogue [Liu et al., NAACL ’18]
• Also eﬀective for directly optimizing a surrogate loss (ROUGE, BLEU)
for summarization [Paulus et al., ICLR ’18; ] or MT [Ranzato et al., ICLR ’16]
42 / 68

The Biggest Open Problems in NLP

The Biggest Open Problems in NLP
Sebastian
Ruder
Jade
Abbott
Stephan
Gouws
Omoju
Miller
Bernardt
Duvenhage

The biggest open problems: Answers from experts
Hal Daumé III Barbara Plank Miguel Ballesteros Anders Søgaard
Manaal Faruqui Mikel Artetxe Sebastian Riedel Isabelle Augenstein
Bernardt Duvenhage Lea Frermann Brink van der Merwe Karen
Livescu Jan Buys Kevin Gimpel Christine de Kock Alta de
Waal Michael Roth Maletěabisa Molapo Annie Louise Chris Dyer
Yoshua Bengio Felix Hill Kevin Knight Richard Socher George
Dahl Dirk Hovy Kyunghyun Cho
44 / 68

We asked the experts:
What are the three biggest open problems
in NLP at the moment?

The biggest open problems in NLP
1. Natural language understanding
2. NLP for low-resource scenarios
3. Reasoning about large or multiple documents
4. Datasets, problems and evaluation
46 / 68

Problem 1: Natural language understanding
• Many experts argued that this is central, also for generation
• Almost none of our current models have “real” understanding
• What (biases, structure) should we build explicitly into our models?
• Models should incorporate common sense
• Dialogue systems (and chat bots) were mentioned in several responses
47 / 68

Article: Nicola Tesla
Paragraph: In January 1880, two of Tesla’s uncles put together enough
money to help him leave Gospić for Prague where he was to study.
Unfortunately, he arrived too late to enroll at Charles-Ferdinand University;
he never studied Greek, a required subject; and he was illiterate in Czech,
another required subject. Tesla did, however, attend lectures at the
university, although, as an auditor, he did not receive grades for the
courses.
48 / 68

courses.
Question: What city did Tesla move to in 1880?
48 / 68

courses.
Answer: Prague
48 / 68

courses.
Answer: Prague
Model predicts: Prague
48 / 68

courses. Tadakatsu moved to the city of Chicago in 1881.
Answer:
Model predicts:
48 / 68

Answer: Prague
Model predicts:
48 / 68

Answer: Prague
Model predicts: Chicago
48 / 68

[Jia and Liang, EMNLP’17]
49 / 68

I think the biggest open problems are all related to natural language
understanding. . . we should develop systems that read and
understand text the way a person does, by forming a representation of
the world of the text, with the agents, objects, settings, and the
relationships, goals, desires, and beliefs of the agents, and everything else
that humans create to understand a piece of text.
Until we can do that, all of our progress is in improving
our systems’ ability to do pattern matching. Pattern
matching can be very eﬀective for developing products
and improving people’s lives, so I don’t want to
denigrate it, but . . .
— Kevin Gimpel
50 / 68

Questions to panellists/audience:
• To achieve NLU, is it important to build models that process
language “the way a person does”?
51 / 68

• How do you think we would go about doing this?
51 / 68

• Do we need inductive biases or can we expect models to learn
everything from enough data?
51 / 68

• Do we need inductive biases or can we expect models to learn
everything from enough data?
• Questions from audience
51 / 68

Problem 2: NLP for low-resource scenarios
52 / 68

• Generalisation beyond the training data
52 / 68

• Generalisation beyond the training data – relevant everywhere!
52 / 68

• Domain-transfer, transfer learning, multi-task learning
• Learning from small amounts of data
• Semi-supervised, weakly-supervised, “Wiki-ly” supervised,
distantly-supervised, lightly-supervised, minimally-supervised
52 / 68

• Domain-transfer, transfer learning, multi-task learning
• Learning from small amounts of data
• Semi-supervised, weakly-supervised, “Wiki-ly” supervised,
distantly-supervised, lightly-supervised, minimally-supervised
• Unsupervised learning
52 / 68

Word translation without parallel data:
[Conneau et al., ICLR’18]
53 / 68

[Chung et al., arXiv’18]
54 / 68

• Is it necessary to develop specialised NLP tools for speciﬁc languages,
or is it enough to work on general NLP?
55 / 68

• Since there is inherently only small amounts of text available for
under-resourced languages, the beneﬁts of NLP in such settings will
also be limited. Agree or disagree?
55 / 68

• Since there is inherently only small amounts of text available for
under-resourced languages, the beneﬁts of NLP in such settings will
also be limited. Agree or disagree?
• Unsupervised learning vs. transfer learning from high-resource
languages?
55 / 68

Problem 3: Reasoning about large or multiple
documents
• Related to understanding
• How do we deal with large contexts?
• Can be either text or spoken documents
• Again incorporating common sense is essential
56 / 68

documents
Example from NarrativeQA dataset:
[Kočiský et al., TACL’18]
57 / 68

documents
• Do we need better models or just train on more data?
58 / 68

Problem 4: Datasets, problems and evaluation
Perhaps the biggest problem is to properly deﬁne the
problems themselves. And by properly deﬁning a
problem, I mean building datasets and evaluation
procedures that are appropriate to measure our
progress towards concrete goals. Things would be
easier if we could reduce everything to Kaggle style
competitions! — Mikel Artetxe
. . . basic resources (e.g. stop word lists) — Alta de Waal
59 / 68

https://rma.nwu.ac.za
60 / 68

• What are the most important NLP problems that should be tackled
for societies in Africa?
61 / 68

• How do we make sure that we don’t overﬁt to our benchmarks?
61 / 68

We asked the experts a few more questions:

What, if anything, has led the ﬁeld in the
wrong direction?

What has led the ﬁeld in the wrong direction?
• “Synthetic data/synthetic problems” — Hal Daumé III
• “Benchmark/leaderboard chasing” — Sebastian Riedel
• “Obsession of . . . beating the state of the art through “neural
architecture search” — Isabelle Augenstein
63 / 68

• “Chomskyan theories of linguistics instead of corpus linguistics”
— Brink van der Merwe
63 / 68

• “Not incorporating enough Chomskyan theory into our models”
— Someone Else
63 / 68

— Someone Else
• “Too much emphasis on Bayesian methods (sorry :)”— Karen Livescu
63 / 68

— Someone Else
• “Too much emphasis on Bayesian methods (sorry :)”— Karen Livescu
• “Haha, as if the ﬁeld as a whole moved in a single direction”
— Michael Roth
63 / 68

I don’t think there is anything like that. We can learn
from “wrong” directions and “correct” directions, if
such a thing even exists.
— Miguel Ballesteros
Anything new will temporarily lead the ﬁeld in the
wrong direction, I guess, but upon returning, we may
nevertheless have pushed research horizons.
— Anders Søgaard
Sentiment shared in many of the other responses
64 / 68

What advice would you give a
postgraduate student in NLP
starting their project now?

What advice would you give a postgraduate
student in NLP starting their project now?
Do not limit yourself to reading NLP papers. Read a lot
of machine learning, deep learning, reinforcement learning
papers. A PhD is a great time in one’s life to go for a
big goal, and even small steps towards that will be valued.
— Yoshua Bengio
Learn how to tune your models, learn how to make
strong baselines, and learn how to build baselines that
test particular hypotheses. Don’t take any single paper
too seriously, wait for its conclusions to show up more
than once. — George Dahl
66 / 68

What advice would you give a postgraduate
student in NLP starting their project now?
i believe scientiﬁc pursuit is meant to be full of failures.
. . . if every idea works out, it’s either (a) you’re not
ambitious enough, (b) you’re subconciously cheating
yourself, or (c) you’re a genius, the last of which i heard
happens only once every century or so. so, don’t despair!
— Kyunghyun Cho
Understand psychology and the core problems of semantic
cognition. Read . . . Go to CogSci. Understand machine
learning. Go to NIPS. Don’t worry about ACL. Submit
something terrible (or even good, if possible) to a
workshop as soon as you can. You can’t learn how to do
these things without going through the process. — Felix Hill
67 / 68

Summary of session
• What is NLP? What are the major developments in the last few
years?
• What are the biggest open problems in NLP?
• Get to know the local community and start thinking about
collaborations
68 / 68

Summary of session
• What is NLP? What are the major developments in the last few
years?
• What are the biggest open problems in NLP?
• Get to know the local community and start thinking about
collaborations
• We now have the closing ceremony, so eat and chat!
68 / 68

Frontiers of Natural Language Processing

More Related Content

What's hot (20)

Similar to Frontiers of Natural Language Processing (20)

More from Sebastian Ruder (20)

Recently uploaded (20)

Frontiers of Natural Language Processing