LSTM Encoder-Decoder Architecture with
Attention Mechanism for Machine Comprehension
Eugene Nho
MBA/MS, Intelligent Systems
Stanford University
enho@stanford.edu
Brian Higgins
BS, Computer Science
Stanford University
bhiggins@stanford.edu
Abstract
Machine comprehension remains a challenging open area of research. While many question answering
models have been explored for existing datasets, little work has been done with the newly released
MS MARCO dataset, which mirrors reality much more closely and poses many unique challenges.
We explore an end-to-end neural architecture with attention mechanisms for comprehending relevant
information and generating text answers for MS MARCO.
1 Introduction
Machine comprehension—building systems that comprehend natural language documents—is a challenging open area
of research within natural language processing (NLP). Since the widespread use of statistical machine learning and
more recently deep learning, its progress has been made largely in lockstep with the introduction of large-scale datasets.
The latest such dataset is MS MARCO [1]. Unlike previous machine comprehension and question answering datasets,
MS MARCO consists of real questions drawn from anonymized Bing queries, and its target answers take the
form of sequences of words. Compared to other answer formats such as multiple choice, single-token answers, or spans of
words, generating free-form text is significantly more challenging. To our knowledge, no literature yet exists on end-to-end
systems specifically addressing this dataset.
In this paper, we explore an attention-based encoder-decoder architecture that comprehends input text and generates
text answers.
2 Related Work
Machine comprehension and question answering tasks went through multiple stages of evolution over the past decade.
Traditionally, these tasks relied on complex NLP pipelines involving steps like syntactic parsing, semantic parsing, and
question classification. With the rise of neural networks, end-to-end neural architectures have increasingly been applied
to comprehension tasks [2][3][4][5]. The evolution of available datasets was a driving force behind the progress in
this domain—from models simply choosing between multiple choice answers [4][5] to those generating single-token
answers [2][3] or span-of-words answers [6].
Throughout this evolution, attention has emerged as a key concept, in particular as the tasks demanded by the datasets became
more complex. Attention was first utilized in the context of non-NLP tasks, such as learning alignments between image
objects and agent actions [7], or between the visual features of an image and its text description in the caption generation task.
Bahdanau et al. (2015) applied the attention mechanism to Neural Machine Translation (NMT) [8], and Luong et al.
(2015) explored different architectures for attention-based NMT, including a global approach and a local approach [9].
Figure 1: Architecture of the encoder-decoder model with attention
Several approaches have been proposed for incorporating attention into machine comprehension and question answering tasks.
Seo et al. (2017) applied bidirectional models with attention to achieve near state-of-the-art results for SQuAD [10].
Wang & Jiang (2016) combined match-LSTM, an attention model originally proposed for text entailment, with Pointer
Net, a sequence-to-sequence model proposed by Vinyals et al. (2015) that constrains output words to the tokens from
input sequences [6][11].
3 Dataset: MS MARCO
MS MARCO is a reading comprehension dataset with a number of characteristics that make it the most realistic
available dataset for question answering. All questions are real anonymized queries from Bing, and the context passages,
from which target answers are derived, are sampled from real web documents, closely mirroring the real-world scenario
of finding an answer to a question on a search engine. The target answers, as mentioned earlier, are sequences of words
and are human-generated.
The dataset contains 100,000 queries. Each query consists of one question, approximately 10 context passages, and target
answers. While most queries have one answer, some have many and some have none. The average lengths of the
questions, passages, and target answers are approximately 15, 85, and 8 tokens, respectively. There is no particular
topical focus, but each query falls into one of the following five categories: description (52.6%), numeric (28.4%), entity
(10.5%), location (5.7%), and person (2.7%).
Given the nature of the text generation task required by the dataset, we use ROUGE-L and BLEU as evaluation metrics.
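For concreteness, the snippet below shows how these two metrics can be computed for a single prediction using the commonly available nltk and rouge-score packages; the choice of libraries and the example strings are our own illustration, not the paper's evaluation code.

```python
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

reference = "rouge l measures the longest common subsequence overlap"
candidate = "rouge l measures longest common subsequence"

# BLEU-1: unigram precision with a brevity penalty.
bleu1 = sentence_bleu([reference.split()], candidate.split(), weights=(1, 0, 0, 0))

# ROUGE-L: F-measure based on the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

print(f"BLEU-1: {bleu1:.3f}  ROUGE-L: {rouge_l:.3f}")
```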
4 Methods
The problem we are trying to solve can be defined formally as follows. For each example in the dataset, we are given a
question and a set of potentially relevant context passages. The question is represented as $Q \in \mathbb{R}^{d \times n}$, where $d$ is the
embedding size and $n$ is the length of the question. The set of passages is represented as $S = \{P_1, P_2, \ldots, P_m\}$, where $m$
is the number of passages. Our objective is to (1) choose the most relevant passage $P$ out of $S$, and (2) generate the
answer consisting of a sequence of words, represented as $R \in \mathbb{R}^{d \times k}$, where $k$ is the length of the answer. Because both
tasks require very similar architectures, this paper will only focus on the second objective, assuming the best passage $P$
has already been selected.
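The short sketch below only pins down the tensor shapes involved in this formulation; the embedding size of 300 is an assumption, and the sequence lengths are the averages reported in Section 3.

```python
import torch

# Illustrative dimensions only: d is assumed; n, l, k use the Section 3 averages.
d, n, l, k = 300, 15, 85, 8
Q = torch.randn(d, n)   # question,       Q in R^{d x n}
P = torch.randn(d, l)   # chosen passage, P in R^{d x l}
R = torch.randn(d, k)   # target answer,  R in R^{d x k}
```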
We use an encoder-decoder architecture with multiple attention mechanisms to achieve this, as shown in Figure 1. The
components of this model, sketched in code after the list below, are:
1. Question Encoder is an LSTM layer mapping the question information to a vector space.
2. Passage Encoder with Attention is an LSTM layer with an attention mechanism connecting the passage
information with the encoded knowledge about the question.
3. Attention Encoder is an extra LSTM layer further distilling the output from the Passage Encoder with
Attention.
4. Decoder is an LSTM layer that takes in information from the three encoders and generates the answer. It has
an attention mechanism connecting to the Attention Encoder.
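The skeleton below is a minimal structural sketch of how these four components might be laid out in PyTorch; the hidden size, embedding size, and vocabulary size are illustrative assumptions, and the attention wiring of Sections 4.2 and 4.4 is deliberately omitted here.

```python
import torch.nn as nn

class EncoderDecoderQA(nn.Module):
    """Structural sketch of Figure 1 (illustrative sizes, attention wiring omitted)."""
    def __init__(self, d=300, h=128, vocab_size=20000):
        super().__init__()
        self.question_encoder = nn.LSTM(d, h)        # Section 4.1
        self.passage_encoder = nn.LSTM(d, h)         # Section 4.2 (attention sits on top of its states)
        self.attention_encoder = nn.LSTM(h, h)       # Section 4.3
        self.decoder = nn.LSTM(d, 3 * h)             # Section 4.4; initial state concatenates the encoders' final states
        self.output_proj = nn.Linear(3 * h, vocab_size)   # softmax projection of Eq. (10)
```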
4.1 Question Encoder
The purpose of the question encoder is to incorporate contextual information from each token of the question to a
vector space. We use a standard unidirectional LSTM [12] layer to process the question, represented as follows:
$$H^q = \overrightarrow{\mathrm{LSTM}}(Q) \qquad (1)$$

The output matrix $H^q \in \mathbb{R}^{h \times n}$ is the hidden state representation of the question, where $h$ is the dimension of the
hidden state. $h^q_i$, the $i$th column of $H^q$, encapsulates the contextual information up to the $i$th token of the
question ($q_i$).
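A minimal sketch of Eq. (1), assuming a PyTorch LSTM over pre-computed word embeddings; the sizes are illustrative rather than the authors' settings.

```python
import torch
import torch.nn as nn

d, h, n = 300, 128, 15                      # illustrative sizes
question_lstm = nn.LSTM(input_size=d, hidden_size=h)

Q = torch.randn(n, 1, d)                    # one embedded question: (tokens, batch=1, d)
Hq, _ = question_lstm(Q)                    # Hq: (n, 1, h); slice i holds h^q_i, the state after token q_i
```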
4.2 Passage Encoder with Attention
The objective of this layer is to capture the information from the passage relevant to the contents of the question. To that
end, it employs a standard LSTM layer with the global attention mechanism proposed by Luong et al. [9]. $P \in \mathbb{R}^{d \times l}$
is the matrix representing the most relevant context passage, where $d$ is the embedding size and $l$ is the length of the
passage. At each time step $t$, this encoder uses the embedding of the $t$th token of the passage and the entire hidden state
representation from the question encoder to capture contextual information up to the $t$th token relevant to the question.
This is represented as follows:
$$\tilde{h}^p_t = \overrightarrow{\mathrm{AttentionLSTM}}(p_t, H^q) \qquad (2)$$

where $\tilde{h}^p_t \in \mathbb{R}^h$ is the hidden vector capturing the information, $p_t \in \mathbb{R}^d$ is the $t$th token of the passage (i.e. the $t$th column
of $P$), and $H^q$ is the matrix representing all the hidden states of the question encoder.
$\overrightarrow{\mathrm{AttentionLSTM}}$ is an abstraction involving the following two steps. First, it captures information from the $t$th token
using a regular LSTM cell, represented as $h^p_t = \overrightarrow{\mathrm{LSTM}}(p_t)$, where $h^p_t \in \mathbb{R}^h$. Second, using $H^q$ and $h^p_t$, it derives a
context vector $c_t \in \mathbb{R}^h$ that captures relevant information from the question. $c_t$ is concatenated with $h^p_t$ to generate $\tilde{h}^p_t$,
as shown below:

$$\tilde{h}^p_t = \overrightarrow{\mathrm{AttentionLSTM}}(p_t, H^q) = \tanh(W_c[h^p_t; c_t] + b_c) \qquad (3)$$

where $W_c \in \mathbb{R}^{h \times 2h}$ is a weight matrix and $b_c \in \mathbb{R}^h$ is a bias vector.
The context vector $c_t$ captures all the hidden states from the question encoder, weighted by how relevant each question
word is to the $t$th passage word. To derive $c_t$, we first calculate the relevance score between the $t$th passage word,
represented by $h^p_t$, and the $j$th question word, represented by $h^q_j$:

$$\mathrm{score}(h^p_t, h^q_j) = (W_s h^p_t + b_s)^\top h^q_j \qquad (4)$$

where $W_s \in \mathbb{R}^{h \times h}$ is a weight matrix and $b_s \in \mathbb{R}^h$ is a bias vector. We then create the attention vector
$a_t = \{a^{(1)}_t, a^{(2)}_t, \ldots, a^{(n)}_t\} \in \mathbb{R}^{1 \times n}$, a row vector whose $j$th element is the softmax value of $\mathrm{score}(h^p_t, h^q_j)$.
$c_t$ is the weighted average of $H^q$ based on the attention vector $a_t$:
$$a^{(j)}_t = \frac{\exp(\mathrm{score}(h^p_t, h^q_j))}{\sum_{\rho=1}^{n} \exp(\mathrm{score}(h^p_t, h^q_\rho))} \qquad (5)$$

$$g_t = H^q \odot \bar{a}_t \qquad (6)$$

$$c_t = \sum_{i=1}^{n} g^{(i)}_t \qquad (7)$$
where $\bar{a}_t \in \mathbb{R}^{h \times n}$ is the vertically broadcast matrix of $a_t$, and $g^{(i)}_t$ is the $i$th column of $g_t$. All the formulas aside, the
important intuition here is that the hidden state of this encoder at each time step $t$ incorporates information about not
only the passage tokens up to that step (as is the case with a normal LSTM hidden state), but also a measure of how
relevant each of the question tokens is to that particular $t$th passage token.
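The function below is a hedged sketch of this global attention step (Eqs. (3)-(7)) for a single passage token; the weight shapes follow Section 4.2, but the code, names, and sizes are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

h, n = 128, 15              # illustrative hidden size and question length
Ws = nn.Linear(h, h)        # W_s, b_s of Eq. (4)
Wc = nn.Linear(2 * h, h)    # W_c, b_c of Eq. (3)

def attention_step(hp_t, Hq):
    """hp_t: (h,) plain-LSTM state for passage token t; Hq: (h, n) question states."""
    scores = Ws(hp_t) @ Hq                            # (n,): score(h^p_t, h^q_j) for every question token, Eq. (4)
    a_t = F.softmax(scores, dim=0)                    # (n,): attention weights, Eq. (5)
    c_t = Hq @ a_t                                    # (h,): context vector, Eqs. (6)-(7)
    return torch.tanh(Wc(torch.cat([hp_t, c_t])))     # \tilde{h}^p_t, Eq. (3)

hp_t, Hq = torch.randn(h), torch.randn(h, n)
h_tilde = attention_step(hp_t, Hq)                    # (h,)
```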
4.3 Attention Encoder
The purpose of this LSTM layer is to further distill the contextual information captured by the passage encoder. It is
a variant of the Match LSTM layer, first proposed by Wang & Jiang (2016) for text entailment and later applied to
question answering by the same authors [6]. We utilize a standard unidirectional LSTM layer taking as input all the
hidden states of the Passage Encoder with Attention, as shown below:
$$H^m = \overrightarrow{\mathrm{LSTM}}(\tilde{H}^p) \qquad (8)$$

where $\tilde{H}^p = \{\tilde{h}^p_1, \tilde{h}^p_2, \ldots, \tilde{h}^p_l\} \in \mathbb{R}^{h \times l}$ is the hidden state representation from the passage encoder.
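A correspondingly small sketch of Eq. (8), again with assumed sizes: a second LSTM simply re-reads the sequence of attended passage states.

```python
import torch
import torch.nn as nn

h, l = 128, 85                                  # illustrative sizes
attention_encoder = nn.LSTM(input_size=h, hidden_size=h)

Hp_tilde = torch.randn(l, 1, h)                 # one \tilde{h}^p_t per passage token
Hm, _ = attention_encoder(Hp_tilde)             # H^m of Eq. (8), consumed by the decoder's attention
```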
4.4 Decoder
We use an LSTM layer with global attention between the decoder and the Attention Encoder to generate the answer.
The purpose of this attention mechanism is to let the decoder "peek" at the relevant information encapsulating the
passage and the question as it generates the answer. In addition, to pass along the contextual information captured by
the encoders, we concatenate the last hidden states from all three encoders and feed the combined matrix as the initial
hidden state to the decoder. At each time step t, the decoder takes as input the embedding of the generated token from
the previous step, and uses the same attention mechanism described in the earlier section to derive the hidden state
representation $\tilde{h}^r_t \in \mathbb{R}^{3h}$:

$$\tilde{h}^r_t = \overrightarrow{\mathrm{AttentionLSTM}}(r_{t-1}, H^m) \qquad (9)$$

where $r_{t-1} \in \mathbb{R}^d$ is the embedding of the word generated in the last time step. Then we apply matrix multiplication and
softmax to generate the output vector $o_t \in \mathbb{R}^V$:

$$o_t = \mathrm{softmax}(U \tilde{h}^r_t + b) \qquad (10)$$

where $V$ is the vocabulary size. $r_t$ is the word embedding corresponding to $o_t$.
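The step function below sketches one greedy decoding step in the spirit of Eqs. (9)-(10); the dot-product attention score, the greedy argmax, and all sizes are simplifications and assumptions on our part, since the paper does not spell out its decoding procedure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, h, V, l = 300, 128, 20000, 85
embedding = nn.Embedding(V, d)
decoder_cell = nn.LSTMCell(d, 3 * h)        # state initialised from the three encoders' final states
Wc = nn.Linear(3 * h + h, 3 * h)            # combines decoder state and the context over H^m
U = nn.Linear(3 * h, V)                     # output projection of Eq. (10)

def decode_step(prev_token, state, Hm):
    """prev_token: (1,) id of the previously generated word; state: decoder (h, c); Hm: (h, l)."""
    h_t, c_t = decoder_cell(embedding(prev_token), state)         # (1, 3h)
    attn = F.softmax(h_t[:, :h] @ Hm, dim=-1)                     # (1, l) weights over H^m (simplified dot score)
    context = (Hm @ attn.t()).t()                                 # (1, h) context vector
    h_tilde = torch.tanh(Wc(torch.cat([h_t, context], dim=-1)))   # analogue of Eq. (9)
    o_t = F.softmax(U(h_tilde), dim=-1)                           # Eq. (10)
    return o_t.argmax(dim=-1), (h_t, c_t)

# One greedy step from an arbitrary state (illustrative only).
state = (torch.zeros(1, 3 * h), torch.zeros(1, 3 * h))
Hm = torch.randn(h, l)
next_token, state = decode_step(torch.tensor([0]), state, Hm)
```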
            Description   Person   Numeric   Location   Entity   Full Data Set
BLEU-1              9.1      3.2       7.4        3.8      7.4             9.3
ROUGE-L            15.1      2.8      13.7        3.4      5.5            12.8

Table 1: Evaluation metrics by question type from the held-out validation set
5 Experiments and Discussion
5.1 Results
A variant of the model described earlier, with a double-layer LSTM decoder, achieved a ROUGE-L of 12.8 and a BLEU of
9.3 on the held-out validation set. While this result is not state-of-the-art, our model outperformed several benchmark
models, such as the Seq2Seq model with Memory Networks [1].

Figure 2: Loss and evaluation metrics

Figure 2 shows the training progression for our model. Loss declines steadily, but the model starts overfitting around
the 10th epoch, where both evaluation metrics on the development set peak. Our classifier for selecting the most
relevant context passage achieved an accuracy of 100% on both the development set and the held-out validation set.
We conducted over 40 runs for hyperparameter optimization, and found the following:
• Larger batch size generally performed better, with a batch size of 256 outperforming 64 (9.3 vs 6.4 BLEU)
• L2 loss outperformed cross entropy (9.3 vs 7.0 BLEU)
• Our model was sensitive to the learning rate, with evaluation metrics dropping precipitously when the rate
approached the 0.01 range (vs. 0.001 or 0.0001)
5.2 Error Analysis
As illustrated by Table 1, our model was much more effective at generating description and numeric answers than
answers about people and locations. Its relative weakness on people and locations most likely stems from the rarity of their
names. The dataset contains a total of approximately 650,000 distinct tokens. Because of memory restrictions, we used a
vocabulary size of 20,000 during training, and the remaining words were mapped to the <unknown> token. This practice
led to lower performance on the person and location questions, which on average contain more uncommon tokens
that fall outside the vocabulary.
We also found that longer answers were much harder to predict. This is a well-known issue in text generation
tasks: the farther into the output sequence the decoder gets, the more diluted the information from the original hidden
state becomes, which makes long sequences of words very challenging to predict.
Lastly, our design decision on how to treat the <end> token in the generated answer impacted our error rate. During
training, we masked the generated answers based on the lengths of the ground truth. Our hypothesis was that this would
encourage the model to generate predictions of the same length as the target answers. However, because of this
decision, the model was not penalized for generating often-irrelevant tokens beyond the length of the target answer,
which led to lower evaluation metrics on the held-out validation set.
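The sketch below illustrates the length-based masking described above; the variable names are hypothetical, and the use of cross entropy here is for illustration only (the experiments above report better results with an L2 loss).

```python
import torch
import torch.nn.functional as F

def masked_loss(logits, targets, target_length):
    """logits: (T, V) decoder outputs; targets: (T,) gold token ids; target_length: gold answer length."""
    T = targets.size(0)
    mask = (torch.arange(T) < target_length).float()                  # 1 up to the gold length, 0 beyond it
    per_token = F.cross_entropy(logits, targets, reduction="none")    # (T,) per-position loss
    return (per_token * mask).sum() / mask.sum().clamp(min=1)         # positions past the gold length contribute nothing
```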
6 Conclusion
We see a few areas of improvement for future work. First, we would like to reduce the burden on the decoder softmax
by limiting the vocabulary to only the tokens that appear in a particular query's question and passage. Predicting the right
token out of 20,000 options is challenging, especially when the model is required to get it right 20 to 30 times in a
row. Second, we would like to train a language model to provide a warm start for our decoder, so that the decoder does not
have to learn to generate sensible sequences of English words while simultaneously learning to respond to the questions and
context passages.
References
[1] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A
Human Generated MAchine Reading COmprehension Dataset. In NIPS 2016.
[2] Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom.
2015. Teaching Machines To Read And Comprehend. In arXiv:1506.03340 [cs.CL].
[3] Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. 2016. Text Understanding With The Attention Sum Reader
Network. In ACL 2016.
[4] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The Goldilocks Principle: Reading Children’s Books with
Explicit Memory Representations. In arXiv:1511.02301 [cs.CL].
[5] Wenpeng Yin, Sebastian Ebert, and Hinrich Schütze. 2016. Attention-Based Convolutional Neural Network For Machine
Comprehension. In arXiv:1602.04341 [cs.CL].
[6] Shuohang Wang and Jing Jiang. 2016. Machine comprehension using Match-LSTM and answer pointer. Under review as a
conference paper at ICLR 2017.
[7] Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. Recurrent models of visual attention. In NIPS.
[8] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. 2014. Neural Machine Translation By Jointly Learning To Align And
Translate. In arXiv:1409.0473 [cs.CL].
[9] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine
Translation. In arXiv:1508.04025 [cs.CL].
[10] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional Attention Flow for Machine
Comprehension. In Proceedings of ICLR 2017.
[11] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer Networks. arXiv:1506.03134 [stat.ML].
[12] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9.8 (1997): 1735-1780.
[13] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine
Comprehension of Text. In Proceedings of EMNLP 2016.
[14] Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. Reasonet: Learning To Stop Reading In Machine Comprehension.
In arXiv:1609.05284 [cs.LG].