LSTM Encoder-Decoder Architecture with
Attention Mechanism for Machine Comprehension
Eugene Nho
MBA/MS, Intelligent Systems
Stanford University
enho@stanford.edu
Brian Higgins
BS, Computer Science
Stanford University
bhiggins@stanford.edu
Abstract
Machine comprehension remains a challenging open area of research. While many question answering
models have been explored for existing datasets, little work has been done with the newly released
MS MARCO dataset, which mirrors reality much more closely and poses many unique challenges.
We explore an end-to-end neural architecture with attention mechanisms for comprehending relevant
information and generating text answers for MS MARCO.
1 Introduction
Machine comprehension—building systems that comprehend natural language documents—is a challenging open area
of research within natural language processing (NLP). Since the widespread use of statistical machine learning and
more recently deep learning, its progress has been made largely in lockstep with the introduction of large-scale datasets.
The latest such dataset is MS MARCO [1]. Unlike previous machine comprehension and question answering datasets,
MS MARCO consists of real questions drawn from anonymized Bing queries, and its target answers take the
form of sequences of words. Compared to other answer formats such as multiple choice, single-token answers, or spans of
words, generating free-form text is significantly more challenging. To our knowledge, no literature yet exists on end-to-end
systems specifically addressing this dataset.
In this paper, we explore an attention-based encoder-decoder architecture that comprehends input text and generates
text answers.
2 Related Work
Machine comprehension and question answering tasks went through multiple stages of evolution over the past decade.
Traditionally, these tasks relied on complex NLP pipelines involving steps like syntactic parsing, semantic parsing, and
question classification. With the rise of neural networks, end-to-end neural architectures have increasingly been applied
to comprehension tasks [2][3][4][5]. The evolution of available datasets was a driving force behind the progress in
this domain—from models simply choosing between multiple choice answers [4][5] to those generating single-token
answers [2][3] or span-of-words answers [6].
Throughout this evolution, attention has emerged as a key concept, in particular as the tasks demanded by the datasets became
more complex. Attention was first utilized in the context of non-NLP tasks, such as learning alignments between image
objects and agent actions [7], or between the visual features of an image and its text description in the caption generation task.
Bahdanau et al. (2015) applied the attention mechanism to Neural Machine Translation (NMT) [8], and Luong et al.
(2015) explored different architectures for attention-based NMT, including a global approach and a local approach [9].
Figure 1: Architecture of the encoder-decoder model with attention
Several approaches have been proposed for incorporating attention into machine comprehension and question answering tasks.
Seo et al. (2017) applied bidirectional models with attention to achieve near state-of-the-art results for SQuAD [10].
Wang & Jiang (2016) combined match-LSTM, an attention model originally proposed for text entailment, with Pointer
Net, a sequence-to-sequence model proposed by Vinyals et al. (2015) that constrains output words to the tokens from
input sequences [6][11].
3 Dataset: MS MARCO
MS MARCO is a reading comprehension dataset with a number of characteristics that make it the most realistic
available dataset for question answering. All questions are real anonymized queries from Bing, and the context passages,
from which target answers are derived, are sampled from real web documents, closely mirroring the real-world scenario
of finding an answer to a question on a search engine. The target answers, as mentioned earlier, are sequences of words
and are human-generated.
The dataset contains 100,000 queries. Each query consists of one question, approximately 10 context passages, and target
answers. While most queries have one answer, some have many and some have none. The average lengths of the
questions, passages, and target answers are approximately 15, 85, and 8 tokens, respectively. There is no particular
topical focus, but each query falls into one of the following five categories: description (52.6%), numeric (28.4%), entity
(10.5%), location (5.7%), and person (2.7%).
Given the nature of the text generation task required by the dataset, we use ROUGE-L and BLEU as evaluation metrics.
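For concreteness, the snippet below shows how these two metrics can be computed for a single prediction using the commonly available nltk and rouge-score packages; the choice of libraries and the example strings are our own illustration, not the paper's evaluation code.

```python
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

reference = "rouge l measures the longest common subsequence overlap"
candidate = "rouge l measures longest common subsequence"

# BLEU-1: unigram precision with a brevity penalty.
bleu1 = sentence_bleu([reference.split()], candidate.split(), weights=(1, 0, 0, 0))

# ROUGE-L: F-measure based on the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

print(f"BLEU-1: {bleu1:.3f}  ROUGE-L: {rouge_l:.3f}")
```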
4 Methods
The problem we are trying to solve can be defined formally as follows. For each example in the dataset, we are given a
question and a set of potentially relevant context passages. The question is represented as $Q \in \mathbb{R}^{d \times n}$, where $d$ is the
embedding size and $n$ is the length of the question. The set of passages is represented as $S = \{P_1, P_2, \ldots, P_m\}$, where $m$
is the number of passages. Our objective is to (1) choose the most relevant passage $P$ out of $S$, and (2) generate the
answer consisting of a sequence of words, represented as $R \in \mathbb{R}^{d \times k}$, where $k$ is the length of the answer. Because both
tasks require very similar architectures, this paper will only focus on the second objective, assuming the best passage $P$
has already been selected.
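The short sketch below only pins down the tensor shapes involved in this formulation; the embedding size of 300 is an assumption, and the sequence lengths are the averages reported in Section 3.

```python
import torch

# Illustrative dimensions only: d is assumed; n, l, k use the Section 3 averages.
d, n, l, k = 300, 15, 85, 8
Q = torch.randn(d, n)   # question,       Q in R^{d x n}
P = torch.randn(d, l)   # chosen passage, P in R^{d x l}
R = torch.randn(d, k)   # target answer,  R in R^{d x k}
```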
We use an encoder-decoder architecture with multiple attention mechanisms to achieve this, as shown in Figure 1. The
components of this model, sketched in code after the list below, are:
1. Question Encoder is an LSTM layer mapping the question information to a vector space.
2. Passage Encoder with Attention is an LSTM layer with an attention mechanism connecting the passage
information with the encoded knowledge about the question.
3. Attention Encoder is an extra LSTM layer further distilling the output from the Passage Encoder with
Attention.
4. Decoder is an LSTM layer that takes in information from the three encoders and generates the answer. It has
an attention mechanism connecting to the Attention Encoder.
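The skeleton below is a minimal structural sketch of how these four components might be laid out in PyTorch; the hidden size, embedding size, and vocabulary size are illustrative assumptions, and the attention wiring of Sections 4.2 and 4.4 is deliberately omitted here.

```python
import torch.nn as nn

class EncoderDecoderQA(nn.Module):
    """Structural sketch of Figure 1 (illustrative sizes, attention wiring omitted)."""
    def __init__(self, d=300, h=128, vocab_size=20000):
        super().__init__()
        self.question_encoder = nn.LSTM(d, h)        # Section 4.1
        self.passage_encoder = nn.LSTM(d, h)         # Section 4.2 (attention sits on top of its states)
        self.attention_encoder = nn.LSTM(h, h)       # Section 4.3
        self.decoder = nn.LSTM(d, 3 * h)             # Section 4.4; initial state concatenates the encoders' final states
        self.output_proj = nn.Linear(3 * h, vocab_size)   # softmax projection of Eq. (10)
```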
4.1 Question Encoder
The purpose of the question encoder is to incorporate contextual information from each token of the question to a
vector space. We use a standard unidirectional LSTM [12] layer to process the question, represented as follows:
$$H^q = \overrightarrow{\mathrm{LSTM}}(Q) \qquad (1)$$

The output matrix $H^q \in \mathbb{R}^{h \times n}$ is the hidden state representation of the question, where $h$ is the dimension of the
hidden state. $h^q_i$, the $i$th column of $H^q$, encapsulates the contextual information up to the $i$th token of the
question ($q_i$).
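A minimal sketch of Eq. (1), assuming a PyTorch LSTM over pre-computed word embeddings; the sizes are illustrative rather than the authors' settings.

```python
import torch
import torch.nn as nn

d, h, n = 300, 128, 15                      # illustrative sizes
question_lstm = nn.LSTM(input_size=d, hidden_size=h)

Q = torch.randn(n, 1, d)                    # one embedded question: (tokens, batch=1, d)
Hq, _ = question_lstm(Q)                    # Hq: (n, 1, h); slice i holds h^q_i, the state after token q_i
```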
4.2 Passage Encoder with Attention
The objective of this layer is to capture the information from the passage relevant to the contents of the question. To that
end, it employs a standard LSTM layer with the global attention mechanism proposed by Luong et al. [9]. $P \in \mathbb{R}^{d \times l}$
is the matrix representing the most relevant context passage, where $d$ is the embedding size and $l$ is the length of the
passage. At each time step $t$, this encoder uses the embedding of the $t$th token of the passage and the entire hidden state
representation from the question encoder to capture contextual information up to the $t$th token relevant to the question.
This is represented as follows:
$$\tilde{h}^p_t = \overrightarrow{\mathrm{AttentionLSTM}}(p_t, H^q) \qquad (2)$$

where $\tilde{h}^p_t \in \mathbb{R}^h$ is the hidden vector capturing the information, $p_t \in \mathbb{R}^d$ is the $t$th token of the passage (i.e. the $t$th column
of $P$), and $H^q$ is the matrix representing all the hidden states of the question encoder.
$\overrightarrow{\mathrm{AttentionLSTM}}$ is an abstraction involving the following two steps. First, it captures information from the $t$th token
using a regular LSTM cell, represented as $h^p_t = \overrightarrow{\mathrm{LSTM}}(p_t)$, where $h^p_t \in \mathbb{R}^h$. Second, using $H^q$ and $h^p_t$, it derives a
context vector $c_t \in \mathbb{R}^h$ that captures relevant information from the question. $c_t$ is concatenated with $h^p_t$ to generate $\tilde{h}^p_t$,
as shown below:

$$\tilde{h}^p_t = \overrightarrow{\mathrm{AttentionLSTM}}(p_t, H^q) = \tanh(W_c[h^p_t; c_t] + b_c) \qquad (3)$$

where $W_c \in \mathbb{R}^{h \times 2h}$ is a weight matrix and $b_c \in \mathbb{R}^h$ is a bias vector.
The context vector $c_t$ captures all the hidden states from the question encoder, weighted by how relevant each question
word is to the $t$th passage word. To derive $c_t$, we first calculate the relevance score between the $t$th passage word,
represented by $h^p_t$, and the $j$th question word, represented by $h^q_j$:

$$\mathrm{score}(h^p_t, h^q_j) = (W_s h^p_t + b_s)^\top h^q_j \qquad (4)$$

where $W_s \in \mathbb{R}^{h \times h}$ is a weight matrix and $b_s \in \mathbb{R}^h$ is a bias vector. We then create the attention vector
$a_t = \{a^{(1)}_t, a^{(2)}_t, \ldots, a^{(n)}_t\} \in \mathbb{R}^{1 \times n}$, a row vector whose $j$th element is the softmax value of $\mathrm{score}(h^p_t, h^q_j)$.
$c_t$ is the weighted average of $H^q$ based on the attention vector $a_t$:
$$a^{(j)}_t = \frac{\exp(\mathrm{score}(h^p_t, h^q_j))}{\sum_{\rho=1}^{n} \exp(\mathrm{score}(h^p_t, h^q_\rho))} \qquad (5)$$

$$g_t = H^q \odot \bar{a}_t \qquad (6)$$

$$c_t = \sum_{i=1}^{n} g^{(i)}_t \qquad (7)$$
where $\bar{a}_t \in \mathbb{R}^{h \times n}$ is the vertically broadcast matrix of $a_t$, and $g^{(i)}_t$ is the $i$th column of $g_t$. All the formulas aside, the
important intuition here is that the hidden state of this encoder at each time step $t$ incorporates information about not
only the passage tokens up to that step (as is the case with a normal LSTM hidden state), but also a measure of how
relevant each of the question tokens is to that particular $t$th passage token.
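The function below is a hedged sketch of this global attention step (Eqs. (3)-(7)) for a single passage token; the weight shapes follow Section 4.2, but the code, names, and sizes are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

h, n = 128, 15              # illustrative hidden size and question length
Ws = nn.Linear(h, h)        # W_s, b_s of Eq. (4)
Wc = nn.Linear(2 * h, h)    # W_c, b_c of Eq. (3)

def attention_step(hp_t, Hq):
    """hp_t: (h,) plain-LSTM state for passage token t; Hq: (h, n) question states."""
    scores = Ws(hp_t) @ Hq                            # (n,): score(h^p_t, h^q_j) for every question token, Eq. (4)
    a_t = F.softmax(scores, dim=0)                    # (n,): attention weights, Eq. (5)
    c_t = Hq @ a_t                                    # (h,): context vector, Eqs. (6)-(7)
    return torch.tanh(Wc(torch.cat([hp_t, c_t])))     # \tilde{h}^p_t, Eq. (3)

hp_t, Hq = torch.randn(h), torch.randn(h, n)
h_tilde = attention_step(hp_t, Hq)                    # (h,)
```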
4.3 Attention Encoder
The purpose of this LSTM layer is to further distill the contextual information captured by the passage encoder. It is
a variant of the Match LSTM layer, first proposed by Wang & Jiang (2016) for text entailment and later applied to
question answering by the same authors [6]. We utilize a standard unidirectional LSTM layer taking as input all the
hidden states of the Passage Encoder with Attention, as shown below:
$$H^m = \overrightarrow{\mathrm{LSTM}}(\tilde{H}^p) \qquad (8)$$

where $\tilde{H}^p = \{\tilde{h}^p_1, \tilde{h}^p_2, \ldots, \tilde{h}^p_l\} \in \mathbb{R}^{h \times l}$ is the hidden state representation from the passage encoder.
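A correspondingly small sketch of Eq. (8), again with assumed sizes: a second LSTM simply re-reads the sequence of attended passage states.

```python
import torch
import torch.nn as nn

h, l = 128, 85                                  # illustrative sizes
attention_encoder = nn.LSTM(input_size=h, hidden_size=h)

Hp_tilde = torch.randn(l, 1, h)                 # one \tilde{h}^p_t per passage token
Hm, _ = attention_encoder(Hp_tilde)             # H^m of Eq. (8), consumed by the decoder's attention
```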
4.4 Decoder
We use an LSTM layer with global attention between the decoder and the Attention Encoder to generate the answer.
The purpose of this attention mechanism is to let the decoder "peek" at the relevant information encapsulating the
passage and the question as it generates the answer. In addition, to pass along the contextual information captured by
the encoders, we concatenate the last hidden states from all three encoders and feed the combined matrix as the initial
hidden state to the decoder. At each time step t, the decoder takes as input the embedding of the generated token from
the previous step, and uses the same attention mechanism described in the earlier section to derive the hidden state
representation $\tilde{h}^r_t \in \mathbb{R}^{3h}$:

$$\tilde{h}^r_t = \overrightarrow{\mathrm{AttentionLSTM}}(r_{t-1}, H^m) \qquad (9)$$

where $r_{t-1} \in \mathbb{R}^d$ is the embedding of the word generated in the last time step. Then we apply matrix multiplication and
softmax to generate the output vector $o_t \in \mathbb{R}^V$:

$$o_t = \mathrm{softmax}(U \tilde{h}^r_t + b) \qquad (10)$$

where $V$ is the vocabulary size. $r_t$ is the word embedding corresponding to $o_t$.
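The step function below sketches one greedy decoding step in the spirit of Eqs. (9)-(10); the dot-product attention score, the greedy argmax, and all sizes are simplifications and assumptions on our part, since the paper does not spell out its decoding procedure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, h, V, l = 300, 128, 20000, 85
embedding = nn.Embedding(V, d)
decoder_cell = nn.LSTMCell(d, 3 * h)        # state initialised from the three encoders' final states
Wc = nn.Linear(3 * h + h, 3 * h)            # combines decoder state and the context over H^m
U = nn.Linear(3 * h, V)                     # output projection of Eq. (10)

def decode_step(prev_token, state, Hm):
    """prev_token: (1,) id of the previously generated word; state: decoder (h, c); Hm: (h, l)."""
    h_t, c_t = decoder_cell(embedding(prev_token), state)         # (1, 3h)
    attn = F.softmax(h_t[:, :h] @ Hm, dim=-1)                     # (1, l) weights over H^m (simplified dot score)
    context = (Hm @ attn.t()).t()                                 # (1, h) context vector
    h_tilde = torch.tanh(Wc(torch.cat([h_t, context], dim=-1)))   # analogue of Eq. (9)
    o_t = F.softmax(U(h_tilde), dim=-1)                           # Eq. (10)
    return o_t.argmax(dim=-1), (h_t, c_t)

# One greedy step from an arbitrary state (illustrative only).
state = (torch.zeros(1, 3 * h), torch.zeros(1, 3 * h))
Hm = torch.randn(h, l)
next_token, state = decode_step(torch.tensor([0]), state, Hm)
```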
            Description   Person   Numeric   Location   Entity   Full Data Set
BLEU-1              9.1      3.2       7.4        3.8      7.4             9.3
ROUGE-L            15.1      2.8      13.7        3.4      5.5            12.8

Table 1: Evaluation metrics by question type from the held-out validation set
5 Experiments and Discussion
5.1 Results
A variant of the model described earlier, with a double-layer LSTM decoder, achieved a ROUGE-L of 12.8 and a BLEU of
9.3 on the held-out validation set. While this result is not state-of-the-art, our model outperformed several benchmark
models, such as the Seq2Seq model with Memory Networks [1].

Figure 2: Loss and evaluation metrics

Figure 2 shows the training progression for our model. Loss declines steadily, but the model starts overfitting around
the 10th epoch, where both evaluation metrics on the development set peak. Our classifier for selecting the most
relevant context passage achieved an accuracy of 100% on both the development set and the held-out validation set.
We conducted over 40 runs for hyperparameter optimization, and found the following:
• Larger batch size generally performed better, with a batch size of 256 outperforming 64 (9.3 vs 6.4 BLEU)
• L2 loss outperformed cross entropy (9.3 vs 7.0 BLEU)
• Our model was sensitive to the learning rate, with evaluation metrics dropping precipitously when the rate
approached the 0.01 range (vs. 0.001 or 0.0001)
5.2 Error Analysis
As illustrated by Table 1, our model was much more effective at generating description and numeric answers than
answers about people and locations. Its relative weakness on people and locations most likely stems from the rarity of their
names. The dataset contains a total of approximately 650,000 distinct tokens. Because of memory restrictions, we used a
vocabulary size of 20,000 during training, and the remaining words were mapped to the <unknown> token. This practice
led to lower performance on the person and location questions, which on average contain more uncommon tokens
that fall outside the vocabulary.
We also found that longer answers were much harder to predict. This is a well-known issue in text generation
tasks: the farther into the output sequence the decoder gets, the more diluted the information from the original hidden
state becomes, which makes long sequences of words very challenging to predict.
Lastly, our design decision on how to treat the <end> token in the generated answer impacted our error rate. During
training, we masked the generated answers based on the lengths of the ground truth. Our hypothesis was that this would
encourage the model to generate predictions of the same length as the target answers. However, because of this
decision, the model was not penalized for generating often-irrelevant tokens beyond the length of the target answer,
which led to lower evaluation metrics on the held-out validation set.
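The sketch below illustrates the length-based masking described above; the variable names are hypothetical, and the use of cross entropy here is for illustration only (the experiments above report better results with an L2 loss).

```python
import torch
import torch.nn.functional as F

def masked_loss(logits, targets, target_length):
    """logits: (T, V) decoder outputs; targets: (T,) gold token ids; target_length: gold answer length."""
    T = targets.size(0)
    mask = (torch.arange(T) < target_length).float()                  # 1 up to the gold length, 0 beyond it
    per_token = F.cross_entropy(logits, targets, reduction="none")    # (T,) per-position loss
    return (per_token * mask).sum() / mask.sum().clamp(min=1)         # positions past the gold length contribute nothing
```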
6 Conclusion
We see a few areas of improvement for future work. First, we would like to reduce the burden on the decoder softmax
by limiting the vocabulary to only the tokens that appear in a particular query's question and passage. Predicting the right
token out of 20,000 options is challenging, especially when the model is required to get it right 20 to 30 times in a
row. Second, we would like to train a language model to provide a warm start for our decoder, so that the decoder does not
have to learn to generate sensible sequences of English words while simultaneously learning to respond to the questions and
context passages.
References
[1] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A
Human Generated MAchine Reading COmprehension Dataset. In NIPS 2016.
[2] Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom.
2015. Teaching Machines To Read And Comprehend. In arXiv:1506.03340 [cs.CL].
[3] Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. 2016. Text Understanding With The Attention Sum Reader
Network. In ACL 2016.
[4] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The Goldilocks Principle: Reading Children’s Books with
Explicit Memory Representations. In arXiv:1511.02301 [cs.CL].
[5] Wenpeng Yin, Sebastian Ebert, and Hinrich Schütze. 2016. Attention-Based Convolutional Neural Network For Machine
Comprehension. In arXiv:1602.04341 [cs.CL].
[6] Shuohang Wang and Jing Jiang. 2016. Machine comprehension using Match-LSTM and answer pointer. Under review as a
conference paper at ICLR 2017.
[7] Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. Recurrent models of visual attention. In NIPS.
[8] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. 2014. Neural Machine Translation By Jointly Learning To Align And
Translate. In arXiv:1409.0473 [cs.CL].
[9] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine
Translation. In arXiv:1508.04025 [cs.CL].
[10] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional Attention Flow for Machine
Comprehension. In Proceedings of ICLR 2017.
[11] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer Networks. arXiv:1506.03134 [stat.ML].
[12] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9.8 (1997): 1735-1780.
[13] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine
Comprehension of Text. In Proceedings of EMNLP 2016.
[14] Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. Reasonet: Learning To Stop Reading In Machine Comprehension.
In arXiv:1609.05284 [cs.LG].