Simple and Effective Multi-Paragraph
Reading Comprehension
Christopher Clark, Matt Gardner
(ACL’18)
Paper Reading Fest - 2018/8
Nguyen Phuoc Tat Dat
Question answering task
• Return a span from the given text as the answer
• Challenges:
• Understanding natural language
• Domain knowledge
(Figure: an example of generating answers to questions from a given text; Rajpurkar et al., 2016)
Paper overview
● Title: Simple and Effective Multi-Paragraph Reading Comprehension (ACL’18)
→ http://aclweb.org/anthology/P18-1078
● Authors: Christopher Clark, Matt Gardner
● Abstract:
○ Current question answering (QA) models cannot scale to document or multi-document input.
○ Proposes a method to apply neural paragraph-level QA models to the document-level problem.
● Reasons I chose this paper:
○ My personal research interest in Natural Language Understanding, especially QA systems
○ The proposed method works well with a document retrieval system (search engine), bridging the gap between current single-paragraph QA research and practical QA systems.
Agenda
● Introduction
● Pipeline method
● Confidence method
● Experiments
● Results
● Conclusion
Introduction
Paragraph-level
(Figure: a paragraph-level SQuAD example; https://rajpurkar.github.io/SQuAD-explorer/)

Document-level
(Figure: a document-level example)
Two approaches for document-level QA
● Pipeline approaches:
○ Select one paragraph
○ Extract an answer from the paragraph
● Confidence-based methods:
○ Search multiple paragraphs for answers and produce confidence scores
○ Return the answer with the highest confidence score
Pipeline method
Paragraph selection
● Single document:
○ TF-IDF cosine distance between paragraph & query
○ IDF is computed using single paragraphs in the document
● Multiple documents:
○ Linear classification with the following features for each paragraph:
■ TF-IDF score as above
■ Whether the paragraph is the first in its document or not
■ How many tokens preceded it
■ Number of question words included in the paragraph
● Ground truth: paragraphs that contain at least one answer span
● Train the classifier with this distantly supervised objective on the positive paragraphs; a sketch of the single-document selector follows
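A minimal sketch of the single-document TF-IDF selector, assuming scikit-learn; function and variable names are illustrative, and the paper's exact preprocessing may differ.

```python
# Minimal sketch of the single-document TF-IDF paragraph selector.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_paragraph(paragraphs, question):
    # Fit IDF on this document's paragraphs only, as the slide notes.
    vectorizer = TfidfVectorizer()
    para_vecs = vectorizer.fit_transform(paragraphs)
    q_vec = vectorizer.transform([question])
    # Smallest cosine distance == largest cosine similarity.
    sims = cosine_similarity(q_vec, para_vecs)[0]
    return max(range(len(paragraphs)), key=lambda i: sims[i])
```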
Noisy labels
● Some answer spans may not relate to the question → noisy labels
● Use a summed objective function that optimizes the negative log-likelihood of selecting any correct answer span
● Applied to the start and end tokens of the answer span independently
● The objective for predicting the answer start token (sketched below):

$-\log\left(\sum_{a \in A} p_a\right)$

where A is the set of tokens that start an answer, and $p_i$ is the answer-start probability the model predicts for token i.
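A sketch of the summed start-token objective, assuming PyTorch; `start_logits` and `start_positions` are illustrative names, not from the paper's code.

```python
import torch
import torch.nn.functional as F

def summed_start_loss(start_logits, start_positions):
    # start_logits: (num_tokens,) scores; start_positions: indices in A,
    # the set of tokens that begin any answer span.
    log_probs = F.log_softmax(start_logits, dim=0)            # log p_i
    # -log(sum_{a in A} p_a), computed stably in log space
    return -torch.logsumexp(log_probs[start_positions], dim=0)
```

The same function is applied to the end-token scores with the set of answer-ending tokens.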
The model
(Figure: model architecture over the input text and query text)
Embedding layer:
● Word embedding: pre-trained
● Character-derived word embedding: learned
Preprocess layer:
● Bi-directional GRU
The model: Attention layer
(Figure: input text and query text)
1. Attention between context word i and question word j:

$a_{ij} = w_1 \cdot h_i + w_2 \cdot q_j + w_3 \cdot (h_i \odot q_j)$

where $w_1$, $w_2$, and $w_3$ are learned vectors; $h_i$ and $q_j$ are the vectors for context word i and question word j; and $n_c$, $n_q$ are the lengths of the context and question respectively
2. Compute the attended vector $c_i$ for each context token: $c_i = \sum_{j=1}^{n_q} \mathrm{softmax}_j(a_{ij})\, q_j$
3. Compute the query-to-context vector: $q_c = \sum_{i=1}^{n_c} \mathrm{softmax}_i\big(\max_j a_{ij}\big)\, h_i$
4. Concatenate $[h_i;\ c_i;\ h_i \odot c_i;\ q_c \odot c_i]$
5. Pass through a linear layer with ReLU activations (a sketch of steps 1-4 follows)
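A sketch of the bi-directional attention computation above, assuming PyTorch; shapes and names are illustrative.

```python
import torch

def bidaf_attention(h, q, w1, w2, w3):
    # h: (n_c, d) context vectors; q: (n_q, d) question vectors;
    # w1, w2, w3: learned (d,) weight vectors.
    a = (h @ w1).unsqueeze(1) + (q @ w2).unsqueeze(0) + (h * w3) @ q.T
    c = torch.softmax(a, dim=1) @ q                  # attended vectors c_i
    m = torch.softmax(a.max(dim=1).values, dim=0)    # weights over context
    qc = m @ h                                       # query-to-context q_c
    return torch.cat([h, c, h * c, c * qc], dim=1)   # step 4 concatenation
```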
The model: Self-Attention and Prediction layers
(Figure: input text and query text)
Variational dropout is applied before all GRUs and attention mechanisms at a rate of 0.2.
Self-Attention layer
● Residual-style self-attention
● Bi-directional GRU
● The same attention mechanism as above, applied between the context and itself; the query-to-context vector is not used
Prediction layer
● Start score: Bi-directional GRU, then a linear layer
● End score: a residual Bi-directional GRU branch is added to the input, then passed to another Bi-directional GRU and finally a linear layer
Confidence method
Confidence method
● Span confidence score: the sum of the start and end scores of the span
● At test time (sketched below):
○ Run the model on each paragraph
○ Select the span with the highest confidence score
● Experiment with four approaches to training the confidence model:
○ Shared-Normalization
○ Merge
○ No-Answer Option
○ Sigmoid
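A sketch of the test-time procedure; `model` is a hypothetical callable that returns per-token start and end score lists for one paragraph, and `max_len` matches the span-length cap used later in the implementation details.

```python
def best_span(model, paragraphs, question, max_len=8):
    # Exhaustively score spans up to max_len tokens and keep the best
    # (start score + end score) across all paragraphs.
    best_score, best = float("-inf"), None
    for p_idx, para in enumerate(paragraphs):
        start, end = model(para, question)   # per-token score lists
        for i in range(len(start)):
            for j in range(i, min(i + max_len, len(end))):
                if start[i] + end[j] > best_score:
                    best_score, best = start[i] + end[j], (p_idx, i, j)
    return best, best_score
```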
Shared-Normalization
● Normalize the start and end scores across all paragraphs sampled from the same context with the same normalization factor
● Produces confidence scores that are comparable across paragraphs (sketched below)
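A sketch of the shared normalization, assuming PyTorch; all paragraphs sampled for one question share a single softmax, and the names are illustrative.

```python
import torch

def shared_norm_start_loss(start_logits_per_para, answer_positions):
    # start_logits_per_para: list of 1-D score tensors, one per paragraph
    # of the same question; answer_positions: (para_idx, token_idx) pairs
    # for tokens that begin an answer.
    offsets = [0]
    for logits in start_logits_per_para:
        offsets.append(offsets[-1] + logits.numel())
    flat = torch.cat(start_logits_per_para)      # one softmax over everything
    log_probs = torch.log_softmax(flat, dim=0)
    idx = torch.tensor([offsets[p] + t for p, t in answer_positions])
    return -torch.logsumexp(log_probs[idx], dim=0)  # summed objective as before
```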
Merge
● Concatenate all paragraphs from the same context into one passage
● Add a paragraph separator token with a learned embedding before each paragraph (sketched below)
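A sketch of the merge preprocessing; `[PARA]` is an assumed separator string, and its learned embedding would be handled downstream by the model's embedding layer.

```python
def merge_paragraphs(paragraphs, sep="[PARA]"):
    # paragraphs: list of token lists from the same context.
    merged = []
    for para_tokens in paragraphs:
        merged.append(sep)          # separator token before each paragraph
        merged.extend(para_tokens)
    return merged
```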
No-Answer Option
● For each paragraph, allow the model to return "no-answer"
● Objective function:

$-\log\left(\frac{e^{s_a + g_b}}{\sum_{i,j} e^{s_i + g_j}}\right)$

where $s_j$ and $g_j$ are the start and end scores produced by the model for token j, and a and b are the correct start and end tokens.
● Compute a score z representing the "no-answer" possibility. The objective then becomes (sketched below):

$-\log\left(\frac{(1-\delta)\,e^{z} + \delta\,e^{s_a + g_b}}{e^{z} + \sum_{i,j} e^{s_i + g_j}}\right)$

where δ is 1 if an answer exists and 0 otherwise.
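A log-space sketch of the no-answer objective above, assuming PyTorch; argument names are illustrative.

```python
import torch

def no_answer_loss(s, g, z, a, b, has_answer):
    # s, g: start/end score vectors; z: scalar no-answer score;
    # a, b: gold start/end indices; has_answer: the delta in the slide.
    pair_scores = (s.unsqueeze(1) + g.unsqueeze(0)).flatten()  # all s_i + g_j
    log_denom = torch.logsumexp(torch.cat([pair_scores, z.view(1)]), dim=0)
    log_num = s[a] + g[b] if has_answer else z
    return log_denom - log_num      # equals -log(numerator / denominator)
```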
Sigmoid
● Sigmoid loss objective function
● Start/end probability for each token: apply a sigmoid function to its start/end score
● A cross-entropy loss is applied to each individual probability
● Scores are computed independently → comparable across paragraphs (sketched below)
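A sketch of the per-token sigmoid loss for the start scores, assuming PyTorch; the end scores are handled the same way.

```python
import torch.nn.functional as F

def sigmoid_start_loss(start_logits, start_labels):
    # start_labels: 1.0 where a token begins an answer span, else 0.0.
    # Each token's probability is independent, so scores stay
    # comparable across paragraphs.
    return F.binary_cross_entropy_with_logits(start_logits, start_labels)
```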
Experiments
Datasets
● TriviaQA (Joshi et al., 2017)
○ TriviaQA unfiltered: questions paired with documents found by running a web search on the questions
○ TriviaQA wiki: the same dataset, but including only Wikipedia articles
○ TriviaQA web: derived from TriviaQA unfiltered by treating each question-document pair where the document contains the answer as an individual training point
● SQuAD (Rajpurkar et al., 2016)
○ A collection of Wikipedia articles and crowdsourced questions
Implementation details
● GloVe word vectors: 300-dimensional
● On SQuAD: 100-dim GRUs, 200-dim linear layers, batch size 45
● On TriviaQA: 140-dim GRUs, 280-dim linear layers, batch size 60
● Optimizer: Adadelta
● During training: maintain an exponential moving average of the weights with a decay rate of 0.999 (sketched below)
● At test time: select the most probable answer span of length at most 8 for TriviaQA and 17 for SQuAD
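A sketch of the weight EMA, assuming PyTorch parameter tensors; it would be called after each optimizer step, and evaluation would use the averaged copies.

```python
import torch

@torch.no_grad()
def update_ema(ema_params, model_params, decay=0.999):
    # ema <- decay * ema + (1 - decay) * current weights
    for ema_p, p in zip(ema_params, model_params):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)
```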
Results
Evaluation scores
● Exact match
○ The fraction of predictions that exactly match any one of the ground-truth answers for the question
● Macro-averaged F1-score
○ Treat the prediction and each ground-truth answer as bags of tokens, then compute their F1 (sketched below)
○ Take the maximum F1 over all ground-truth answers for a given question, then average over all questions
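A sketch of the token-level F1 computation; `prediction_tokens` and `all_truths` are illustrative names.

```python
from collections import Counter

def f1_score(prediction_tokens, truth_tokens):
    # Bag-of-tokens overlap between one prediction and one ground truth.
    common = Counter(prediction_tokens) & Counter(truth_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

def question_f1(prediction_tokens, all_truths):
    # Max over the ground-truth answers; averaged over questions elsewhere.
    return max(f1_score(prediction_tokens, t) for t in all_truths)
```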
TriviaQA web
TriviaQA web & wiki
TriviaQA Unfiltered
SQuAD
Curated TREC
Discussion
● Drawbacks of SQuAD for document-level QA
○ Models trained on SQuAD data perform very poorly in the multi-paragraph setting
○ Reasons:
■ Only paragraph-specific questions are provided
■ All questions are answerable
■ Paragraphs are short
● The shared-norm model keeps performing well even as more paragraphs are added
● The no-answer and merge approaches are effective, but do not provide confidence scores
● The sigmoid objective function reduces paragraph-level performance (Fig. 4) → vulnerable to label noise
Error analysis
● Labeled 200 random TriviaQA web errors of the shared-norm model
● Sources of errors in multi-sentence reading:
1. Connecting multiple statements in the same paragraph
2. Long-range coreference
3. Background knowledge (few cases)
● Suggested directions:
1. Continue advancing sentence- and paragraph-level reading comprehension
2. Add a mechanism to handle document-level coreference
Conclusion
● Proposed techniques:
○ Sampling paragraphs that do not contain an answer
○ Shared-norm objective function
○ Paragraph selection
● This work can be applied to building open-domain question answering systems.
Thank you!
Happy discussion!
