Simple and Effective Multi-Paragraph
Reading Comprehension
Christopher Clark, Matt Gardner
(ACL’18)
Paper Reading Fest - 2018/8
Nguyen Phuoc Tat Dat
Question answering task
• Return a span from the given text as the answer
• Challenges:
• Understanding natural language
• Domain knowledge
(Figure: an example of generating answers to questions from a given text; Rajpurkar et al., 2016)
Paper overview
● Title: Simple and Effective Multi-Paragraph Reading Comprehension (ACL’18)
→ http://aclweb.org/anthology/P18-1078
● Authors: Christopher Clark, Matt Gardner
● Abstract:
○ Current question answering (QA) models cannot scale to document or multi-document input.
○ Proposes a method to apply neural paragraph-level QA models to the document-level problem.
● Reasons I chose this paper:
○ My personal research interest in Natural Language Understanding, especially QA systems
○ The proposed method works well with a document retrieval system (search engine), bridging the gap between current single-paragraph QA research and practical QA systems.
Agenda
● Introduction
● Pipeline method
● Confidence method
● Experiments
● Results
● Conclusion
Introduction
Paragraph-level
(Figure: a paragraph-level SQuAD example; https://rajpurkar.github.io/SQuAD-explorer/)

Document-level
(Figure: a document-level example)
Two approaches for document-level QA
● Pipeline approaches:
○ Select one paragraph
○ Extract an answer from the paragraph
● Confidence-based methods:
○ Search multiple paragraphs for answers and produce confidence scores
○ Return the answer with the highest confidence score
Pipeline method
Paragraph selection
● Single document:
○ TF-IDF cosine distance between paragraph & query
○ IDF is computed using single paragraphs in the document
● Multiple documents:
○ Linear classification with the following features for each paragraph:
■ TF-IDF score as above
■ Whether the paragraph is the first in its document or not
■ How many tokens preceded it
■ Number of question words included in the paragraph
● Ground truth: paragraphs that contain at least one answer span
● Train the classifier with this distantly supervised objective on the positive paragraphs; a sketch of the single-document selector follows
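A minimal sketch of the single-document TF-IDF selector, assuming scikit-learn; function and variable names are illustrative, and the paper's exact preprocessing may differ.

```python
# Minimal sketch of the single-document TF-IDF paragraph selector.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_paragraph(paragraphs, question):
    # Fit IDF on this document's paragraphs only, as the slide notes.
    vectorizer = TfidfVectorizer()
    para_vecs = vectorizer.fit_transform(paragraphs)
    q_vec = vectorizer.transform([question])
    # Smallest cosine distance == largest cosine similarity.
    sims = cosine_similarity(q_vec, para_vecs)[0]
    return max(range(len(paragraphs)), key=lambda i: sims[i])
```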
Noisy labels
● Some answer spans may not relate to the question → noisy labels
● Use a summed objective function that optimizes the negative log-likelihood of selecting any correct answer span
● Applied to the start and end tokens of the answer span independently
● The objective for predicting the answer start token (sketched below):

$-\log\left(\sum_{a \in A} p_a\right)$

where A is the set of tokens that start an answer, and $p_i$ is the answer-start probability the model predicts for token i.
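A sketch of the summed start-token objective, assuming PyTorch; `start_logits` and `start_positions` are illustrative names, not from the paper's code.

```python
import torch
import torch.nn.functional as F

def summed_start_loss(start_logits, start_positions):
    # start_logits: (num_tokens,) scores; start_positions: indices in A,
    # the set of tokens that begin any answer span.
    log_probs = F.log_softmax(start_logits, dim=0)            # log p_i
    # -log(sum_{a in A} p_a), computed stably in log space
    return -torch.logsumexp(log_probs[start_positions], dim=0)
```

The same function is applied to the end-token scores with the set of answer-ending tokens.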
The model
(Figure: model architecture over the input text and query text)
Embedding layer:
● Word embedding: pre-trained
● Character-derived word embedding: learned
Preprocess layer:
● Bi-directional GRU
The model: Attention layer
(Figure: input text and query text)
1. Attention between context word i and question word j:

$a_{ij} = w_1 \cdot h_i + w_2 \cdot q_j + w_3 \cdot (h_i \odot q_j)$

where $w_1$, $w_2$, and $w_3$ are learned vectors; $h_i$ and $q_j$ are the vectors for context word i and question word j; and $n_c$, $n_q$ are the lengths of the context and question respectively
2. Compute the attended vector $c_i$ for each context token: $c_i = \sum_{j=1}^{n_q} \mathrm{softmax}_j(a_{ij})\, q_j$
3. Compute the query-to-context vector: $q_c = \sum_{i=1}^{n_c} \mathrm{softmax}_i\big(\max_j a_{ij}\big)\, h_i$
4. Concatenate $[h_i;\ c_i;\ h_i \odot c_i;\ q_c \odot c_i]$
5. Pass through a linear layer with ReLU activations (a sketch of steps 1-4 follows)
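A sketch of the bi-directional attention computation above, assuming PyTorch; shapes and names are illustrative.

```python
import torch

def bidaf_attention(h, q, w1, w2, w3):
    # h: (n_c, d) context vectors; q: (n_q, d) question vectors;
    # w1, w2, w3: learned (d,) weight vectors.
    a = (h @ w1).unsqueeze(1) + (q @ w2).unsqueeze(0) + (h * w3) @ q.T
    c = torch.softmax(a, dim=1) @ q                  # attended vectors c_i
    m = torch.softmax(a.max(dim=1).values, dim=0)    # weights over context
    qc = m @ h                                       # query-to-context q_c
    return torch.cat([h, c, h * c, c * qc], dim=1)   # step 4 concatenation
```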
The model: Self-Attention and Prediction layers
(Figure: input text and query text)
Variational dropout is applied before all GRUs and attention mechanisms at a rate of 0.2.
Self-Attention layer
● Residual-style self-attention
● Bi-directional GRU
● The same attention mechanism as above, applied between the context and itself; the query-to-context vector is not used
Prediction layer
● Start score: Bi-directional GRU, then a linear layer
● End score: a residual Bi-directional GRU branch is added to the input, then passed to another Bi-directional GRU and finally a linear layer
Confidence method
Confidence method
● Span confidence score: the sum of the start and end scores of the span
● At test time (sketched below):
○ Run the model on each paragraph
○ Select the span with the highest confidence score
● Experiment with four approaches to training the confidence model:
○ Shared-Normalization
○ Merge
○ No-Answer Option
○ Sigmoid
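A sketch of the test-time procedure; `model` is a hypothetical callable that returns per-token start and end score lists for one paragraph, and `max_len` matches the span-length cap used later in the implementation details.

```python
def best_span(model, paragraphs, question, max_len=8):
    # Exhaustively score spans up to max_len tokens and keep the best
    # (start score + end score) across all paragraphs.
    best_score, best = float("-inf"), None
    for p_idx, para in enumerate(paragraphs):
        start, end = model(para, question)   # per-token score lists
        for i in range(len(start)):
            for j in range(i, min(i + max_len, len(end))):
                if start[i] + end[j] > best_score:
                    best_score, best = start[i] + end[j], (p_idx, i, j)
    return best, best_score
```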
Shared-Normalization
● Normalize the start and end scores across all paragraphs sampled from the same context with the same normalization factor
● Produces confidence scores that are comparable across paragraphs (sketched below)
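A sketch of the shared normalization, assuming PyTorch; all paragraphs sampled for one question share a single softmax, and the names are illustrative.

```python
import torch

def shared_norm_start_loss(start_logits_per_para, answer_positions):
    # start_logits_per_para: list of 1-D score tensors, one per paragraph
    # of the same question; answer_positions: (para_idx, token_idx) pairs
    # for tokens that begin an answer.
    offsets = [0]
    for logits in start_logits_per_para:
        offsets.append(offsets[-1] + logits.numel())
    flat = torch.cat(start_logits_per_para)      # one softmax over everything
    log_probs = torch.log_softmax(flat, dim=0)
    idx = torch.tensor([offsets[p] + t for p, t in answer_positions])
    return -torch.logsumexp(log_probs[idx], dim=0)  # summed objective as before
```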
Merge
● Concatenate all paragraphs from the same context into one passage
● Add a paragraph separator token with a learned embedding before each paragraph (sketched below)
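A sketch of the merge preprocessing; `[PARA]` is an assumed separator string, and its learned embedding would be handled downstream by the model's embedding layer.

```python
def merge_paragraphs(paragraphs, sep="[PARA]"):
    # paragraphs: list of token lists from the same context.
    merged = []
    for para_tokens in paragraphs:
        merged.append(sep)          # separator token before each paragraph
        merged.extend(para_tokens)
    return merged
```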
No-Answer Option
● For each paragraph, allow the model to return "no-answer"
● Objective function:

$-\log\left(\frac{e^{s_a + g_b}}{\sum_{i,j} e^{s_i + g_j}}\right)$

where $s_j$ and $g_j$ are the start and end scores produced by the model for token j, and a and b are the correct start and end tokens.
● Compute a score z representing the "no-answer" possibility. The objective then becomes (sketched below):

$-\log\left(\frac{(1-\delta)\,e^{z} + \delta\,e^{s_a + g_b}}{e^{z} + \sum_{i,j} e^{s_i + g_j}}\right)$

where δ is 1 if an answer exists and 0 otherwise.
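A log-space sketch of the no-answer objective above, assuming PyTorch; argument names are illustrative.

```python
import torch

def no_answer_loss(s, g, z, a, b, has_answer):
    # s, g: start/end score vectors; z: scalar no-answer score;
    # a, b: gold start/end indices; has_answer: the delta in the slide.
    pair_scores = (s.unsqueeze(1) + g.unsqueeze(0)).flatten()  # all s_i + g_j
    log_denom = torch.logsumexp(torch.cat([pair_scores, z.view(1)]), dim=0)
    log_num = s[a] + g[b] if has_answer else z
    return log_denom - log_num      # equals -log(numerator / denominator)
```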
Sigmoid
● Sigmoid loss objective function
● Start/end probability for each token: apply a sigmoid function to its start/end score
● A cross-entropy loss is applied to each individual probability
● Scores are computed independently → comparable across paragraphs (sketched below)
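A sketch of the per-token sigmoid loss for the start scores, assuming PyTorch; the end scores are handled the same way.

```python
import torch.nn.functional as F

def sigmoid_start_loss(start_logits, start_labels):
    # start_labels: 1.0 where a token begins an answer span, else 0.0.
    # Each token's probability is independent, so scores stay
    # comparable across paragraphs.
    return F.binary_cross_entropy_with_logits(start_logits, start_labels)
```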
Experiments
Datasets
● TriviaQA (Joshi et al., 2017)
○ TriviaQA unfiltered: questions paired with documents found by running a web search on the questions
○ TriviaQA wiki: the same dataset, but including only Wikipedia articles
○ TriviaQA web: derived from TriviaQA unfiltered by treating each question-document pair where the document contains the answer as an individual training point
● SQuAD (Rajpurkar et al., 2016)
○ A collection of Wikipedia articles and crowdsourced questions
Implementation details
● GloVe word vectors: 300-dimensional
● On SQuAD: 100-dim GRUs, 200-dim linear layers, batch size 45
● On TriviaQA: 140-dim GRUs, 280-dim linear layers, batch size 60
● Optimizer: Adadelta
● During training: maintain an exponential moving average of the weights with a decay rate of 0.999 (sketched below)
● At test time: select the most probable answer span of length at most 8 for TriviaQA and 17 for SQuAD
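A sketch of the weight EMA, assuming PyTorch parameter tensors; it would be called after each optimizer step, and evaluation would use the averaged copies.

```python
import torch

@torch.no_grad()
def update_ema(ema_params, model_params, decay=0.999):
    # ema <- decay * ema + (1 - decay) * current weights
    for ema_p, p in zip(ema_params, model_params):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)
```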
Results
Evaluation scores
● Exact match
○ The fraction of predictions that exactly match any one of the ground-truth answers for the question
● Macro-averaged F1-score
○ Treat the prediction and each ground-truth answer as bags of tokens, then compute their F1 (sketched below)
○ Take the maximum F1 over all ground-truth answers for a given question, then average over all questions
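A sketch of the token-level F1 computation; `prediction_tokens` and `all_truths` are illustrative names.

```python
from collections import Counter

def f1_score(prediction_tokens, truth_tokens):
    # Bag-of-tokens overlap between one prediction and one ground truth.
    common = Counter(prediction_tokens) & Counter(truth_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

def question_f1(prediction_tokens, all_truths):
    # Max over the ground-truth answers; averaged over questions elsewhere.
    return max(f1_score(prediction_tokens, t) for t in all_truths)
```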
TriviaQA web
TriviaQA web & wiki
TriviaQA Unfiltered
SQuAD
Curated TREC
Discussion
● Drawbacks of SQuAD for document-level QA
○ Models trained on SQuAD data perform very poorly in the multi-paragraph setting
○ Reasons:
■ Only paragraph-specific questions are provided
■ All questions are answerable
■ Paragraphs are short
● The shared-norm model keeps performing well even as more paragraphs are added
● The no-answer and merge approaches are effective, but do not provide confidence scores
● The sigmoid objective function reduces paragraph-level performance (Fig. 4) → vulnerable to label noise
Error analysis
● Labeled 200 random TriviaQA web errors of the shared-norm model
● Sources of errors in multi-sentence reading:
1. Connecting multiple statements in the same paragraph
2. Long-range coreference
3. Background knowledge (few cases)
● Suggested directions:
1. Continue advancing sentence- and paragraph-level reading comprehension
2. Add a mechanism to handle document-level coreference
Conclusion
● Proposed techniques:
○ Sampling paragraphs that do not contain an answer
○ Shared-norm objective function
○ Paragraph selection
● This work can be applied to building open-domain question answering systems.
Thank you!
Happy discussion!
