Reading Wikipedia to Answer
Open-Domain Questions (DrQA)
Danqi Chen, Adam Fisch, Jason Weston & Antoine Bordes
(Stanford Univ. & Facebook AI Research)
ACL 2017 - Poster
์„œ๊ฐ•๋Œ€ํ•™๊ต ์ž์—ฐ์–ด์ฒ˜๋ฆฌ ์—ฐ๊ตฌ์‹ค
2017-08-23
ํ—ˆ๊ด‘ํ˜ธ
1
2
Abstract
โ€ข To tackle open-domain question answering
โ€ข using Wikipedia as the unique knowledge source.
โ€ข MRS: โ€œMachine reading at scaleโ€
1) Document Retriever (finding relevant articles): information retrieval
• Search component (bigram hashing + TF-IDF matching)
2) Document Reader (identifying the answer from the articles): information extraction
• Answer detection (multi-layer RNN)
• Multitask learning + distant supervision improve full-system performance
3
Introduction
โ€ข Wikipedia as a knowledge base (KB)
- A constantly evolving source of detailed information that could facilitate intelligent
machines.
- Contains up-to-date knowledge that humans are interested in.
- However, Wikipedia is designed for humans to read (not machines).
4
Introduction - ๋ณธ ์—ฐ๊ตฌ์˜ ํŠน์ง•
โ€ข 1) Wikipedia article๋งŒ ์‚ฌ์šฉํ•˜๊ณ  graph structure ๋“ฑ meta ์ •๋ณด๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š์Œ.
โ€ข Genericํ•œ ํŠน์ง• โ€“ KB๋ฅผ ๊ธฐํƒ€ documents, books, daily updated newspapers ๋กœ ์‰ฝ๊ฒŒ ๋ณ€ํ™˜ ๊ฐ€๋Šฅ.
โ€ข 2) Wikipedia๋งŒ KB๋กœ ์‚ฌ์šฉ
โ€ข IBM์˜ DeepQA ๋Š” ์—ฌ๋Ÿฌ KB๋ฅผ ์ค‘๋ณต์œผ๋กœ ์‚ฌ์šฉํ•˜์—ฌ Information redundancy๋ฅผ ์‚ฌ์šฉํ•จ.
โ€ข ๋ฌธ์ œ๋Š” ๋ฌธ์„œ์— Evidence๊ฐ€ ํ•œ๋ฒˆ๋งŒ ๋‚˜ํƒ€๋‚œ ๊ฒฝ์šฐ, Answer๋ฅผ ์ •ํ™•(precise)ํ•˜๊ฒŒ ์ฐพ์•„๋‚ด๊ธฐ ์–ด๋ ค์›€.
โ€ข 3) ๊ฒ€์ƒ‰ ๊ธฐ๋Šฅ์„ ํ†ตํ•ฉํ•œ ์˜คํ”ˆ ๋„๋ฉ”์ธ Q&A ์‹œ์Šคํ…œ
โ€ข ์ผ๋ฐ˜์ ์œผ๋กœ QA ์‹œ์Šคํ…œ์€ โ€œQuestionโ€๊ณผ โ€œAnswer๊ฐ€ ํฌํ•จ๋œ short textโ€ ์ž…๋ ฅ์œผ๋กœ ์ฃผ๊ณ 
๊ทธ short text ์ค‘์—์„œ Answer ๋ถ€๋ถ„์˜ ์‹œ์ž‘, ๋ ์œ„์น˜๋ฅผ ์ฐ์–ด์ฃผ๋Š” ๋ฌธ์ œ์ž„.
(Machine comprehension of text, or Information extraction)
โ€ข ์ด๋Ÿฌํ•œ ๋ฐฉ๋ฒ•์€ Open-domain QA ์‹œ์Šคํ…œ์„ ๊ตฌ์ถ•ํ•˜๋Š”๋ฐ ์žˆ์–ด์„œ ๋น„ ํ˜„์‹ค์ ์ž„.
5
Related Work (1/2)
โ€ข KB ๋ฐœ์ „๊ณผ ํ•จ๊ป˜ KB-based QA ์‹œ์Šคํ…œ๋“ค์ด ์ œ์•ˆ๋จ.
โ€ข WebQuestion (Berant et al., 2013)
โ€ข SimpleQuestions (Bordes et al., 2015)
โ€ข KB: Freebase KB, OpenIE triples and NELL
โ€ข KB์˜ ๋ฌธ์ œ์  (incompleteness, fixed schemas)๋“ค ๋•Œ๋ฌธ์—
โ€ข ๋‹ค์‹œ raw text์—์„œ answer๋ฅผ ์ฐพ์•„๋‚ด๋Š” original ๋ฐฉ์‹์œผ๋กœ ๋Œ์•„ ๊ฐ.
โ€ข ๋˜ ๋‹ค๋ฅธ ์ด์œ ๋Š” deep learning์œผ๋กœ ์ธํ•œ machine comprehension of text ์„ฑ๋Šฅ ํ–ฅ์ƒ.
โ€ข ๋ชจ๋ธ (Attention-based and Memory neural networks.) ๊ณผ
โ€ข ์ฝ”ํผ์Šค (QuizBowl, CNN/Daily Mail, SQuAD and WikiReading)
6
Related Work (2/2)
Several highly developed full-pipeline QA systems
• Web-based
• QuASE (Sun et al., 2015)
• Wikipedia-based
• Microsoft's AskMSR (Brill et al., 2002)
• Search-engine-based QA system; uses only data redundancy, without linguistic analysis.
• IBM's DeepQA (Ferrucci et al., 2010)
• A sophisticated QA system
• Uses unstructured documents, KBs, databases, and ontologies.
• YodaQA (Baudis, 2015): open source!
• Uses websites, information extraction, databases, and Wikipedia.
• Used for performance comparison in the paper's experiments.
7
Proposed System: DrQA
8
์ถœ์ฒ˜: Reading Wikipedia to Answer Open-Domain Questions
1933๋…„์— ํด๋ž€๋“œ์–ด๋ฅผ ๊ตฌ์‚ฌํ•˜๋Š” ๋ฐ”๋ฅด์ƒค๋ฐ” ์ฃผ๋ฏผ ์ˆ˜๋Š” ์–ผ๋งˆ์ž…๋‹ˆ๊นŒ?
Document Retriever
• A simple inverted index + term-vector model scoring (TF-IDF)
• Retrieves relevant documents by comparing articles and the question as TF-IDF weighted bag-of-words vectors.
• Additionally uses bigram features so that word order is taken into account.
• For speed and memory efficiency,
• bigrams are mapped to bins with the feature hashing of (Weinberger et al., 2009), using an unsigned murmur3 hash.
• Given a question, the 5 most relevant Wikipedia articles are retrieved
and passed to the Document Reader.
9
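To make the retrieval step concrete, here is a minimal sketch of the idea (not the authors' released code): unigram and bigram features are hashed into a fixed number of bins, weighted by TF-IDF, and the question is scored against each article by cosine similarity. The bin count, the tokenization, and the use of Python's built-in hash() in place of an unsigned murmur3 hash are all illustrative assumptions.

```python
import math
from collections import Counter

NUM_BUCKETS = 2 ** 24  # hashed feature space; hash() stands in for murmur3 here

def ngrams(tokens, n_max=2):
    """Yield all unigrams and bigrams of a token list."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def hashed_counts(text):
    """Hash each n-gram into a bucket and count occurrences."""
    tokens = text.lower().split()
    return Counter(hash(g) % NUM_BUCKETS for g in ngrams(tokens))

def tfidf_vector(counts, doc_freq, num_docs):
    """Re-weight bucket counts by a smoothed inverse document frequency."""
    return {b: tf * math.log((num_docs + 1) / (doc_freq.get(b, 0) + 1))
            for b, tf in counts.items()}

def cosine(u, v):
    dot = sum(u[b] * v.get(b, 0.0) for b in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k_articles(question, articles, k=5):
    """Return the indices of the k articles most similar to the question."""
    doc_counts = [hashed_counts(a) for a in articles]
    doc_freq = Counter(b for c in doc_counts for b in c)
    docs = [tfidf_vector(c, doc_freq, len(articles)) for c in doc_counts]
    q = tfidf_vector(hashed_counts(question), doc_freq, len(articles))
    ranked = sorted(range(len(articles)), key=lambda i: cosine(q, docs[i]), reverse=True)
    return ranked[:k]
```

The 5 returned article indices would then be handed to the Document Reader, mirroring the pipeline on this slide.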
Document Reader
โ€ข Paragraph encoding
โ€ข Question encoding
โ€ข Prediction
โ€ข ๋ฌธ์„œ ๋‚ด์—์„œ Answer ์œ„์น˜ ๊ฒฐ์ •
10
Document Reader
โ€ข Inspired by AttentiveReader (Hermann et al., 2015; Chen et al., 2016)
โ€ข Notations
โ€ข A question( ๐‘ž ) - ๐‘™ ๊ฐœ token์œผ๋กœ ๊ตฌ์„ฑ: ๐‘ž1, ๐‘ž2, โ€ฆ ๐‘ž๐‘™
โ€ข A document or a single paragraph ( ๐‘ ) - ๐‘š๊ฐœ token์œผ๋กœ ๊ตฌ์„ฑ: ๐‘1, ๐‘2, โ€ฆ ๐‘ ๐‘š
โ€ข Paragraph encoding
โ€ข ๋ฌธ๋‹จ ๋‚ด ๋ชจ๋“  token ๐‘๐‘– ๋ฅผ feature vector ๋กœ ๋ณ€ํ™˜ (๋‹ค์Œ ์Šฌ๋ผ์ด๋“œ์—์„œ ์„ค๋ช…)
โ€ข Feature vector๋ฅผ RNN์— ์ž…๋ ฅํ•˜์—ฌ ๊ฐ token ๐‘๐‘– ์˜ ์ฃผ๋ณ€ context ์ •๋ณด๋ฅผ ๋‹ด์€ ๐ฉ๐ข ๋ฅผ ์–ป์Œ.
โ€ข RNN โ†’ multi-layer bidirectional LSTM
11
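As a concrete illustration of this encoding step, a minimal PyTorch sketch follows; the feature dimensionality, hidden size, and layer count are assumed values, not the paper's exact settings.

```python
import torch
import torch.nn as nn

feature_dim, hidden_dim, num_layers = 304, 128, 3   # illustrative sizes

# Multi-layer bidirectional LSTM over the per-token feature vectors p~_i.
encoder = nn.LSTM(feature_dim, hidden_dim, num_layers=num_layers,
                  bidirectional=True, batch_first=True)

p_tilde = torch.randn(1, 40, feature_dim)   # (batch, m paragraph tokens, feature dim)
p, _ = encoder(p_tilde)                      # p: (1, 40, 2 * hidden_dim), one encoding p_i per token
```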
Feature vector p̃_i Representation (1/2)
• p̃_i is composed of 4 parts
• 1. Word embeddings
• 300-dimensional GloVe vectors trained on 840B tokens of Web-crawl data.
• Most word embeddings are kept fixed; only the 1000 most frequent question tokens are fine-tuned
→ words such as "what, how, which, many" can be important to a QA system.
• 2. Exact match
• 3 binary features: whether token p_i exactly matches a word of question q
• in its original, lower-cased, or lemma form
• * Extremely helpful!
12
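A small sketch of how the three exact-match indicators could be computed per paragraph token; whitespace tokenization and the toy suffix-stripping lemmatizer (standing in for a real lemmatizer) are assumptions.

```python
def toy_lemma(token):
    # Hypothetical stand-in for a real lemmatizer.
    return token.lower().rstrip("s")

def exact_match_features(paragraph_tokens, question_tokens):
    """Return, per paragraph token, 3 binary features: match against any
    question word in original, lower-cased, or lemma form."""
    q_orig = set(question_tokens)
    q_lower = {t.lower() for t in question_tokens}
    q_lemma = {toy_lemma(t) for t in question_tokens}
    return [[float(tok in q_orig),
             float(tok.lower() in q_lower),
             float(toy_lemma(tok) in q_lemma)]
            for tok in paragraph_tokens]
```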
Feature vector p̃_i Representation (2/2)
• p̃_i is composed of 4 parts
• 3. Token features: POS tag, NER tag, and normalized term frequency (TF) of p_i
• 4. Aligned question embedding
• f_align(p_i) = Σ_j a_{i,j} E(q_j), a sum of question word embeddings weighted by attention scores a_{i,j}
• a_{i,j} is a softmax over the question words q_j of the similarity between p_i and q_j, computed from projections of their word embeddings
• Compared with the exact match feature, this allows soft alignment between similar but non-identical words.
• e.g. car and vehicle
13
Document Reader
• Question encoding
• Another RNN is applied over the word embeddings of q = q_1, q_2, ..., q_l, and the resulting hidden states q_1, ..., q_l are combined into a single vector q.
• q = Σ_j b_j q_j, a weighted sum where b_j = exp(w · q_j) / Σ_j' exp(w · q_j') encodes the importance of each question word, and w is a weight vector to learn.
• Prediction
• Train 2 classifiers independently over the paragraph encodings:
P_start(i) ∝ exp(p_i W_s q),  P_end(i') ∝ exp(p_i' W_e q)
• Find i and i' with i ≤ i' ≤ i + 15 such that P_start(i) × P_end(i') is maximized.
14
Data (1/2)
โ€ข Wikipedia (Knowledge Source)
โ€ข โ‰ˆ5M articles, โ‰ˆ 9M unique uncased token
โ€ข SQuAD (Rajpukar et a., 2016) QA ํ•ต์‹ฌ Dataset !!
โ€ข Training 87k, Development 10k
โ€ข ๋ฐ์ดํ„ฐ ํ˜•์‹ A human generated Question + A paragraph contains answer span.
โ€ข Document Reader๋ฅผ ํ›ˆ๋ จํ•จ
โ€ข Open-domain QA ํ‰๊ฐ€๋ฅผ ์œ„ํ•œ Datasets
โ€ข CuratedTREC โ€“ (Baudis and Sedivy 2015) (2,180 questions from TREC 1999~2004)
โ€ข WebQuestions - (Berant et al., 2013)
โ€ข WikiMovies โ€“ (Miller et al., 2016) (96k question-answer pairs in domain of movies)
15
Data (2/2)
โ€ข Distantly Supervised Data
โ€ข ์•ž์„œ ์†Œ๊ฐœํ•œ Dataset ์ค‘, CuratedTREC, WebQuestions, WikiMovies๋Š”
โ€ข Question-answer pair๋งŒ ์žˆ๊ณ , ๊ด€๋ จ document๋‚˜ paragraph๋ฅผ ํฌํ•จํ•˜์ง€ ์•Š์Œ.
โ€ข ๋”ฐ๋ผ์„œ Document Reader์˜ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ๋กœ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•˜์—ฌ
โ€ข Distant Supervision ๊ธฐ๋ฒ•์„ ์ด์šฉํ•˜์—ฌ Wikipedia์—์„œ ์—ฐ๊ด€์„ฑ ๋ฌธ์„œ๋ฅผ ๊ฒ€์ƒ‰ํ•˜์—ฌ
โ€ข Weakly tagged training data๋ฅผ ๊ตฌ์„ฑํ•จ. (Detail ์ƒ๋žต, ๋…ผ๋ฌธ ์ฐธ๊ณ )
16
Experiments (1/3)
โ€ข Document Retriever ์„ฑ๋Šฅ
โ€ข Wikipedia ๊ฒ€์ƒ‰์—”์ง„ โ€“ ElasticSearch ์™€ ๋น„๊ตํ•œ ๊ฒฐ๊ณผ
17
Question์œผ๋กœ 5๊ฐœ Document๋ฅผ ๊ฒ€์ƒ‰ํ–ˆ์„ ๋•Œ
Answer span์ด Top-5์— ํฌํ•จ๋œ ๋น„์œจ
Experiments (2/3)
โ€ข Document Reader ์„ฑ๋Šฅ
18
SQuAD Dataset์—์„œ์˜ ์„ฑ๋Šฅ
Paragraph encoding feature ๋“ค์— ๋Œ€ํ•œ
Ablation analysis (์‚ญ๋งˆ)๋ถ„์„ ๊ฒฐ๊ณผ
Experiments (3/3)
โ€ข Full Wikipedia QA ์„ฑ๋Šฅ
19
Conclusion
โ€ข Wikipedia ๋ฌธ์žฅ๋งŒ์„ ์‚ฌ์šฉํ•˜์—ฌ QA ์‹œ์Šคํ…œ์„ ๊ตฌ์ถ•.
โ€ข MRS is a challenging task for researchers to focus on.
โ€ข MRS (Machine reading at scale)
โ€ข Search, Distant supervision, and Multitask learning ๋“ฑ ๊ธฐ๋ฒ•์„ ํ†ตํ•ฉํ•˜์—ฌ Open-Domain
QA ์‹œ์Šคํ…œ์„ ์ œ์•ˆ.
20
Future work
โ€ข DrQA ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์„ ๊ณ ๋ฏผ.
โ€ข (i) ํ›ˆ๋ จ ๊ณผ์ •์—์„œ Document Reader๊ฐ€ paragraphs์™€ documents์—์„œ
๋ˆ„์ ํ•œ Fact๋ฅผ ์ด์šฉ. (Triples??)
โ€ข (ii) ๊ธฐ์กด Document Retriever์™€ Document Reader์˜ Pipeline ๊ตฌ์กฐ์—์„œ
End-to-End Training์„ ํ•ด๋ณด๊ณ ์ž ํ•จ
21
Distant supervision
โ€ข Most machine learning techniques require a set of training data. A traditional approach for
collecting training data is to have humans label a set of documents. For example, for the
marriage relation, human annotators may label the pair "Bill Clinton" and "Hillary Clinton" as a
positive training example. This approach is expensive in terms of both time and money, and if
our corpus is large, will not yield enough data for our algorithms to work with. And because
humans make errors, the resulting training data will most likely be noisy.
โ€ข An alternative approach to generating training data is distant supervision. In distant
supervision, we make use of an already existing database, such as Freebase or a domain-specific
database, to collect examples for the relation we want to extract. We then use these examples
to automatically generate our training data. For example, Freebase contains the fact that Barack
Obama and Michelle Obama are married. We take this fact, and then label each pair of "Barack
Obama" and "Michelle Obama" that appear in the same sentence as a positive example for our
marriage relation. This way we can easily generate a large amount of (possibly noisy) training
data. Applying distant supervision to get positive examples for a particular relation is easy,
but generating negative examples is more of an art than a science.
22
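To connect this description back to how DrQA builds weakly tagged training data for its reader, here is a simplified sketch of distant supervision for reading comprehension; it is an illustration under assumptions (naive paragraph splitting, plain substring matching, the hypothetical retrieve_top5 function) and omits the paper's additional filtering heuristics.

```python
def build_distant_examples(qa_pairs, retrieve_top5):
    """Pair each question with paragraphs (from retrieved documents) that contain
    an answer string, recording the matched span as noisy supervision."""
    examples = []
    for question, answers in qa_pairs:                  # answers: list of acceptable strings
        for doc in retrieve_top5(question):             # hypothetical retrieval function
            for paragraph in doc.split("\n\n"):         # naive paragraph splitting
                for ans in answers:
                    start = paragraph.lower().find(ans.lower())
                    if start != -1:                     # answer occurs in this paragraph
                        examples.append({
                            "question": question,
                            "paragraph": paragraph,
                            "answer_start": start,
                            "answer_end": start + len(ans),
                        })
                        break                           # one example per paragraph suffices here
    return examples
```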