[246] QANet: Towards Efficient and Human-Level Reading Comprehension on SQuAD
Adams Wei Yu
Deview 2018, Seoul
Collaborators: David Dohan, Rui Zhao, Kai Chen, Mohammad Norouzi, Thang Luong, Quoc Le
Bio
Adams Wei Yu
● Ph.D. candidate @ MLD, CMU
○ Advisors: Jaime Carbonell, Alex Smola
○ Large-scale optimization
○ Machine reading comprehension
Question Answering
Concrete answer vs. no clear answer
Early Success
http://www.aaai.org/Magazine/Watson/watson.php
Watson: complex multi-stage system
Moving towards end-to-end systems
● Translation
● Question Answering
Lots of Datasets Available
TriviaQA
Narrative QA
MS Marco
Stanford Question Answering Dataset (SQuAD)
Passage: In education, teachers facilitate student learning, often in a school or academy or perhaps in another environment such as outdoors. A teacher who teaches on an individual basis may be described as a tutor.
Question: What is the role of teachers in education?
Ground truth: facilitate student learning
Prediction 1: facilitate student learning → EM = 1, F1 = 1
Prediction 2: student learning → EM = 0, F1 = 0.8
Prediction 3: teachers facilitate student learning → EM = 0, F1 = 0.86
Data: Crowdsourced 100k question-answer pairs on 500 Wikipedia articles.
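As a concrete illustration of the two metrics above, here is a minimal Python sketch of token-level EM and F1 (it skips SQuAD's official normalization of articles and punctuation); it reproduces the F1 = 0.8 of Prediction 2.

from collections import Counter

def exact_match(prediction, ground_truth):
    return int(prediction.strip() == ground_truth.strip())

def f1_score(prediction, ground_truth):
    # Token-level precision/recall between prediction and ground truth.
    pred_tokens = prediction.split()
    gt_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("student learning", "facilitate student learning"))                       # 0.8
print(f1_score("teachers facilitate student learning", "facilitate student learning"))   # ~0.86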
Roadmap
● Models for text
● General neural structures for QA
● Building blocks for QANet
○ Fully parallel (CNN + Self-attention)
○ data augmentation via back-translation
○ transfer learning from unsupervised tasks
Bag of words
"That movie was awful ." → embed each token → sum → h_out
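A minimal NumPy sketch of this bag-of-words encoder; the vocabulary and embedding matrix below are hypothetical placeholders.

import numpy as np

vocab = {"that": 0, "movie": 1, "was": 2, "awful": 3, ".": 4}
d = 8                                      # embedding dimension (arbitrary)
embed = np.random.randn(len(vocab), d)     # placeholder embedding matrix

def bag_of_words(tokens):
    # Sum the token embeddings; word order is discarded.
    return sum(embed[vocab[t]] for t in tokens)

h_out = bag_of_words(["that", "movie", "was", "awful", "."])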
Continuous bag-of-words and skip-gram
architectures (Mikolov et al., 2013a;
2013b)
Bag of N-grams
"That movie was awful ." → embed each token → convolution over neighboring tokens → sum → h_out
Recurrent Neural Networks
"That movie was awful ." → embed each token → h_t = f(x_t, h_{t-1}), starting from h_init → h_out
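A minimal sketch of the recurrence f(x, h) as a plain (Elman-style) RNN cell; the weight matrices are random placeholders.

import numpy as np

d_in, d_h = 8, 16
W_xh = 0.1 * np.random.randn(d_in, d_h)    # input-to-hidden weights (placeholder)
W_hh = 0.1 * np.random.randn(d_h, d_h)     # hidden-to-hidden weights (placeholder)
b_h = np.zeros(d_h)

def rnn_step(x, h):
    # One step of the recurrence h_t = f(x_t, h_{t-1}).
    return np.tanh(x @ W_xh + h @ W_hh + b_h)

h = np.zeros(d_h)                          # h_init
for x in np.random.randn(5, d_in):         # embedded tokens of the sentence
    h = rnn_step(x, h)
h_out = h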
The quick brown fox jumped over the lazy doo
The quick brown fox jumped over the lazy dog
A feed-forward neural network language
model (Bengio et al., 2001; 2003)
Language Models
Input: <s> The quick brown fox → embed → recurrent states f(x, h) from h_init → project to the vocabulary → predict: The quick brown fox jumped
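Continuing the RNN sketch, a minimal sketch of the language-model head: project each hidden state to vocabulary logits and predict the next token. The projection matrix and vocabulary size are placeholders.

import numpy as np

vocab_size, d_h = 10000, 16
W_proj = 0.1 * np.random.randn(d_h, vocab_size)   # hidden-to-vocabulary projection (placeholder)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def next_word_distribution(h):
    # "project" in the diagram: scores over the vocabulary, normalized to probabilities.
    return softmax(h @ W_proj)

probs = next_word_distribution(np.random.randn(d_h))   # P(next token | prefix)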
Language Models → Seq2Seq
Treat translation as language modeling over the concatenated sequence "Yes please <de> ja bitte": embed → recurrent states from h_init → project → predict "Ja bitte </s>".
Seq2Seq + Attention
Encoder reads "Yes please"; decoder starts from h_init and generates "Ja bitte </s>" from inputs "<s> ja bitte". But how should the decoder look back at the encoder states?
https://distill.pub/2016/augmented-rnns/#
Attention: a weighted average
The cat stuck out its tongue and licked its owner
Convolution:
Different linear transformations by relative position.
The cat stuck out its tongue and licked its owner
Multi-head Attention
Parallel attention layers with different linear transformations on input and output.
The cat stuck out its tongue and licked its owner
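A NumPy sketch of multi-head (self-)attention in the spirit of Vaswani et al.: each head applies its own linear projections, attends with a softmax over scaled dot products, and the heads' outputs are concatenated and projected. All weight matrices here are random placeholders, not trained parameters.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, num_heads=2):
    # X: (N, d) sequence of token representations.
    N, d = X.shape
    d_head = d // num_heads
    heads = []
    for _ in range(num_heads):
        # Per-head linear transformations (random placeholders).
        W_q, W_k, W_v = (np.random.randn(d, d_head) / np.sqrt(d) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_head))   # (N, N) attention weights
        heads.append(weights @ V)                      # weighted average of the values
    W_o = np.random.randn(num_heads * d_head, d) / np.sqrt(d)
    return np.concatenate(heads, axis=-1) @ W_o        # output projection

out = multi_head_self_attention(np.random.randn(10, 8))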
Seq2Seq + Attention
At every decoding step, the decoder computes attention weights (w1, w2, ...) over the encoder states of "Yes please" while generating "Ja bitte </s>".
Language Models with attention
Same setup as the language model above, but each state also attends back over the earlier states of "<s> The quick brown fox" when predicting "The quick brown fox jumped".
Roadmap
● Models for text
● General neural structures for QA
● Building blocks for QANet
○ Fully parallel (CNN + Self-attention)
○ data augmentation via back-translation
○ transfer learning from unsupervised tasks
General (Doc, Question) → Answer Model
General framework neural QA Systems
Bi-directional Attention Flow (BiDAF)
[Seo et al., ICLR’17]
Base Model (BiDAF)
Similar general architectures:
● R-Net [Wang et al., ACL’17]
● DCN [Xiong et al., ICLR’17]
Base Model (BiDAF): all of its encoding layers are RNNs.
Two Challenges with RNNs Remain...
First challenge: hard to capture long-range dependencies
Information must flow step by step through the hidden states h1 → h2 → … → h6, so distant words interact only weakly in a long passage like this one:
Being a long-time fan of Japanese film, I expected more than this. I can't really be
bothered to write too much, as this movie is just so poor. The story might be the cutest
romantic little something ever, pity I couldn't stand the awful acting, the mess they called
pacing, and the standard "quirky" Japanese story. If you've noticed how many Japanese
movies use characters, plots and twists that seem too "different", forcedly so, then steer
clear of this movie. Seriously, a 12-year old could have told you how this movie was
going to move along, and that's not a good thing in my book. Fans of "Beat" Takeshi: his
part in this movie is not really more than a cameo, and unless you're a rabid fan, you
don't need to suffer through this waste of film.
Second challenge: hard to compute in parallel
Strictly Sequential!
What do RNNs Capture? Can we find a substitute for each ingredient?
1. Local context
2. Global interaction
3. Temporal information
Roadmap
● Models for text
● General neural structures for QA
● Building blocks for QANet
○ Fully parallel (CNN + Self-attention)
○ data augmentation via back-translation
○ transfer learning from unsupervised tasks
Convolution: Capturing Local Context
[Figure: a 1-D convolution with filter size k = 2 (and a second filter with k = 3) slides over the d = 3 dimensional embeddings of "The weather is nice today", with zero-padding at the boundary, producing one output per position]
→ k-gram features
→ Fully parallel!
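A minimal NumPy sketch of the 1-D convolution in that figure: filter size k = 2 over d = 3 dimensional embeddings with zero-padding, giving one k-gram feature per position. The embedding and filter values are placeholders, not the numbers from the slide.

import numpy as np

def conv1d(X, W, b=0.0):
    # X: (N, d) token embeddings; W: (k, d) one convolution filter.
    # Zero-pad on the left so the output has one feature per position.
    N, d = X.shape
    k = W.shape[0]
    X_pad = np.vstack([np.zeros((k - 1, d)), X])
    return np.array([np.sum(X_pad[i:i + k] * W) + b for i in range(N)])

X = np.random.randn(5, 3)        # "The weather is nice today", d = 3 (placeholder values)
W = np.random.randn(2, 3)        # one filter with k = 2 (placeholder values)
features = conv1d(X, W)          # 5 k-gram features, computable fully in parallel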
How about Global Interaction?
Stacking convolution layers (layer 1, 2, 3, …) grows the receptive field, but:
1. May need O(log_k N) layers before distant positions interact
2. Interaction may become weaker across many layers
(N: sequence length, k: filter size)
Self-Attention [Vaswani et al., NIPS’17]
Each position becomes a weighted average of all positions. For the word "The" in "The weather is nice today", with token embeddings x1, …, x5:
    output(The) = w1·x1 + w2·x2 + w3·x3 + w4·x4 + w5·x5
    (w1, w2, w3, w4, w5) = softmax(x_The·x1, x_The·x2, x_The·x3, x_The·x4, x_The·x5)
Self-attention is fully parallel & all-to-all!
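A direct NumPy transcription of the formula above: unprojected dot-product self-attention, computed for one position and then for all positions at once. The embedding values are placeholders.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

X = np.random.randn(5, 3)        # embeddings of "The weather is nice today" (placeholders)

scores = X[0] @ X.T              # x_The · x_i for every position i
w = softmax(scores)              # (w1, ..., w5)
output_the = w @ X               # w1·x1 + ... + w5·x5

# All positions at once: fully parallel and all-to-all.
W = np.stack([softmax(row) for row in X @ X.T])
outputs = W @ X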
Complexity (N: sequence length, d: hidden dimension, k: filter size; typically N > d)

             Per Unit    Total Per Layer   Sequential Ops (Path Memory)
Self-Attn    O(N·d)      O(N²·d)           O(1)
Conv         O(k·d²)     O(k·N·d²)         O(1)
RNN          O(d²)       O(N·d²)           O(N)
Explicitly Encode Temporal Info
An RNN encodes positions 1, 2, 3, 4, 5 implicitly through its recurrence; convolution and self-attention instead add an explicit position embedding to every token.
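One common explicit encoding, used by the Transformer and, to my understanding, adopted by QANet as well, is the fixed sinusoidal position embedding. A minimal sketch:

import numpy as np

def positional_encoding(N, d):
    # Sinusoidal position embeddings as in Vaswani et al. (2017):
    #   PE[pos, 2i]   = sin(pos / 10000^(2i/d))
    #   PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
    pos = np.arange(N)[:, None]
    i = np.arange(0, d, 2)[None, :]
    angles = pos / np.power(10000.0, i / d)
    pe = np.zeros((N, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

X = np.random.randn(5, 8)            # token representations (placeholder)
X = X + positional_encoding(5, 8)    # the "+" in the figure: inject position info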
QANet Encoder [Yu et al., ICLR’18]
One encoder block (bottom to top), with each sub-layer wrapped in layer norm and a residual connection ("+"):
● Position Emb
● Convolution (× several, "Repeat")
● Self Attention
● Feedforward
Stack more blocks if you want to go deeper.
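A schematic sketch of one such block, showing only the layer-norm + residual wiring; the sub-layers are passed in as functions (the conv and self-attention sketches above would fill them in), and num_convs = 4 is illustrative rather than the paper's exact setting.

import numpy as np

def layer_norm(X, eps=1e-6):
    mu = X.mean(axis=-1, keepdims=True)
    sigma = X.std(axis=-1, keepdims=True)
    return (X - mu) / (sigma + eps)

def residual(sublayer, X):
    # Each "+" in the diagram: normalize, apply the sub-layer, add the input back.
    return X + sublayer(layer_norm(X))

def encoder_block(X, conv, self_attention, feedforward, num_convs=4):
    # Position embedding is assumed to have been added to X already (see the sketch above).
    for _ in range(num_convs):                 # the "Repeat" around the convolution
        X = residual(conv, X)
    X = residual(self_attention, X)
    X = residual(feedforward, X)
    return X

# Dummy sub-layers just to show the wiring.
identity = lambda X: X
out = encoder_block(np.random.randn(5, 8), identity, identity, identity)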
Base Model (BiDAF) → QANet: replace every RNN in the architecture with a QANet encoder block.
QANet – 130+ layers (Deepest NLP NN)
QANet – First QA system with No Recurrence
● Very fast!
○ Training: 3x - 13x
○ Inference: 4x - 9x
QANet – 130+ layers (Deepest NLP NN)
Tricks needed to train such a deep network:
● Layer normalization
● Residual connections
● L2 regularization
● Stochastic depth
● Squeeze-and-Excitation
● ...
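Of these, stochastic depth may be the least familiar: during training, each residual sub-layer is skipped entirely with some probability, and at test time its contribution is scaled by the survival probability. A minimal sketch (the survival probability here is illustrative):

import numpy as np

def stochastic_depth_residual(sublayer, X, survival_prob=0.9, training=True):
    # Training: keep the sub-layer with probability survival_prob, otherwise
    # pass the input straight through. Test: always apply it, scaled by survival_prob.
    if training:
        if np.random.rand() < survival_prob:
            return X + sublayer(X)
        return X
    return X + survival_prob * sublayer(X)

out = stochastic_depth_residual(lambda X: 0.1 * X, np.random.randn(5, 8))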
Roadmap
● Models for text
● General neural structures for QA
● Building blocks for QANet
○ Fully parallel (CNN + Self-attention)
○ data augmentation via back-translation
○ transfer learning from unsupervised tasks
Data augmentation: popular in vision & speech
More data with NMT back-translation
Input: Previously, tea had been used primarily for Buddhist monks to stay awake during meditation.
→ Translate English → French: Autrefois, le thé avait été utilisé surtout pour les moines bouddhistes pour rester éveillé pendant la méditation.
→ Translate French → English (Paraphrase): In the past, tea was used mostly for Buddhist monks to stay awake during the meditation.
More data with NMT back-translation
Each training example yields two examples:
● (Input, label)
● (Paraphrase, label)
Applicable to virtually any NLP task!
QANet augmentation
Back-translate the training passages to get paraphrases with the same labels.
Use 2 language pairs: English-French, English-German. 3x data.
Improvement: +1.1 F1
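A minimal sketch of the augmentation loop; translate_en_fr and translate_fr_en are hypothetical stand-ins for real NMT models, and the real pipeline must also re-locate the answer span inside the paraphrased passage, which this sketch glosses over.

def translate_en_fr(text):
    raise NotImplementedError("placeholder for an English → French NMT model")

def translate_fr_en(text):
    raise NotImplementedError("placeholder for a French → English NMT model")

def back_translate(passage):
    return translate_fr_en(translate_en_fr(passage))

def augment(dataset):
    # dataset: list of (passage, question, answer) triples.
    augmented = list(dataset)
    for passage, question, answer in dataset:
        paraphrase = back_translate(passage)
        augmented.append((paraphrase, question, answer))   # same label, paraphrased passage
    return augmented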
Roadmap
● Models for text
● General neural structures for QA
● Building blocks for QANet
○ Fully parallel (CNN + Self-attention)
○ data augmentation via back-translation
○ transfer learning from unsupervised tasks
Transfer learning for richer representation
Language Models (recap)
Input: <s> The quick brown fox → embed → recurrent states from h_init → project → predict: The quick brown fox jumped
Sebastian Ruder @ Indaba 2018
Transfer learning for richer representation
● Pretrained language model (ELMo, [Peters et al., NAACL’18])
○ + 4.0 F1
● Pretrained machine translation model (CoVe, [McCann et al., NIPS’17])
○ + 0.3 F1
QANet – 3 key ideas
● Deep Architecture without RNN
○ 130-layer (Deepest in NLP)
● Transfer Learning
○ leverage unlabeled data
● Data Augmentation
○ with back-translation
#1 on SQuAD (Mar-Aug 2018)
QA is not Solved!!
Thank you!