Deep Reasoning
2016-03-16
Taehoon Kim
carpedm20@gmail.com
References
1. [Sukhbaatar, 2015] Sukhbaatar, Szlam, Weston, Fergus. “End-To-End Memory Networks” Advances in
Neural Information Processing Systems. 2015.
2. [Hill, 2015] Hill, Bordes, Chopra, Weston. “The Goldilocks Principle: Reading Children's Books with Explicit
Memory Representations” arXiv preprint arXiv:1511.02301 (2015).
3. [Kumar, 2015] Kumar, Irsoy, Ondruska, Iyyer, Bradbury, Gulrajani, Zhong, Paulus, Socher. “Ask Me
Anything: Dynamic Memory Networks for Natural Language Processing” arXiv preprint
arXiv:1506.07285 (2015).
4. [Xiong, 2016] Xiong, Merity, Socher. “Dynamic Memory Networks for Visual and Textual Question
Answering” arXiv preprint arXiv:1603.01417 (2016).
5. [Yin, 2015] Yin, Schütze, Xiang, Zhou. “ABCNN: Attention-Based Convolutional Neural Network for
Modeling Sentence Pairs” arXiv preprint arXiv:1512.05193 (2015).
6. [Yu, 2015] Yu, Zhang, Hang, Xiang, Zhou. “Empirical Study on Deep Learning Models for Question
Answering” arXiv preprint arXiv:1510.07526 (2015).
7. [Hermann, 2015] Hermann, Kočiský, Grefenstette, Espeholt, Kay, Suleyman, Blunsom. “Teaching
Machines to Read and Comprehend” arXiv preprint arXiv:1506.03340 (2015).
8. [Kadlec, 2016] Kadlec, Schmid, Bajgar, Kleindienst. “Text Understanding with the Attention Sum Reader
Network” arXiv preprint arXiv:1603.01547 (2016).
2
References
9. [Miao, 2015] Miao, Yu, Blunsom. “Neural Variational Inference for Text Processing” arXiv preprint
arXiv:1511.06038 (2015).
10. [Kingma, 2013] Kingma, Welling. “Auto-Encoding Variational Bayes” arXiv preprint
arXiv:1312.6114 (2013).
11. [Sohn, 2015] Sohn, Lee, Yan. “Learning Structured Output Representation
using Deep Conditional Generative Models” Advances in Neural Information Processing Systems. 2015.
3
Models
Answer selection (WikiQA): ABCNN, Variational, Attentive Pooling
General QA (CNN): E2E MN, Impatient Attentive Reader, Attentive (Impatient) Reader, Attention Sum Reader
Considered transitive inference (bAbI): E2E MN, DMN, ReasoningNet, NTM
4
End-to-End Memory Network [Sukhbaatar, 2015]
5
End-to-End Memory Network [Sukhbaatar, 2015]
6
[Diagram: single-layer End-to-End Memory Network. The story sentences (“I go to school.”, “He gets ball.”, …) are embedded twice: with matrix A into the input memory and with matrix C into the output memory. The question (“Where does he go?”) is embedded with matrix B into u. Attention is a softmax over the inner products of u with the input memories; the output o is the attention-weighted sum of the output memories; o and u are summed and fed to a final linear (softmax) layer.]
End-to-End Memory Network [Sukhbaatar, 2015]
7
[Diagram: stacked memory hops (three in the figure) over the story, with the question feeding the first hop and a final linear layer producing the answer.]

Sentence representation:
- i-th sentence: x_i = {x_i1, x_i2, …, x_in}
- BoW: m_i = Σ_j A x_ij
- Position Encoding: m_i = Σ_j l_j ∘ A x_ij
- Temporal Encoding: m_i = Σ_j A x_ij + T_A(i)
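A minimal numpy sketch of one memory hop with position encoding, under toy dimensions and random stand-in embedding matrices A, B, C; the l_kj weighting below follows the position-encoding formula of [Sukhbaatar, 2015].

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n_sents, J = 50, 20, 4, 6                          # vocab, embed dim, sentences, words/sentence
A, B, C = (rng.normal(size=(V, d)) for _ in range(3))    # input / question / output embedding matrices
story = rng.integers(0, V, size=(n_sents, J))            # word ids x_ij
question = rng.integers(0, V, size=J)

def position_encoding(J, d):
    # l_kj = (1 - j/J) - (k/d) * (1 - 2j/J), the position-encoding weights from the paper
    j = (np.arange(1, J + 1) / J)[:, None]
    k = (np.arange(1, d + 1) / d)[None, :]
    return (1 - j) - k * (1 - 2 * j)

l = position_encoding(J, d)
m = np.stack([(l * A[s]).sum(axis=0) for s in story])    # input memories  m_i = sum_j l_j ∘ A x_ij
c = np.stack([(l * C[s]).sum(axis=0) for s in story])    # output memories c_i
u = (l * B[question]).sum(axis=0)                        # question embedding u

scores = m @ u                                           # inner products u^T m_i
p = np.exp(scores - scores.max()); p /= p.sum()          # softmax attention over memories
o = p @ c                                                # weighted sum of output memories
print("attention:", p.round(3), "| (o + u) feeds the final linear layer, shape", (o + u).shape)
```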
Training details
Linear Start (LS) helps avoid local minima
- First train with softmax in each memory layer removed, making the model entirely linear except
for the final softmax
- When the validation loss stopped decreasing, the softmax layers were re-inserted and training
recommenced
RNN-style layer-wise weight tying
- The input and output embeddings are the same across different layers
Learning time invariance by injecting random noise
- Jittering the time index with random empty memories
- Add “dummy” memories to regularize 𝑇4(𝑖)
8
Example of bAbI tasks
9
The Goldilocks Principle: Reading Children's Books with Explicit
Memory Representations [Hill, 2016]
10
The Goldilocks Principle: Reading Children's Books with Explicit
Memory Representations [Hill, 2016]
11
• Context sentences: S = {s_1, s_2, …, s_n}, where each s_i is a BoW word representation
• Encoded memories: m_s = φ(s) ∀ s ∈ S
• Lexical memory
• Each word occupies a separate slot in the memory
• s is a single word and φ(s) has only one non-zero feature
• Multiple hops are only beneficial in this memory model
• Window memory (best)
• s corresponds to a window of text from the context S centered on an individual mention of a candidate c in S:
φ(s) = w_{i−(b−1)/2} … w_i … w_{i+(b−1)/2}
• where w_i ∈ C is an instance of one of the candidate words and b is the window size
• Sentential memory
• Same as original implementation of Memory Network
The Goldilocks Principle: Reading Children's Books with Explicit
Memory Representations [Hill, 2016]
12
Self-supervision for window memories
- Memory supervision (knowing which memories to attend to) is not provided at training time
- Gradient steps (SGD) force the model to give a higher score to the supporting memory m̃ than to any memory from any other candidate, using:
  Hard attention (training and testing): m_o1 = argmax_{i=1,…,n} c_i^⊤ q
  Soft attention (testing): m_o1 = Σ_{i=1…n} α_i m_i, with α_i = exp(c_i^⊤ q) / Σ_j exp(c_j^⊤ q)
- If m_o1 differs from m̃ (the memory containing the true answer), the model is updated (see the sketch below)
- Can be understood as a way of achieving hard attention over memories (no new label information beyond the training data is needed)
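A hedged numpy sketch of this self-supervision step, with random stand-in window and query embeddings: score each candidate window against the query, pick the arg-max (hard attention), and treat a mismatch with the true supporting window as a training signal. The margin written out here is only illustrative of the ranking idea; the exact loss and update in [Hill, 2015] may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_windows = 16, 8
c = rng.normal(size=(n_windows, d))   # encoded candidate windows c_i = phi(s_i)
q = rng.normal(size=d)                # encoded query
true_idx = 3                          # index of the supporting window m~ (known from the answer word)

scores = c @ q                        # c_i^T q
picked = int(np.argmax(scores))       # hard attention: m_o1 = argmax_i c_i^T q

if picked != true_idx:
    # ranking-style signal: the supporting window should outscore the picked one;
    # a real implementation would backpropagate this through the embeddings
    margin_loss = max(0.0, 1.0 + scores[picked] - scores[true_idx])
    print("update needed, margin loss =", round(float(margin_loss), 3))

# soft attention (test time): alpha_i = exp(c_i^T q) / sum_j exp(c_j^T q)
alpha = np.exp(scores - scores.max()); alpha /= alpha.sum()
m_soft = alpha @ c
```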
The Goldilocks Principle: Reading Children's Books with Explicit
Memory Representations [Hill, 2016]
13
Gated Recurrent Network (GRU)
14
[Diagram: GRU cell. The reset gate r_t and update gate z_t are computed from x_t and h_{t−1}; the candidate state h̃_t uses r_t ∘ h_{t−1}; the output is h_t = z_t ∘ h̃_t + (1 − z_t) ∘ h_{t−1}.]
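For reference, a plain-numpy GRU cell matching the gating shown in the diagram; weight shapes and initialization are illustrative only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W, U, b):
    # W, U, b hold the parameters of the reset, update and candidate transforms
    r = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])          # reset gate r_t
    z = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])          # update gate z_t
    h_tilde = np.tanh(W["h"] @ x_t + U["h"] @ (r * h_prev) + b["h"])  # candidate state
    return z * h_tilde + (1.0 - z) * h_prev                       # h_t

rng = np.random.default_rng(2)
d_in, d_h = 8, 16
W = {k: rng.normal(size=(d_h, d_in)) * 0.1 for k in "rzh"}
U = {k: rng.normal(size=(d_h, d_h)) * 0.1 for k in "rzh"}
b = {k: np.zeros(d_h) for k in "rzh"}

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):   # run over a toy input sequence
    h = gru_step(x, h, W, U, b)
print(h.shape)
```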
Ask Me Anything: Dynamic Memory Networks for Natural Language
Processing [Kumar, 2015]
15
Ask Me Anything: Dynamic Memory Networks for Natural Language
Processing [Kumar, 2015]
16
[Diagram: DMN overview. Input Module: the story (“I go to school.”, “He gets ball.”, …) is GloVe-embedded and run through a GRU. Question Module: the question (“Where does he go?”) is GloVe-embedded and run through a GRU, producing q. Episodic Memory Module: gates g_t^i attend over the input states across passes. Answer Module: a GRU conditioned on q generates the answer words y_t until <EOS>.]
Ask Me Anything: Dynamic Memory Networks for Natural Language
Processing [Kumar, 2015]
17
[Diagram: the DMN computation, annotated with the equations below.]
- Input module: c_t = h_t = GRU(L[w_t^I], h_{t−1}) over the GloVe-embedded story words
- Question module: q_t = GRU(L[w_t^Q], q_{t−1})
- Gate: g_t^i = G(c_t, m^{i−1}, q)
- Episode (modified GRU): h_t^i = g_t^i GRU(c_t, h_{t−1}^i) + (1 − g_t^i) h_{t−1}^i, with e^i = h_{T_C}^i
- Each episode e^i yields a new memory m^i
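A numpy sketch of the gate and the gated episode update listed above. The similarity feature vector fed to G here is a simplified stand-in for the paper's full feature set, and the inner GRU is a toy cell.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
d = 16
W1, b1 = rng.normal(size=(32, 4 * d)) * 0.1, np.zeros(32)
W2, b2 = rng.normal(size=32) * 0.1, 0.0

def gate(c_t, m_prev, q):
    # similarity features between c, m and q (a subset of the paper's feature vector)
    z = np.concatenate([c_t * q, c_t * m_prev, np.abs(c_t - q), np.abs(c_t - m_prev)])
    return float(sigmoid(W2 @ np.tanh(W1 @ z + b1) + b2))   # two-layer feed-forward G

def gru_step(c_t, h_prev):            # stand-in for a full GRU cell
    return np.tanh(0.5 * c_t + 0.5 * h_prev)

c, q, m_prev = rng.normal(size=(6, d)), rng.normal(size=d), rng.normal(size=d)
h = np.zeros(d)
for c_t in c:                          # one pass over the fact sequence
    g = gate(c_t, m_prev, q)
    h = g * gru_step(c_t, h) + (1.0 - g) * h   # h_t^i = g GRU(c_t, h_{t-1}^i) + (1-g) h_{t-1}^i
e_i = h                                # episode e^i = h_{T_C}^i
```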
Ask Me Anything: Dynamic Memory Networks for Natural Language
Processing [Kumar, 2015]
18
[Diagram: the same DMN pipeline, highlighting the attention mechanism.]
Attention Mechanism
- Gate: g_t^i = G(c_t, m^{i−1}, q)
- The feature vector fed to G captures similarities between c, m, and q
- G: a two-layer feed-forward neural network
Ask Me Anything: Dynamic Memory Networks for Natural Language
Processing [Kumar, 2015]
19
[Diagram: the same DMN pipeline, highlighting the episodic memory update.]
Episodic Memory Module
- Iterates over the input representations while updating the episodic memory e^i
- Attention mechanism + recurrent network → update the memory m^i
- Episodic memory update: h_t^i = g_t^i GRU(c_t, h_{t−1}^i) + (1 − g_t^i) h_{t−1}^i, e^i = h_{T_C}^i
- Memory update: m^i = GRU(e^i, m^{i−1})
Ask Me Anything: Dynamic Memory Networks for Natural Language
Processing [Kumar, 2015]
20
[Diagram: the same DMN pipeline across multiple passes.]
Multiple Episodes (loop sketched after this list)
- Allow the model to attend to different inputs during each pass
- Allow a type of transitive inference, since the first pass may uncover the need to retrieve additional facts
  Q: Where is the football?
  C1: John put down the football.
  Only once the model sees C1 does John become relevant, and it can then reason that the second iteration should retrieve where John was.
Criteria for Stopping
- Append a special end-of-passes representation to the input c
- Stop if this representation is chosen by the gate function
- Otherwise set a maximum number of iterations
- This is why the model is called a Dynamic Memory Network
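A sketch of the multi-pass episodic memory loop: each pass re-attends to the facts given the current memory, produces an episode e^i, and updates the memory with m^i = GRU(e^i, m^{i−1}). The attention and memory GRU below are toy stand-ins for the components sketched earlier; initializing m^0 with the question follows the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_facts, n_passes = 16, 6, 3
facts, q = rng.normal(size=(n_facts, d)), rng.normal(size=d)

def episode(facts, m_prev, q):
    # toy attention + pooling over facts, conditioned on (m_prev, q)
    scores = facts @ (m_prev + q)
    alpha = np.exp(scores - scores.max()); alpha /= alpha.sum()
    return alpha @ facts

def memory_gru(e_i, m_prev):
    return np.tanh(0.5 * e_i + 0.5 * m_prev)   # stand-in for GRU(e^i, m^{i-1})

m = q.copy()                                   # m^0 initialized with the question
for _ in range(n_passes):
    e = episode(facts, m, q)
    m = memory_gru(e, m)
print(m.shape)                                 # final memory feeds the answer module
```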
Ask Me Anything: Dynamic Memory Networks for Natural Language
Processing [Kumar, 2015]
21
[Diagram: the same DMN pipeline, highlighting the answer module.]
Answer Module
- Triggered either once, at the end of the episodic memory, or at each time step
- Concatenates the last generated word and the question vector as the input at each time step
- Trained with a cross-entropy error
Training Details
- Adam optimization
- L2 regularization, dropout on the word embeddings (GloVe)
bAbI dataset
- Objective function: J = α E_CE(Gates) + β E_CE(Answers)
- Gate supervision aims to select one sentence per pass
- Without supervision: a GRU over c_t gives h_t^i and e^i = h_{T_C}^i
- With supervision (simpler): e^i = Σ_{t=1}^{T_C} softmax(g_t^i) c_t, where softmax(g_t^i) = exp(g_t^i) / Σ_j exp(g_j^i) and g_t^i is the value before the sigmoid (see the sketch below)
- Better results, because the softmax encourages sparsity and is suited to picking one sentence
22
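A short numpy sketch of the supervised episode computation above, with random stand-in gate values.

```python
import numpy as np

rng = np.random.default_rng(5)
T, d = 6, 16
c = rng.normal(size=(T, d))             # fact representations c_t
g = rng.normal(size=T)                  # pre-sigmoid gate values g_t^i for pass i

w = np.exp(g - g.max()); w /= w.sum()   # softmax over time encourages picking one sentence
e_i = w @ c                             # episode e^i = sum_t softmax(g_t^i) c_t
```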
Training Details
Stanford Sentiment Treebank (Sentiment Analysis)
- Use all full sentences, subsample 50% of phrase-level labels every epoch
- Only evaluated on the full sentences
- Binary classification, neutral phrases are removed from the dataset
- Trained with GRU sequence models
23
Training Details
24
Dynamic Memory Networks for Visual and Textual Question
Answering [Xiong 2016]
25
Several	design	choices	are motivated	by	intuition	and accuracy	improvements
Input Module in DMN
- A single GRU embeds the story and stores the hidden states
- The GRU provides a temporal component by letting each sentence see the content of the sentences that came before it
- Cons:
- The GRU only gives sentences context from the sentences before them, not after them
- Supporting sentences may be too far away from each other to interact
- Hence the input fusion layer of DMN+
26
Input Module in DMN+
Replacing a single GRU with two different components
1. Sentence reader: responsible only for encoding the words into a sentence
embedding
• Uses the positional encoder from E2E MN: f_i = Σ_j l_j ∘ A x_ij
• GRUs and LSTMs were also considered, but they required more computational resources and were prone to overfitting
2. Input fusion layer: allows content interaction between sentences (see the sketch below)
• A bi-directional GRU lets information flow from both past and future sentences
• Gradients do not need to propagate through the words between sentences
• Distant supporting sentences can interact more directly
27
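A numpy sketch of this input module under toy dimensions: positional sentence encoding followed by a simplified bidirectional recurrence whose forward and backward states are summed into the fused facts. The recurrent cell is a stand-in, not the exact parameterization of the paper.

```python
import numpy as np

rng = np.random.default_rng(6)
V, d, n_sents, J = 50, 20, 5, 6
A = rng.normal(size=(V, d))
story = rng.integers(0, V, size=(n_sents, J))

j = (np.arange(1, J + 1) / J)[:, None]
k = (np.arange(1, d + 1) / d)[None, :]
l = (1 - j) - k * (1 - 2 * j)                            # position-encoding weights l_j
f = np.stack([(l * A[s]).sum(axis=0) for s in story])    # sentence reader: f_i = sum_j l_j ∘ A x_ij

def gru_step(x, h):                                      # toy recurrent cell
    return np.tanh(0.5 * x + 0.5 * h)

def run_gru(seq):
    h, out = np.zeros(d), []
    for x in seq:
        h = gru_step(x, h)
        out.append(h)
    return np.stack(out)

fwd = run_gru(f)                                         # forward pass over sentences
bwd = run_gru(f[::-1])[::-1]                             # backward pass
facts = fwd + bwd                                        # fused facts F = [f_1 ... f_N]
print(facts.shape)
```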
Input Module for DMN+
28
Referenced	paper	:	A	Hierarchical	Neural	Autoencoder for	Paragraphs	and	Documents	[Li,	2015]
Episodic Memory Module in DMN+
- F↔ = [f_1, f_2, …, f_N]: output of the input module
- Interactions between the fact f_i and both the question q and the episode memory
state m^t
29
Attention Mechanism in DMN+
Use attention to extract a contextual vector c^t based on the current focus
1. Soft attention
• A weighted summation of the facts: c^t = Σ_{i=1}^N g_i^t f_i
• Can approximate hard attention by selecting a single fact f_i
• Cons: loses positional and ordering information
• Multiple attention passes can retrieve some of this information, but this is inefficient
30
Attention Mechanism in DMN+
2. Attention-based GRU (best)
- To keep position and ordering information an RNN is natural, but a standard GRU cannot use the attention gate g_i^t
- In the GRU, u_i is the update gate and r_i controls how much of h_{i−1} is retained
- Replace the update gate u_i (a vector) with g_i^t (a scalar), as sketched below
- This also makes it easy to visualize how the attention gates activate
- Use the final hidden state as c^t, which is used to update the episodic memory m^t
31
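A numpy sketch of the attention-based GRU step, with the scalar gate g_i^t in place of the update gate; the weights are random stand-ins and biases are omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(7)
d, N = 16, 6
facts = rng.normal(size=(N, d))
g = rng.random(N)                      # scalar attention gates g_i^t in [0, 1]
Wr, Ur = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
Wh, Uh = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1

h = np.zeros(d)
for f_i, g_i in zip(facts, g):
    r = sigmoid(Wr @ f_i + Ur @ h)                 # reset gate, unchanged
    h_tilde = np.tanh(Wh @ f_i + Uh @ (r * h))     # candidate state
    h = g_i * h_tilde + (1.0 - g_i) * h            # g_i^t plays the role of the update gate
c_t = h                                            # context vector for this pass
```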
Episode Memory Updates in DMN+
1. Untied and tied (better) GRU
m^t = GRU(c^t, m^{t−1})
2. Untied ReLU layer (best)
m^t = ReLU(W^t [m^{t−1}; c^t; q] + b)
32
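A short sketch of the untied ReLU update, taking "untied" to mean a separate W^t per pass; shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
d, n_passes = 16, 3
W = rng.normal(size=(n_passes, d, 3 * d)) * 0.1     # one W^t per pass (untied)
b = np.zeros(d)
q = rng.normal(size=d)
m = q.copy()                                        # m^0 initialized with the question
for t in range(n_passes):
    c_t = rng.normal(size=d)                        # context from the attention GRU (stand-in)
    m = np.maximum(0.0, W[t] @ np.concatenate([m, c_t, q]) + b)   # m^t = ReLU(W^t [m;c;q] + b)
```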
Training Details
- Adam optimization
- Xavier initialization is used for all weights except for the word embeddings
- L2 regularization on all weights except biases
- Dropout on the word embeddings (GloVe) and the answer module with p = 0.9
33
ABCNN: Attention-Based Convolutional Neural Network for Modeling
Sentence Pairs [Yin 2015]
34
ABCNN: Attention-Based Convolutional Neural Network for Modeling
Sentence Pairs [Yin 2015]
35
• Most prior work on answer selection models each sentence separately and
neglects their mutual influence
• Humans focus on key parts of s_0 by extracting parts of s_1 related to it by
identity, synonymy, antonymy, etc.
• ABCNN: takes the interdependence between s_0 and s_1 into account
• Convolution layers: increase the abstraction of a phrase beyond single words
ABCNN: Attention-Based Convolutional Neural Network for Modeling
Sentence Pairs [Yin 2015]
36
1. Input embedding with word2vec
2-1. Convolution layer with wide convolution
• So that each word v_i is detected by all weights in W
2-2. Average pooling layer
• all-ap: column-wise averaging over all columns
• w-ap: column-wise averaging over windows of width w
3. Output layer with logistic regression
• The all-ap outputs of the non-final layers and the final layer are all forwarded to the output layer
ABCNN: Attention-Based Convolutional Neural Network for Modeling
Sentence Pairs [Yin 2015]
37
Attention on feature map (ABCNN-1)
• The attention values in row i of A give the attention distribution of the i-th unit of s_0 with respect to s_1
• A_{i,j} = match-score(F_{0,r}[:, i], F_{1,r}[:, j])
• match-score(x, y) = 1 / (1 + |x − y|)
• Generate the attention feature map F_{i,a} for each s_i (see the sketch below)
• Cons: needs more parameters
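A numpy sketch of ABCNN-1 under toy dimensions: the attention matrix from the match-score above, turned into attention feature maps by two weight matrices W0 and W1 (trainable in the model; random stand-ins here) that are stacked with the original maps as extra input channels to the convolution.

```python
import numpy as np

rng = np.random.default_rng(9)
d, n0, n1 = 10, 5, 7                      # feature dim, lengths of s0 and s1
F0, F1 = rng.normal(size=(d, n0)), rng.normal(size=(d, n1))

A = np.zeros((n0, n1))
for i in range(n0):
    for j in range(n1):
        # match-score(x, y) = 1 / (1 + ||x - y||)
        A[i, j] = 1.0 / (1.0 + np.linalg.norm(F0[:, i] - F1[:, j]))

W0, W1 = rng.normal(size=(d, n1)) * 0.1, rng.normal(size=(d, n0)) * 0.1
F0_att = W0 @ A.T                         # attention feature map for s0, shape (d, n0)
F1_att = W1 @ A                           # attention feature map for s1, shape (d, n1)
print(F0_att.shape, F1_att.shape)
```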
ABCNN: Attention-Based Convolutional Neural Network for Modeling
Sentence Pairs [Yin 2015]
38
Attention after convolution (ABCNN-2)
• Attention weights are applied directly to the representation, with the aim of improving the
features computed by convolution
• a_{0,j} = Σ A[j, :] and a_{1,j} = Σ A[:, j] (row-wise and column-wise sums of A)
• w-ap is then applied to the re-weighted convolution features (sketch below)
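A numpy sketch of ABCNN-2: the row- and column-wise sums of A act as per-unit weights that re-weight the convolution output inside each w-ap window. Dimensions and the attention matrix are toy stand-ins.

```python
import numpy as np

rng = np.random.default_rng(10)
d, n0, n1, w = 10, 5, 7, 3
conv0, conv1 = rng.normal(size=(d, n0)), rng.normal(size=(d, n1))
A = rng.random((n0, n1))                  # attention matrix as in ABCNN-1

a0 = A.sum(axis=1)                        # a_{0,j}: attention weight of unit j in s0
a1 = A.sum(axis=0)                        # a_{1,j}: attention weight of unit j in s1

# w-ap: attention-weighted sum over each window of width w (valid windows only)
pooled0 = np.stack(
    [(conv0[:, j:j + w] * a0[j:j + w]).sum(axis=1) for j in range(n0 - w + 1)],
    axis=1,
)
print(pooled0.shape)
```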
ABCNN: Attention-Based Convolutional Neural Network for Modeling
Sentence Pairs [Yin 2015]
39
ABCNN-1 vs. ABCNN-2
- ABCNN-1: indirect impact on the convolution; needs extra parameters (the attention feature maps), so it is more vulnerable to overfitting; handles smaller-granularity units (e.g. word level)
- ABCNN-2: direct influence via pooling (weighted attention); needs no extra parameters; handles larger-granularity units (e.g. phrase level, phrase size = window size)
ABCNN-3: combines both
ABCNN: Attention-Based Convolutional Neural Network for Modeling
Sentence Pairs [Yin 2015]
40
Empirical Study on Deep Learning Models for QA [Yu 2015]
41
Empirical Study on Deep Learning Models for QA [Yu 2015]
42
The first work to examine Neural Turing Machines on QA problems
Splits QA into two steps:
1. Search for supporting facts
2. Generate the answer from the relevant pieces of information
NTM
• A single-layer LSTM network as the controller
• Input: word embeddings
1. Supporting facts only
2. Facts highlighted: markers annotate the beginning and end of the supporting facts
• Output: a softmax layer (multiclass classification) over answers
Teaching Machines to Read and Comprehend [Hermann 2015]
43
Teaching Machines to Read and Comprehend [Hermann 2015]
44
[Figure: Attentive Reader attention heat maps over the document.]
The document representation is the attention-weighted sum of the token embeddings, r = Σ_t s(t) y_d(t), where
s(t) is the degree to which the network
attends to a particular token in the
document when answering the query
(soft attention)
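A hedged numpy sketch of this soft attention, following the Attentive Reader formulation m(t) = tanh(W_ym y_d(t) + W_um u), s(t) ∝ exp(w_ms^⊤ m(t)); the token and query embeddings and all weights are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(11)
d, T = 16, 9
y = rng.normal(size=(T, d))            # document token embeddings y_d(t)
u = rng.normal(size=d)                 # query embedding
W_ym, W_um = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
w_ms = rng.normal(size=d) * 0.1

m = np.tanh(y @ W_ym.T + W_um @ u)     # m(t) = tanh(W_ym y_d(t) + W_um u)
scores = m @ w_ms
s = np.exp(scores - scores.max()); s /= s.sum()   # s(t): soft attention over tokens
r = s @ y                              # r = sum_t s(t) y_d(t)
```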
Text Understanding with the Attention Sum Reader Network [Kadlec 2016]
45
The answer should appear in the context
Inspired by Pointer Networks
Contrast with the Attentive Reader:
• The answer is selected directly from the
context, using a weighted sum over the
individual token representations (see the sketch below)
[Figure: Attention Sum Reader architecture, compared with the Attentive Reader.]
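A hedged numpy sketch of the attention-sum mechanism: softmax attention over document positions, then candidate probabilities obtained by summing the attention mass over each candidate's occurrences. Embeddings and word ids are toy stand-ins.

```python
import numpy as np

rng = np.random.default_rng(12)
d, T = 16, 10
doc_words = np.array([3, 7, 2, 7, 5, 1, 7, 4, 2, 6])   # word ids in the document
f = rng.normal(size=(T, d))                            # contextual token embeddings f_i
g = rng.normal(size=d)                                 # query embedding g

scores = f @ g
alpha = np.exp(scores - scores.max()); alpha /= alpha.sum()   # position-level attention

candidates = [7, 2, 5]
probs = {w: float(alpha[doc_words == w].sum()) for w in candidates}  # attention sum per candidate
answer = max(probs, key=probs.get)
print(probs, "->", answer)
```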
Stochastic Latent Variable
46
[Diagrams: latent-variable graphical models z → x (generative) and (x, z) → y (conditional generative).]

Generative model:
p(x) = Σ_z p(x, z) = Σ_z p(x|z) p(z)    or    p(x) = ∫_z p(x, z) dz = ∫_z p(x|z) p(z) dz

Conditional generative model:
p(y|x) = Σ_z p(y|z, x) p(z|x)    or    p(y|x) = ∫_z p(y|z, x) p(z|x) dz
Variational Inference Framework
47
p(x, z) = p(x|z) p(z) = Σ_h p(x|h) p(h|z) p(z)

log p_θ(x, z) = log ∫_h [q(h)/q(h)] p(x|h) p(h|z) p(z) dh
 ≥ ∫_h q(h) log [ p(x|h) p(h|z) p(z) / q(h) ] dh
 = ∫_h q(h) log [ p(x|h) p(h|z) / q(h) ] dh + ∫_h q(h) log [ p(z) / q(h) ] dh
 = E_q(h)[ log p(x|h) p(h|z) − log q(h) ] − D_KL( q(h) ∥ p(z) )
 = E_q(h)[ log p(x|h) p(h|z) p(z) − log q(h) ]
Variational Inference Framework
48
p_θ(x, z) = p_θ(x|z) p(z) = Σ_h p_θ(x|h) p_θ(h|z) p(z)

log p_θ(x, z) = log ∫_h [q(h)/q(h)] p_θ(x|h) p_θ(h|z) p(z) dh
 ≥ ∫_h q(h) log [ p_θ(x|h) p_θ(h|z) p(z) / q(h) ] dh        (Jensen's inequality)
 = ∫_h q(h) log [ p_θ(x|h) p_θ(h|z) / q(h) ] dh + ∫_h q(h) log [ p(z) / q(h) ] dh
 = E_q(h)[ log p_θ(x|h) p_θ(h|z) − log q(h) ] − D_KL( q(h) ∥ p(z) )
 = E_q(h)[ log p_θ(x|h) p_θ(h|z) − log q(h) ]        a tight lower bound if q(h) = p(h|x, z)
Conditional Variational Inference Framework
49
p_θ(y|x) = Σ_z p_θ(y, z|x) = Σ_z p_θ(y|x, z) p_φ(z|x)

log p(y|x) = log ∫_z [q(z)/q(z)] p(y|z, x) p(z|x) dz
 ≥ ∫_z q(z) log [ p(y|z, x) p(z|x) / q(z) ] dz        (Jensen's inequality)
 = ∫_z q(z) log [ p(y|z, x) / q(z) ] dz + ∫_z q(z) log [ p(z|x) / q(z) ] dz
 = ∫_z q(z) log p(y|z, x) dz − ∫_z q(z) log q(z) dz + ∫_z q(z) log [ p(z|x) / q(z) ] dz
 = E_q(z)[ log p(y|z, x) − log q(z) ] − D_KL( q(z) ∥ p(z|x) )
 = E_q(z)[ log p(y|z, x) − log q(z) ]        a tight lower bound if q(z) = p(z|x)
Neural Variational Inference Framework
log p_θ(y|x) ≥ E_q(z)[ log p(y|z, x) − log q(z) ] − D_KL( q(z) ∥ p(z|x) ) = L

1. Vector representations of the observed variables: u = f_z(z), v = f_x(x)
2. Joint representation (concatenation): π = g(u, v)
3. Parameterize the variational distribution: μ = l_1(π), σ = l_2(π)
50
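A numpy sketch of this recipe for a single observation, assuming a diagonal-Gaussian q, a standard-normal prior, and toy linear networks. It draws a sample with the reparameterization trick of [Kingma, 2013] and forms a one-sample estimate of the bound; the full framework conditions on both x and z, but only x is encoded here for brevity.

```python
import numpy as np

rng = np.random.default_rng(13)
d_x, d_z, d_h = 12, 4, 16
x = rng.normal(size=d_x)

f_x = rng.normal(size=(d_h, d_x)) * 0.1          # encoder network: v = f_x(x)
l1 = rng.normal(size=(d_z, d_h)) * 0.1           # mu = l1(pi)
l2 = rng.normal(size=(d_z, d_h)) * 0.1           # log sigma = l2(pi)
W_dec = rng.normal(size=(d_x, d_z)) * 0.1        # toy decoder for p(x|z)

pi = np.tanh(f_x @ x)                            # joint representation (only x observed here)
mu, log_sigma = l1 @ pi, l2 @ pi
eps = rng.normal(size=d_z)
z = mu + np.exp(log_sigma) * eps                 # reparameterized sample z ~ q(z)

# closed-form KL(q(z) || N(0, I)) for a diagonal Gaussian
kl = 0.5 * np.sum(np.exp(2 * log_sigma) + mu ** 2 - 1.0 - 2 * log_sigma)
log_lik = -0.5 * np.sum((x - W_dec @ z) ** 2)    # toy Gaussian log-likelihood (up to a constant)
elbo = log_lik - kl                              # one-sample estimate of the lower bound
print(round(float(elbo), 3))
```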
Neural Variational Document Model [Miao, 2015]
51