Algorithmic	Intelligence	Laboratory
Algorithmic	Intelligence	Laboratory
EE807:	Recent	Advances	in	Deep	Learning
Lecture	19
Slide	made	by	
Sangwoo	Mo
KAIST	EE
Advanced	Models	for	Language
1
Algorithmic	Intelligence	Laboratory
1. Introduction
• Why	deep	learning	for	NLP?
• Overview	of	the	lecture
2. Network	Architecture
• Learning	long-term	dependencies
• Improve	softmax	layers
3. Training	Methods
• Reduce	exposure	bias
• Reduce	loss/evaluation	mismatch
• Extension	to	unsupervised	setting
Table	of	Contents
2
Algorithmic	Intelligence	Laboratory
1. Introduction
• Why	deep	learning	for	NLP?
• Overview	of	the	lecture
2. Network	Architecture
• Learning	long-term	dependencies
• Improve	softmax	layers
3. Training	Methods
• Reduce	exposure	bias
• Reduce	loss/evaluation	mismatch
• Extension	to	unsupervised	setting
Table	of	Contents
3
Algorithmic	Intelligence	Laboratory
Why	Deep	Learning	for	Natural	Language	Processing	(NLP)?
• Deep	learning is	now	commonly	used in	natural	language	processing	(NLP)
*Source:	Young	et	al.	“Recent	Trends	in	Deep	Learning	Based	Natural	Language	Processing”,	arXiv	2017 4
Algorithmic	Intelligence	Laboratory
Recap:	RNN	&	CNN	for	Sequence	Modeling
• Language is	sequential: It	is	natural	to	use	RNN	architectures
• RNN (or	LSTM	variants)	is	a	natural	choice	for	sequence	modelling
• Language is	translation-invariant: It	is	natural	to	use	CNN	architectures
• One	can	use	CNN [Gehring	et	al.,	2017]	for	parallelization
*Source: https://towardsdatascience.com/introduction-to-recurrent-neural-network-27202c3945f3
Gehring	et	al.	“Convolutional	Sequence	to	Sequence	Learning”,	ICML	2017 5
Algorithmic	Intelligence	Laboratory
Limitations	of	prior	works
• However,	prior	works have	several	limitations…
• Network	architecture
• Long-term	dependencies:	Network	forgets previous	information	as	it	summarizes	
inputs	into	a	single feature	vector
• Limitations of softmax: The computation increases linearly with the vocabulary size,
and the expressivity is bounded by the feature dimension
• Training	methods
• Exposure bias: The model only sees true tokens during training, but sees its own generated
tokens at inference (and the noise accumulates sequentially)
• Loss/evaluation mismatch: The model is trained with an MLE objective, but evaluated with other
metrics (e.g., the BLEU score [Papineni et al., 2002]) at inference
• Unsupervised setting: How can we train models when there is no paired data?
6
Algorithmic	Intelligence	Laboratory
1. Introduction
• Why	deep	learning	for	NLP?
• Overview	of	the	lecture
2. Network	Architecture
• Learning	long-term	dependencies
• Improve	softmax	layers
3. Training	Methods
• Reduce	exposure	bias
• Reduce	loss/evaluation	mismatch
• Extension	to	unsupervised	setting
Table	of	Contents
7
Algorithmic	Intelligence	Laboratory
Attention	[Bahdanau	et	al.,	2015]
• Motivation:
• Previous	models	summarize inputs	into	a	single feature	vector
• Hence,	the	model	forgets old	inputs,	especially	for	long sequences
• Idea:
• Use all the input features, but attend to the most important ones
• Example) Translate "Ich möchte ein Bier" ⇔ "I'd like a beer"
• Here, when the model generates "beer", it should attend to "Bier"
8*Source: https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/10/06/attention/
Algorithmic	Intelligence	Laboratory
Attention	[Bahdanau	et	al.,	2015]
• Method:
• Task: Translate a source sequence $[x_1, \ldots, x_T]$ to a target sequence $[y_1, \ldots, y_{T'}]$
• Now the decoder hidden state $s_t$ is a function of the previous state $s_{t-1}$, the current input
$\hat{y}_{t-1}$, and the context vector $c_t$, i.e., $s_t = f(s_{t-1}, \hat{y}_{t-1}, c_t)$
9*Source: https://medium.com/syncedreview/a-brief-overview-of-attention-mechanism-13c578ba9129
[Figure: one decoder step, showing the context vector $c_t$, the states $s_{t-1}, s_t$, and the input $\hat{y}_{t-1}$]
Algorithmic	Intelligence	Laboratory
Attention	[Bahdanau	et	al.,	2015]
• Method:
• Task: Translate a source sequence $[x_1, \ldots, x_T]$ to a target sequence $[y_1, \ldots, y_{T'}]$
• Now the decoder hidden state $s_t$ is a function of the previous state $s_{t-1}$, the current input
$\hat{y}_{t-1}$, and the context vector $c_t$, i.e., $s_t = f(s_{t-1}, \hat{y}_{t-1}, c_t)$
• The context vector $c_t$ is a linear combination of the input hidden features $[h_1, \ldots, h_T]$, i.e., $c_t = \sum_{i=1}^{T} \alpha_{t,i} h_i$
• Here, the weight $\alpha_{t,i} = \mathrm{softmax}_i\big(\mathrm{score}(s_{t-1}, h_i)\big)$ is the alignment score of the two words $y_t$ and $x_i$,
where the score function is also jointly trained, e.g., an additive form $\mathrm{score}(s_{t-1}, h_i) = v^\top \tanh(W_s s_{t-1} + W_h h_i)$ (see the code sketch below)
10
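To make the computation concrete, here is a minimal PyTorch sketch of one additive-attention step (module and variable names are illustrative, not from the paper): it scores each encoder state $h_i$ against the previous decoder state $s_{t-1}$, normalizes the scores into weights $\alpha_{t,i}$, and returns the context vector $c_t = \sum_i \alpha_{t,i} h_i$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Additive (Bahdanau-style) attention: score(s, h_i) = v^T tanh(W_s s + W_h h_i)."""
    def __init__(self, dim):
        super().__init__()
        self.W_s = nn.Linear(dim, dim, bias=False)
        self.W_h = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, 1, bias=False)

    def forward(self, s_prev, enc_states):
        # s_prev: (batch, dim) previous decoder state; enc_states: (batch, T, dim) encoder features
        scores = self.v(torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_h(enc_states)))  # (batch, T, 1)
        alpha = F.softmax(scores.squeeze(-1), dim=1)           # alignment weights alpha_{t,i}
        context = torch.bmm(alpha.unsqueeze(1), enc_states)    # c_t = sum_i alpha_{t,i} h_i
        return context.squeeze(1), alpha

# Usage: the context vector feeds the decoder cell together with s_{t-1} and y_hat_{t-1}.
attn = AdditiveAttention(dim=256)
h = torch.randn(4, 10, 256)   # encoder states for a length-10 source sentence (batch of 4)
s = torch.randn(4, 256)       # previous decoder state
c, alpha = attn(s, h)         # c: (4, 256), alpha: (4, 10)
```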
Algorithmic	Intelligence	Laboratory
Attention	[Bahdanau	et	al.,	2015]
• Results:	Attention	shows	good	correlation between	source	and	target
11
Algorithmic	Intelligence	Laboratory
Attention	[Bahdanau	et	al.,	2015]
• Results:	Attention	improves machine	translation	performance
• RNNenc:	no	attention	/	RNNsearch:	with	attention	/	#:	max	length	of	train	data
12
No	UNK:	omit	unknown	words
*: trained for longer, until convergence
Algorithmic	Intelligence	Laboratory
Show,	Attend,	and	Tell	[Xu	et	al.,	2015]
• Motivation: Can we apply attention to image captioning?
• Task: Translate a source image $x$ to a target sequence $[y_1, \ldots, y_{T'}]$
• Now attend to specific locations in the image, rather than to words
• Idea: Apply attention to the convolutional features $[h_1, \ldots, h_L]$ (with $K$ channels)
• Apply deterministic soft attention (as before) and stochastic hard attention
(pick one $h_i$ by sampling from a multinomial distribution with parameter $\alpha$)
• Hard attention picks a more specific area and shows better results, but training is
less stable due to the stochasticity and non-differentiability (see the code sketch below)
13
Up:	hard	attention	/	Down:	soft	attention
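A minimal sketch of the two attention variants over the $L$ convolutional feature vectors, assuming the weights $\alpha$ have already been produced by an attention network (function names are illustrative): soft attention takes the expected feature and stays differentiable, while hard attention samples one location, so its gradient must be estimated with REINFORCE-style methods.

```python
import torch

def soft_attention(alpha, feats):
    # feats: (batch, L, d) annotation vectors; alpha: (batch, L) attention weights
    # deterministic: return the expected feature, differentiable w.r.t. alpha
    return torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)

def hard_attention(alpha, feats):
    # stochastic: sample one location per example; not differentiable w.r.t. alpha,
    # so training uses the returned log-probability in a REINFORCE-style estimator
    dist = torch.distributions.Categorical(probs=alpha)
    idx = dist.sample()                                # (batch,)
    picked = feats[torch.arange(feats.size(0)), idx]   # (batch, d)
    return picked, dist.log_prob(idx)
```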
Algorithmic	Intelligence	Laboratory
Show,	Attend,	and	Tell	[Xu	et	al.,	2015]
• Results:	Attention	picks	visually	plausible	locations
14
Algorithmic	Intelligence	Laboratory
Show,	Attend,	and	Tell	[Xu	et	al.,	2015]
• Results:	Attention	improves the	image	captioning	performance
15
Algorithmic	Intelligence	Laboratory
Transformer	[Vaswani	et	al.,	2017]	
• Motivation:
• Prior	works	use	RNN/CNN	to	solve	sequence-to-sequence problems
• Attention already handles sequences of arbitrary length, is easy to parallelize, and does
not suffer from forgetting problems… so why should one use RNN/CNN modules at all?
• Idea:
• Design	architecture	only	using attention modules
• To extract features, the authors use self-attention, in which the features attend to themselves
• Self-attention	has	many	advantages	over	RNN/CNN	blocks
16
𝑛: sequence	length,	𝑑:	feature	dimension,	𝑘:	(conv)	kernel	size,	𝑟:	window	size	to	consider
Maximum path length: the maximum traversal length between any two input/output positions (lower is better)
*Cf.	Now	self-attention	is	widely	used	in	other	architectures,	e.g.,	CNN	[Wang	et	al.,	2018]	or	GAN	[Zhang	et	al.,	2018]
Algorithmic	Intelligence	Laboratory
Transformer	[Vaswani	et	al.,	2017]	
• Multi-head	attention:	The	building	block	of	the	Transformer
• In the previous slides, we introduced additive attention [Bahdanau et al., 2015]
• Here, the context vector is a linear combination of
• the weight $\alpha_{t,i}$, a function of the inputs $[x_i]$ and the output $y_t$,
• and the input hidden states $[h_i]$
• In general, attention is a function of a key $K$, a value $V$, and a query $Q$
• The key $[x_i]$ and the query $y_t$ define the weights $\alpha_{t,i}$, which are applied to the value $[h_i]$
• For sequence length $T$ and feature dimension $d$, $(K, V, Q)$ are $T \times d$, $T \times d$, and $1 \times d$ matrices
• The Transformer uses scaled dot-product attention, $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$ (see the code sketch below)
• In addition, the Transformer uses multi-head attention,
an ensemble of attention heads
17
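A minimal sketch of scaled dot-product attention as defined above (shapes are illustrative); multi-head attention simply runs several such heads on linearly projected $Q$, $K$, $V$ and concatenates the results (PyTorch also packages this as torch.nn.MultiheadAttention).

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (..., T_q, d_k), K: (..., T_k, d_k), V: (..., T_k, d_v)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))   # (..., T_q, T_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # e.g., causal mask in the decoder
    return F.softmax(scores, dim=-1) @ V                        # (..., T_q, d_v)
```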
Algorithmic	Intelligence	Laboratory
Transformer	[Vaswani	et	al.,	2017]	
• Transformer:
• The final Transformer model is built upon the (multi-head) attention blocks
• First, extract features with self-attention (see the lower part of the block)
• Then decode the features with the usual attention (see the middle part of the block)
• Since the model has no sequential structure, the authors add a positional encoding
(a handcrafted feature that represents each position in the sequence; see the code sketch below)
18
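A minimal sketch of the sinusoidal positional encoding used in the paper, $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d})$, which is added to the token embeddings before the first attention block (assumes an even model dimension).

```python
import math
import torch

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding (d_model assumed even); returns a (max_len, d_model) table."""
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)                   # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))                             # (d_model/2,)
    pe[:, 0::2] = torch.sin(pos * div)    # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)    # odd dimensions
    return pe                             # added to the token embeddings before the first block
```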
Algorithmic	Intelligence	Laboratory
Transformer	[Vaswani	et	al.,	2017]	
• Results: The Transformer architecture shows strong performance on machine translation benchmarks
19
Algorithmic	Intelligence	Laboratory
BERT [Devlin et al., 2018]
• Motivation:
• Much of the success of CNNs comes from ImageNet-pretrained networks
• Can we train a universal encoder for natural language?
• Method:
• BERT (bidirectional encoder representations from transformers): Design a neural
network based on a bidirectional Transformer, and use it as a pretrained model
• Pretrain with two tasks (masked language modeling, next sentence prediction)
• Reuse the pretrained BERT encoder with a simple 1-layer output head for each task; the head (and optionally the encoder) is fine-tuned (see the code sketch below)
20
Sentence classification / Question answering
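A minimal sketch of the "pretrained encoder + small task head" recipe, assuming the Hugging Face transformers package and a sentence-classification task (the model name, the head, and the choice to freeze the encoder are illustrative; in practice the encoder itself is often fine-tuned as well).

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer  # assumption: Hugging Face transformers is installed

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(encoder.config.hidden_size, 2)      # simple 1-layer task head (e.g., sentiment)

inputs = tokenizer("a minimal example sentence", return_tensors="pt")
with torch.no_grad():                                      # encoder kept fixed here for brevity;
    cls = encoder(**inputs).last_hidden_state[:, 0]        # [CLS] representation of the sentence
logits = classifier(cls)                                   # train the head with cross-entropy on task labels
```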
Algorithmic	Intelligence	Laboratory
BERT [Devlin et al., 2018]
• Results:
• Even without complex task-specific architectures, BERT achieves SOTA on 11 NLP
tasks, including classification, question answering, tagging, etc.
21
Algorithmic	Intelligence	Laboratory
1. Introduction
• Why	deep	learning	for	NLP?
• Overview	of	the	lecture
2. Network	Architecture
• Learning	long-term	dependencies
• Improve	softmax	layers
3. Training	Methods
• Reduce	exposure	bias
• Reduce	loss/evaluation	mismatch
• Extension	to	unsupervised	setting
Table	of	Contents
22
Algorithmic	Intelligence	Laboratory
Adaptive	Softmax	[Grave	et	al.,	2017]
• Motivation:	
• Computation	of	softmax is	expensive,	especially	for	large	vocabularies
• Hierarchical	softmax	[Mnih	&	Hinton,	2009]:
• Cluster $k$ words into $\sqrt{k}$ balanced groups, which reduces the complexity to $O(\sqrt{k})$
• For hidden state $h$, word $w$, and cluster $C(w)$, the softmax factorizes as $P(w \mid h) = P(C(w) \mid h)\, P(w \mid C(w), h)$
• One can repeat the clustering for subtrees (i.e., build a balanced $n$-ary tree), which
reduces the complexity to $O(\log k)$
23*Source: http://opendatastructures.org/versions/edition-0.1d/ods-java/node40.html
Algorithmic	Intelligence	Laboratory
Adaptive	Softmax	[Grave	et	al.,	2017]
• Limitation	of	prior	works	&	Proposed	idea:
• Cluster $k$ words into $\sqrt{k}$ balanced groups, which reduces the complexity to $O(\sqrt{k})$
• One can repeat the clustering for subtrees, which reduces the complexity to $O(\log k)$
• However, putting all words at the leaves drops the performance (by around 5-10%)
• Instead, one can put frequent words in front (similar to Huffman coding)
• Put the top $k_h$ words (covering $p_h$ of the frequency mass) and tokens "NEXT-$i$" in the first layer, and
put $k_i$ words ($p_i$ of the frequency mass) in the next layers (see the code sketch below)
24
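PyTorch ships an implementation of this frequency-based scheme as torch.nn.AdaptiveLogSoftmaxWithLoss; a minimal sketch with illustrative cutoffs (word indices are assumed to be sorted by frequency, so low indices form the head).

```python
import torch
import torch.nn as nn

vocab_size, hidden_dim = 50000, 512
# Illustrative cutoffs: words [0, 2000) form the frequent head; [2000, 10000) and
# [10000, 50000) form two rare-word tail clusters reached via "NEXT" tokens.
adaptive = nn.AdaptiveLogSoftmaxWithLoss(hidden_dim, vocab_size, cutoffs=[2000, 10000])

h = torch.randn(32, hidden_dim)                  # decoder hidden states for a batch of 32 positions
targets = torch.randint(0, vocab_size, (32,))    # next-token indices (ids sorted by frequency)
out = adaptive(h, targets)                       # out.output: per-example target log-probs, out.loss: mean NLL
```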
Algorithmic	Intelligence	Laboratory
Adaptive	Softmax	[Grave	et	al.,	2017]
• Limitation	of	prior	works	&	Proposed	idea:
• Put the top $k_h$ words (covering $p_h$ of the frequency mass) and tokens "NEXT-$i$" in the first layer, and
put $k_i$ words ($p_i$ of the frequency mass) in the next layers
• Let $g(k, B)$ be the computation time for $k$ words and batch size $B$
• Then the computation time of adaptive softmax (with $J$ clusters) is roughly $g(k_h + J, B) + \sum_i g(k_i, p_i B)$
• For $k, B$ larger than some threshold, one can simply assume $g(k, B) = kB$ (see the paper for details)
• By solving the optimization problem (for the $k_i$ and $J$), the model is 3-5x faster than
the original softmax (in practice, $J = 5$ works well)
25
Algorithmic	Intelligence	Laboratory
Adaptive	Softmax	[Grave	et	al.,	2017]
• Results:	Adaptive	softmax	shows	comparable	results to	the	original	softmax	
(while	much	faster)
26
ppl:	perplexity	(lower	is	better)
Algorithmic	Intelligence	Laboratory
Mixture	of	Softmax	[Yang	et	al.,	2018]
• Motivation:
• The rank of the softmax layer is bounded by the feature dimension $d$
• Recall: By the definition of softmax, $P(x \mid c) = \dfrac{\exp(\mathbf{h}_c^\top \mathbf{w}_x)}{\sum_{x'} \exp(\mathbf{h}_c^\top \mathbf{w}_{x'})}$,
we have $\mathbf{h}_c^\top \mathbf{w}_x = \log P(x \mid c) + \text{const}$ (which is called the logit)
• Let $N$ be the number of possible contexts and $M$ the vocabulary size; then the $N \times M$ logit matrix factorizes as $\mathbf{H}\mathbf{W}^\top$ with $\mathbf{H} \in \mathbb{R}^{N \times d}$ and $\mathbf{W} \in \mathbb{R}^{M \times d}$,
which implies that softmax can represent at most rank $d$ (while the true log-probability matrix $\mathbf{A}$ can have much higher rank)
27*Source: https://www.facebook.com/iclr.cc/videos/2127071060655282/
Algorithmic	Intelligence	Laboratory
Mixture	of	Softmax	[Yang	et	al.,	2018]
• Motivation:
• The rank of the softmax layer is bounded by the feature dimension $d$
• Naïvely increasing the dimension $d$ to the vocabulary size $M$ is inefficient
• Idea:
• Use a mixture of softmaxes (MoS): $P(x \mid c) = \sum_{k=1}^{K} \pi_{c,k}\, \dfrac{\exp(\mathbf{h}_{c,k}^\top \mathbf{w}_x)}{\sum_{x'} \exp(\mathbf{h}_{c,k}^\top \mathbf{w}_{x'})}$
• It is easily implemented by defining $\pi_{c,k}$ and $\mathbf{h}_{c,k}$ as functions of the original $\mathbf{h}$ (see the code sketch below)
• Note that now $\log P(x \mid c)$
is a nonlinear (log-sum-exp) function of $\mathbf{h}$ and $\mathbf{w}$, hence it can represent a high-rank matrix
28
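A minimal sketch of a mixture-of-softmaxes output layer with $K$ components (layer names and shapes are illustrative): the mixing weights $\pi_{c,k}$ and the component contexts $\mathbf{h}_{c,k}$ are both computed from the original hidden state, and the output is the log of the mixture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxes(nn.Module):
    def __init__(self, hidden_dim, vocab_size, n_components=5):
        super().__init__()
        self.K = n_components
        self.prior = nn.Linear(hidden_dim, n_components)                 # mixing weights pi_k
        self.latent = nn.Linear(hidden_dim, n_components * hidden_dim)   # per-component contexts h_k
        self.decoder = nn.Linear(hidden_dim, vocab_size)                 # shared output word embeddings

    def forward(self, h):
        # h: (batch, hidden_dim) -> log P(x | context): (batch, vocab_size)
        log_pi = F.log_softmax(self.prior(h), dim=-1)                     # (batch, K)
        h_k = torch.tanh(self.latent(h)).view(-1, self.K, h.size(-1))     # (batch, K, hidden_dim)
        log_p = F.log_softmax(self.decoder(h_k), dim=-1)                  # (batch, K, vocab_size)
        return torch.logsumexp(log_pi.unsqueeze(-1) + log_p, dim=1)       # log sum_k pi_k softmax_k
```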
Algorithmic	Intelligence	Laboratory
Mixture	of	Softmax	[Yang	et	al.,	2018]
• Results: MoS learns the full rank (= vocab size), while the softmax rank is bounded by $d$
• The empirical rank is measured by collecting all observed contexts & outputs
29
MoC:	mixture	of	contexts
(mixture	before softmax)
𝑑 = 400, 280, 280 for
Softmax,	MoC,	MoS,	respectively
Note	that	9981	is	full	rank
as	vocab	size	=	9981
Algorithmic	Intelligence	Laboratory
Mixture	of	Softmax	[Yang	et	al.,	2018]
• Results:	Simply	changing	Softmax	to	MoS	improves the	performance
• By	applying	MoS	to	SOTA	models,	the	authors	achieved	new	SOTA	records
30
Algorithmic	Intelligence	Laboratory
1. Introduction
• Why	deep	learning	for	NLP?
• Overview	of	the	lecture
2. Network	Architecture
• Learning	long-term	dependencies
• Improve	softmax	layers
3. Training	Methods
• Reduce	exposure	bias
• Reduce	loss/evaluation	mismatch
• Extension	to	unsupervised	setting
Table	of	Contents
31
Algorithmic	Intelligence	Laboratory
Scheduled	Sampling	[Bengio	et	al.,	2015]
• Motivation:	
• Teacher	forcing [Williams	et	al.,	1989]	is	widely	used	for	sequential	training
• It uses the real previous token and the current state to predict the current output
32
*Source: https://satopirka.com/2018/02/encoder-decoder%E3%83%A2%E3%83%87%E3%83%AB%E3%81%A8teacher-forcingscheduled-samplingprofessor-forcing/
Algorithmic	Intelligence	Laboratory
Scheduled	Sampling	[Bengio	et	al.,	2015]
• Motivation:	
• Teacher	forcing [Williams	et	al.,	1989]	is	widely	used	for	sequential	training
• It uses the real previous token and the current state to predict the current output
• However, the model uses its own predicted tokens at inference (a.k.a. exposure bias)
33
*Source: https://satopirka.com/2018/02/encoder-decoder%E3%83%A2%E3%83%87%E3%83%AB%E3%81%A8teacher-forcingscheduled-samplingprofessor-forcing/
Algorithmic	Intelligence	Laboratory
Scheduled	Sampling	[Bengio	et	al.,	2015]
• Motivation:	
• Teacher	forcing [Williams	et	al.,	1989]	is	widely	used	for	sequential	training
• It uses the real previous token and the current state to predict the current output
• However, the model uses its own predicted tokens at inference (a.k.a. exposure bias)
• Training with predicted tokens is not trivial, since (a) training becomes unstable, and (b) when
the previous token changes, the target should change as well
• Idea: Apply curriculum learning
• At the beginning, use real tokens, then slowly move to predicted tokens (see the code sketch below)
34
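A minimal sketch of a scheduled-sampling training loop (decoder_step and embed are placeholder callables, not a specific API): at each step the decoder is fed the gold token with probability eps and its own previous prediction otherwise, with eps annealed toward 0 over training.

```python
import random
import torch

def scheduled_sampling_step(decoder_step, embed, targets, h, eps):
    """targets: (T, batch) gold token ids; eps: probability of feeding the gold token (decays over training).
    decoder_step(emb, h) -> (logits, h) and embed(ids) -> embeddings are placeholder callables."""
    criterion = torch.nn.CrossEntropyLoss()
    inputs, loss = targets[0], 0.0
    for t in range(1, targets.size(0)):
        logits, h = decoder_step(embed(inputs), h)        # one decoder step
        loss = loss + criterion(logits, targets[t])
        predicted = logits.argmax(dim=-1)
        # one coin flip per step here for brevity; the paper samples per token
        inputs = targets[t] if random.random() < eps else predicted
    return loss
```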
Algorithmic	Intelligence	Laboratory
Scheduled	Sampling	[Bengio	et	al.,	2015]
• Results: Scheduled sampling improves over the baseline on many tasks
35
Image	captioning
Constituency	parsing
Algorithmic	Intelligence	Laboratory
Professor	Forcing	[Lamb	et	al.,	2016]
• Motivation:
• Scheduled sampling (SS) is known to optimize a wrong objective [Huszár et al., 2015]
• Idea:
• Make the features of predicted tokens similar to the features of true tokens
• To this end, train a discriminator that classifies features as coming from true or predicted tokens
• Teacher	forcing: use	real	tokens	/	Free	running: use	predicted	tokens
36
Algorithmic	Intelligence	Laboratory
Professor	Forcing	[Lamb	et	al.,	2016]
• Results:
• Professor forcing improves the generalization performance, especially for
long sequences (test samples are much longer than the training samples)
37
NLL	for	MNIST
generation
Human	evaluation
for	handwriting
generation
Algorithmic	Intelligence	Laboratory
1. Introduction
• Why	deep	learning	for	NLP?
• Overview	of	the	lecture
2. Network	Architecture
• Learning	long-term	dependencies
• Improve	softmax	layers
3. Training	Methods
• Reduce	exposure	bias
• Reduce	loss/evaluation	mismatch
• Extension	to	unsupervised	setting
Table	of	Contents
38
Algorithmic	Intelligence	Laboratory
MIXER	[Ranzato	et	al.,	2016]
• Motivation:
• Prior	works	use	word-level objectives	(e.g.,	cross-entropy)	for	training,	but	use	
sequence-level objectives	(e.g.,	BLEU	[Papineni	et	al.,	2002])	for	evaluation
• Idea: Directly optimize the model with a sequence-level objective (e.g., BLEU)
• Q. How can we backpropagate a (usually non-differentiable) sequence-level objective?
• Sequence	generation	is	a	kind	of	RL	problem
• state:	hidden	state,	action:	output,	policy:	generation	algorithm
• Sequence-level	objective	is	the	reward of	current	algorithm
• Hence,	one	can	use	policy	gradient (e.g.,	REINFORCE)	algorithm
• However,	the	gradient	estimator	of	REINFORCE	has	high	variance
• To reduce the variance, MIXER (mixed incremental cross-entropy reinforce) uses
MLE for the first $T'$ steps and REINFORCE for the remaining $T - T'$ steps ($T'$ is annealed toward zero; see the code sketch below)
• Cf.	One	can	also	use	other	variance	reduction	techniques,	e.g.,	actor-critic	[Bahdanau	et	al.,	2017]
39
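A minimal sketch of the REINFORCE loss for a sequence-level reward such as sentence BLEU (the reward and baseline are computed outside this function; names are illustrative): MIXER applies the usual cross-entropy loss on the first $T'$ steps and this loss on the remaining steps.

```python
import torch

def reinforce_loss(log_probs, reward, baseline=0.0):
    """log_probs: (T,) log pi(y_t | y_<t, x) of the tokens *sampled* from the model;
    reward: scalar sequence-level score (e.g., sentence BLEU); baseline (e.g., a critic) reduces variance."""
    advantage = reward - baseline
    # the gradient of this loss is -(R - b) * grad log pi(y), i.e., the REINFORCE estimator
    return -(advantage * log_probs.sum())
```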
Algorithmic	Intelligence	Laboratory
MIXER	[Ranzato	et	al.,	2016]
• Results:
• MIXER shows	better	performance than	other	baselines
• XENT	(=	cross	entropy):	another	name	of	maximum	likelihood	estimation	(MLE)
• DAD	(=	data	as	demonstrator):	another	name	of	scheduled	sampling
• E2D	(=	end-to-end	backprop):	use	top-K	vector	as	input	(approx.	beam	search)
40
Algorithmic	Intelligence	Laboratory
SeqGAN	[Yu	et	al.,	2017]
• Motivation:
• RL-based methods still rely on a handcrafted objective (e.g., BLEU)
• Instead, one can use a GAN loss to generate realistic sequences
• However, it is not trivial to apply GANs to natural language, since the data is discrete
(hence not differentiable) and sequential (hence a new architecture is needed)
• Idea: Backpropagate the discriminator's output with policy gradients
• Similar to actor-critic; the only difference is that the reward is now the discriminator's output
• Use	LSTM-generator	and	CNN	(or	Bi-LSTM)-discriminator	architectures
41
Algorithmic	Intelligence	Laboratory
SeqGAN	[Yu	et	al.,	2017]
• Results:
• SeqGAN shows	better	performance	than	prior	methods
42
Synthetic	generation
(follow	the	oracle)
Chinese	poem	generation Obama	speech	generation
Algorithmic	Intelligence	Laboratory
1. Introduction
• Why	deep	learning	for	NLP?
• Overview	of	the	lecture
2. Network	Architecture
• Learning	long-term	dependencies
• Improve	softmax	layers
3. Training	Methods
• Reduce	exposure	bias
• Reduce	loss/evaluation	mismatch
• Extension	to	unsupervised	setting
Table	of	Contents
43
Algorithmic	Intelligence	Laboratory
UNMT	[Artetxe	et	al.,	2018]
• Motivation:
• Can we train neural machine translation models in an unsupervised way?
• Idea: Apply the idea of domain transfer from Lecture 12
• Combine two losses: a reconstruction loss and a cycle-consistency loss (see the code sketch below)
• Recall: The cycle-consistency loss forces data generated by two successive cross-domain mappings
(e.g., L1→L2→L1) to match the original data
44*Source:	Lample	et	al.	“Unsupervised	Machine	Translation	Using	Monolingual	Corpora	Only”,	ICLR	2018.
[Figure: model architecture (L1/L2: language 1, 2), showing the reconstruction and cross-domain generation paths]
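A schematic sketch of the two training signals, with all model components passed in as placeholder callables (this illustrates the loss structure, not the authors' exact training procedure): a denoising reconstruction loss within each language, and a back-translation (cycle-consistency) loss across languages.

```python
def unmt_loss(x1, x2, translate_12, translate_21, reconstruct_1, reconstruct_2, noise, nll):
    """x1 / x2: monolingual batches in language 1 / 2 (no paired data).
    translate_ij, reconstruct_i, noise, and nll are placeholder callables for the shared model."""
    # (a) denoising reconstruction: corrupt a sentence and decode it back into the same language
    recon = nll(reconstruct_1(noise(x1)), x1) + nll(reconstruct_2(noise(x2)), x2)
    # (b) cycle consistency via back-translation: L1 -> L2 -> L1 (and symmetrically);
    # the inner translation is produced by the current model without gradients (e.g., greedy decoding)
    cycle = nll(translate_21(translate_12(x1)), x1) + nll(translate_12(translate_21(x2)), x2)
    return recon + cycle
```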
Algorithmic	Intelligence	Laboratory
UNMT	[Artetxe	et	al.,	2018]
• Results:	UNMT	produces	good translation	results
45
BPE	(byte	pair	encoding),
a	preprocessing	method
Algorithmic	Intelligence	Laboratory
Conclusion
• Deep	learning	is	widely	used	for	natural	language	processing	(NLP)
• RNN	and	CNN	were	popular	in	2014-2017
• Recently,	self-attention	based	methods	are	widely	used
• Many	new	ideas	are	proposed	to	solve	language	problems
• New	architectures	(e.g.,	self-attention,	softmax)
• New	training	methods	(e.g.,	loss,	algorithm,	unsupervised)
• Research on natural language is only just beginning
• Deep learning (especially GANs) is not as widely used in NLP as in computer vision
• Transformer and BERT were only published in 2017-2018
• There	are	still	many	research	opportunities	in	NLP
46
Algorithmic	Intelligence	Laboratory
Introduction
• [Papineni et al., 2002] BLEU: a method for automatic evaluation of machine translation. ACL 2002.
link : https://dl.acm.org/citation.cfm?id=1073135
• [Cho et al., 2014] Learning Phrase Representations using RNN Encoder-Decoder for Statistical... EMNLP 2014.
link : https://arxiv.org/abs/1406.1078
• [Sutskever et al., 2014] Sequence to Sequence Learning with Neural Networks. NIPS 2014.
link : https://arxiv.org/abs/1409.3215
• [Gehring et al., 2017] Convolutional Sequence to Sequence Learning. ICML 2017.
link : https://arxiv.org/abs/1705.03122
• [Young et al., 2017] Recent Trends in Deep Learning Based Natural Language Processing. arXiv 2017.
link : https://arxiv.org/abs/1708.02709
Extension to unsupervised setting
• [Artetxe et al., 2018] Unsupervised Neural Machine Translation. ICLR 2018.
link : https://arxiv.org/abs/1710.11041
• [Lample et al., 2018] Unsupervised Machine Translation Using Monolingual Corpora Only. ICLR 2018.
link : https://arxiv.org/abs/1711.00043
References
47
Algorithmic	Intelligence	Laboratory
Learning	long-term	dependencies
• [Bahdanau et al., 2015] Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015.
link : https://arxiv.org/abs/1409.0473
• [Weston et al., 2015] Memory Networks. ICLR 2015.
link : https://arxiv.org/abs/1410.3916
• [Xu et al., 2015] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015.
link : https://arxiv.org/abs/1502.03044
• [Sukhbaatar et al., 2015] End-To-End Memory Networks. NIPS 2015.
link : https://arxiv.org/abs/1503.08895
• [Kumar et al., 2016] Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. ICML 2016.
link : https://arxiv.org/abs/1506.07285
• [Vaswani et al., 2017] Attention Is All You Need. NIPS 2017.
link : https://arxiv.org/abs/1706.03762
• [Wang et al., 2018] Non-local Neural Networks. CVPR 2018.
link : https://arxiv.org/abs/1711.07971
• [Zhang et al., 2018] Self-Attention Generative Adversarial Networks. arXiv 2018.
link : https://arxiv.org/abs/1805.08318
• [Peters et al., 2018] Deep contextualized word representations. NAACL 2018.
link : https://arxiv.org/abs/1802.05365
• [Devlin et al., 2018] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018.
link : https://arxiv.org/abs/1810.04805
References
48
Algorithmic	Intelligence	Laboratory
Improve	softmax	layers
• [Mnih & Hinton, 2009] A Scalable Hierarchical Distributed Language Model. NIPS 2009.
link : https://papers.nips.cc/paper/3583-a-scalable-hierarchical-distributed-language-model
• [Grave et al., 2017] Efficient softmax approximation for GPUs. ICML 2017.
link : https://arxiv.org/abs/1609.04309
• [Yang et al., 2018] Breaking the Softmax Bottleneck: A High-Rank RNN Language Model. ICLR 2018.
link : https://arxiv.org/abs/1711.03953
Reduce exposure bias
• [Williams et al., 1989] A Learning Algorithm for Continually Running Fully Recurrent... Neural Computation 1989.
link : https://ieeexplore.ieee.org/document/6795228
• [Bengio et al., 2015] Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. NIPS 2015.
link : https://arxiv.org/abs/1506.03099
• [Huszár et al., 2015] How (not) to Train your Generative Model: Scheduled Sampling, Likelihood... arXiv 2015.
link : https://arxiv.org/abs/1511.05101
• [Lamb et al., 2016] Professor Forcing: A New Algorithm for Training Recurrent Networks. NIPS 2016.
link : https://arxiv.org/abs/1610.09038
References
49
Algorithmic	Intelligence	Laboratory
Reduce	loss/evaluation	mismatch
• [Ranzato et al., 2016] Sequence Level Training with Recurrent Neural Networks. ICLR 2016.
link : https://arxiv.org/abs/1511.06732
• [Bahdanau et al., 2017] An Actor-Critic Algorithm for Sequence Prediction. ICLR 2017.
link : https://arxiv.org/abs/1607.07086
• [Yu et al., 2017] SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. AAAI 2017.
link : https://arxiv.org/abs/1609.05473
• [Rajeswar et al., 2017] Adversarial Generation of Natural Language. arXiv 2017.
link : https://arxiv.org/abs/1705.10929
• [Maddison et al., 2017] The Concrete Distribution: A Continuous Relaxation of Discrete Random... ICLR 2017.
link : https://arxiv.org/abs/1611.00712
• [Jang et al., 2017] Categorical Reparameterization with Gumbel-Softmax. ICLR 2017.
link : https://arxiv.org/abs/1611.01144
• [Kusner et al., 2016] GANS for Sequences of Discrete Elements with the Gumbel-softmax... NIPS Workshop 2016.
link : https://arxiv.org/abs/1611.04051
• [Tucker et al., 2017] REBAR: Low-variance, unbiased gradient estimates for discrete latent variable... NIPS 2017.
link : https://arxiv.org/abs/1703.07370
• [Hjelm et al., 2018] Boundary-Seeking Generative Adversarial Networks. ICLR 2018.
link : https://arxiv.org/abs/1702.08431
• [Zhao et al., 2018] Adversarially Regularized Autoencoders. ICML 2018.
link : https://arxiv.org/abs/1706.04223
References
50
Algorithmic	Intelligence	Laboratory
Transformer	[Vaswani	et	al.,	2017]	
• Method:
• (Scaled dot-product) attention is given by $\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d_k})\,\mathbf{V}$
• Use multi-head attention (i.e., an ensemble of attention heads)
• The final Transformer model is built upon the attention blocks
• First, extract features with self-attention
• Then decode the features with the usual attention
• Since the model has no sequential structure, the authors add a positional encoding
(a handcrafted feature that represents each position in the sequence)
51
*Notation: $(\mathbf{K}, \mathbf{V})$ is the (key, value) pair, and $\mathbf{Q}$ is the query
Algorithmic	Intelligence	Laboratory
Adaptive	Softmax	[Grave	et	al.,	2017]
• Limitation	of	prior	works	&	Proposed	idea:
• Put the top $k_h$ words ($p_h$ of the frequency mass) and a token "NEXT" in the first layer, and
put the remaining $k_t = k - k_h$ words ($p_t = 1 - p_h$ of the frequency mass) in the next layer
• Let $g(k, B)$ be the computation time for $k$ words and batch size $B$
• Then the computation time of the proposed (two-level) method is roughly $C = g(k_h + 1, B) + g(k_t, p_t B)$
• Here, $g(k, B)$ behaves like a threshold function (due to the fixed setup cost of the GPU)
52
Algorithmic	Intelligence	Laboratory
Adaptive	Softmax	[Grave	et	al.,	2017]
• Limitation	of	prior	works	&	Proposed	idea:
• The computation time of the proposed method is roughly $C = g(k_h + 1, B) + g(k_t, p_t B)$
• Hence, add the constraint $kB \geq k_0 B_0$ (for efficient GPU usage)
• Also, extend the model to the multi-cluster setting (with $J$ clusters), giving roughly $C = g(k_h + J, B) + \sum_{i=1}^{J} g(k_i, p_i B)$
• By solving the optimization problem (for the $k_i$ and $J$), the model is 3-5x faster than
the original softmax (in practice, $J = 5$ shows a good computation/performance trade-off)
53
Algorithmic	Intelligence	Laboratory
Professor	Forcing	[Lamb	et	al.,	2016]
• Motivation:
• Scheduled sampling (SS) is known to optimize a wrong objective [Huszár et al., 2015]
• Let $P$ and $Q$ be the data and model distributions, respectively
• Assume a length-2 sequence $x_1 x_2$, and let $\epsilon$ be the probability of feeding the real sample
• Then the scheduled-sampling objective mixes conditioning on the true $x_1$ (with probability $\epsilon$)
and on a model-sampled $\hat{x}_1$ (with probability $1 - \epsilon$)
• If $\epsilon = 1$, it is the usual MLE objective, but as $\epsilon \to 0$, it pushes the conditional
distribution $Q_{x_2 \mid x_1}$ toward the marginal distribution $P_{x_2}$ instead of $P_{x_2 \mid x_1}$
• Hence, the factorized $Q^* = P_{x_1} P_{x_2}$ can minimize the objective
54
Algorithmic	Intelligence	Laboratory
More	Methods	for	Discrete	GAN
• Gumbel-Softmax	(a.k.a.	concrete	distribution):
• The gradient estimator of REINFORCE has high variance
• One can apply the reparameterization trick… but how, for discrete variables?
• One can use the Gumbel-softmax trick [Jang et al., 2017; Maddison et al., 2017] to
obtain a biased but low-variance gradient estimator (see the code sketch below)
• One can also get an unbiased estimator by using the Gumbel-softmax estimator as a control
variate for REINFORCE, called REBAR [Tucker et al., 2017]
• Discrete GAN is still an active research area
• BSGAN [Hjelm et al., 2018], ARAE [Zhao et al., 2018], etc.
• However, GANs are not yet as popular for sequences (natural language) as for images
55
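A minimal sketch of the Gumbel-softmax relaxation: add Gumbel(0, 1) noise to the logits and take a temperature-controlled softmax, which approaches a one-hot categorical sample as the temperature goes to 0 (PyTorch also provides torch.nn.functional.gumbel_softmax, including a straight-through "hard" variant).

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=1.0):
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel) / tau, dim=-1)   # approaches a one-hot sample as tau -> 0

# Built-in equivalent, including a straight-through "hard" variant:
# y = F.gumbel_softmax(logits, tau=0.5, hard=True)
```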
