Attention
DLAI – MARTA R. COSTA-JUSSÀ
SLIDES ADAPTED FROM GRAHAM NEUBIG'S LECTURES
What advancements excite you most in the field?
"I am very excited by the recently introduced attention models, due to their simplicity and due to the fact that they work so well. Although these models are new, I have no doubt that they are here to stay, and that they will play a very important role in the future of deep learning."
ILYA SUTSKEVER, RESEARCH DIRECTOR AND COFOUNDER OF OPENAI
Outline
1. Sequence modeling & sequence-to-sequence models [WRAP-UP FROM PREVIOUS RNNs SESSION]
2. Attention-based mechanism
3. Attention varieties
4. Attention improvements
5. Applications
6. "Attention is all you need"
7. Summary
Sequence modeling
Model the probability of sequences of words.
From the previous lecture… we model sequences with RNNs, factorizing the probability of a sentence one word at a time:
p(w_1, …, w_T) = ∏_t p(w_t | w_1, …, w_{t-1})
[Figure: an RNN language model generating "I'm fine ." token by token: p(I'm), p(fine | I'm), p(. | fine), then EOS]
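As a minimal illustration of this factorization, the NumPy sketch below scores a token sequence with a tiny RNN language model. All names and sizes (V, H, E, Wh, Wo) are illustrative assumptions, not the lecture's actual model:

import numpy as np

rng = np.random.default_rng(0)
V, H = 10, 8                        # hypothetical vocabulary and hidden sizes
E  = rng.normal(size=(V, H)) * 0.1  # word embeddings
Wh = rng.normal(size=(H, H)) * 0.1  # recurrent weights
Wo = rng.normal(size=(H, V)) * 0.1  # output projection

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sequence_log_prob(word_ids):
    """log p(w_1..w_T) = sum_t log p(w_t | w_1..w_{t-1}), computed left to right."""
    h, logp = np.zeros(H), 0.0
    for w in word_ids:
        p = softmax(h @ Wo)           # next-word distribution from the current state
        logp += np.log(p[w])          # score the observed word
        h = np.tanh(E[w] + h @ Wh)    # fold the observed word into the state
    return logp

print(sequence_log_prob([1, 4, 2]))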
Sequence-to-sequence models
[Figure: an encoder reads "how are you ?"; a decoder, conditioned on a single THOUGHT/CONTEXT VECTOR, generates "¿ Cómo estás ?" token by token until EOS]
Any problem with these models?
The whole source sentence must be compressed into a single fixed-size thought vector, regardless of the input length.
2. Attention-based mechanism
Motivation in the case of MT
Attention
[Figure: encoder-decoder where the decoder combines (+) several encoder states]
Attention allows the model to use multiple vectors, their number depending on the length of the input.
Attention Key Ideas
• Encode each word in the input and output sentence into a vector
• When decoding, perform a linear combination of these vectors, weighted by "attention weights"
• Use this combination in picking the next word
Attention computation I
• Use a "query" vector (the decoder state) and "key" vectors (all encoder states)
• For each query-key pair, calculate a weight
• Normalize the weights to sum to one using softmax
[Figure: a query vector scored against four key vectors; raw scores a1=2.1, a2=-0.1, a3=0.3, a4=-1.0 become weights a1=0.5, a2=0.3, a3=0.1, a4=0.1 after softmax]
Attention computation II
• Combine the value vectors (usually the encoder states, like the key vectors) by taking the weighted sum
[Figure: value vectors multiplied by the weights a1=0.5, a2=0.3, a3=0.1, a4=0.1 and summed into a single context vector]
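Putting computation I and II together, here is a minimal NumPy sketch of one attention step using plain dot-product scores; the names and sizes are illustrative assumptions:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys, values):
    scores = keys @ query        # one score per query-key pair
    weights = softmax(scores)    # normalize the weights to sum to one
    return weights @ values      # weighted sum of the value vectors

rng = np.random.default_rng(0)
keys = values = rng.normal(size=(4, 8))  # encoder states serve as both keys and values
query = rng.normal(size=8)               # current decoder state
context = attend(query, keys, values)    # context vector used to pick the next word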
Attention Score Functions
q is the query and k is the key.
Multi-layer Perceptron (Bahdanau et al., 2015): a(q, k) = w2ᵀ tanh(W1 [q; k]). Flexible, often very good with large data.
Bilinear (Luong et al., 2015): a(q, k) = qᵀ W k.
Dot Product (Luong et al., 2015): a(q, k) = qᵀ k. No parameters! But requires q and k to have the same size.
Scaled Dot Product (Vaswani et al., 2017): a(q, k) = qᵀ k / √|k|. Scale by the size of the vector.
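The four score functions side by side, as a hedged NumPy sketch (randomly initialized parameters and illustrative sizes, not trained values):

import numpy as np

rng = np.random.default_rng(0)
dq = dk = 8                                  # the dot product requires equal sizes
q, k = rng.normal(size=dq), rng.normal(size=dk)

# Multi-layer perceptron (Bahdanau et al., 2015)
W1 = rng.normal(size=(16, dq + dk))
w2 = rng.normal(size=16)
score_mlp = w2 @ np.tanh(W1 @ np.concatenate([q, k]))

# Bilinear (Luong et al., 2015)
W = rng.normal(size=(dq, dk))
score_bilinear = q @ W @ k

# Dot product (Luong et al., 2015): no parameters
score_dot = q @ k

# Scaled dot product (Vaswani et al., 2017): scale by the vector size
score_scaled = q @ k / np.sqrt(dk)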
Attention Integration
3. Attention Varieties
Hard Attention
• Instead of a soft interpolation, make a zero-one decision about where to attend (Xu et al., 2015)
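A minimal sketch of the hard decision (the soft weights come from any score function above; Xu et al. train the stochastic variant with reinforcement-learning-style gradients, which this sketch omits):

import numpy as np

def hard_attend(weights, values, sample=False, rng=None):
    if sample:                              # stochastic: draw one position from the weights
        idx = rng.choice(len(weights), p=weights)
    else:                                   # deterministic: attend to the best position only
        idx = int(np.argmax(weights))
    return values[idx]                      # a single value vector, not an interpolation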
Monotonic Attention
This approach "softly" prevents the model from assigning attention probability to positions before the one it attended to at the previous time step, by taking the previous attention into account.
Intra-Attention / Self-Attention
Each element in the sentence attends to other elements of the SAME sentence → context-sensitive encodings!
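A minimal NumPy sketch of self-attention over one sentence (unparameterized dot products for clarity; real models add learned projections):

import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    scores = X @ X.T                # every word scored against every word of the SAME sentence
    weights = softmax_rows(scores)  # one attention distribution per word
    return weights @ X              # context-sensitive encoding of each word

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))         # 5 words, 8-dimensional embeddings
print(self_attention(X).shape)      # (5, 8)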
Multiple Sources
Attend to multiple sentences (Zoph et al., 2015)
Attend to a sentence and an image (Huang et al., 2016)
Multi-headed Attention I
Multiple attention "heads" focus on different parts of the sentence.
a(q, k) = qᵀ k / √|k|
Multi-headed Attention II
Multiple attention "heads" focus on different parts of the sentence.
E.g. multiple independently learned heads (Vaswani et al., 2017), each computing scaled dot-product attention:
a(q, k) = qᵀ k / √|k|
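A hedged sketch of independently learned heads in the style of Vaswani et al.; the projection shapes and initialization are illustrative, and the residual connections, layer normalization, etc. of the full model are omitted:

import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo):
    heads = []
    for h in range(Wq.shape[0]):                          # each head has its own projections
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]
        A = softmax_rows(Q @ K.T / np.sqrt(K.shape[-1]))  # scaled dot-product attention
        heads.append(A @ V)                               # each head attends to different parts
    return np.concatenate(heads, axis=-1) @ Wo            # concatenate heads and project

rng = np.random.default_rng(0)
n_heads, d_model = 2, 8
d_head = d_model // n_heads
X = rng.normal(size=(5, d_model))                         # 5 words
Wq, Wk, Wv = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
Wo = rng.normal(size=(d_model, d_model))
print(multi_head_self_attention(X, Wq, Wk, Wv, Wo).shape) # (5, 8)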
4. Improvements in Attention
IN THE CONTEXT OF MT
Coverage
Problem: neural models tend to drop or repeat content.
In MT:
1. Over-translation: some words are unnecessarily translated multiple times.
2. Under-translation: some words are mistakenly left untranslated.
SRC: Señor Presidente, abre la sesión.
TRG: Mr President Mr President Mr President.
Solution: model how many times words have been covered, e.g. by maintaining a coverage vector to keep track of the attention history (Tu et al., 2016). A minimal sketch follows.
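The sketch below uses a simplified, hypothetical penalty form of the coverage idea; Tu et al. actually learn how coverage enters the attention score:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 8))        # encoder states for 4 source words
queries = rng.normal(size=(3, 8))     # decoder states for 3 target steps
penalty = 1.0                         # hypothetical strength of the coverage penalty

coverage = np.zeros(len(keys))        # attention mass each source word has received so far
for q in queries:
    scores = keys @ q - penalty * coverage  # discourage re-attending covered words
    weights = softmax(scores)
    coverage += weights                     # keep track of the attention history
print(coverage)                             # over/under-translation shows up as values far from 1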
Incorporating Markov Properties
Intuition: attention from the last time step tends to be correlated with attention at the current time step.
Approach: add information about the last attention when making the next decision (a minimal sketch follows).
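One simple way to realize this, sketched below with hypothetical names and an additive bias; the actual models feed the previous attention into a learned score function:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 8))
q = rng.normal(size=8)
prev = np.array([0.1, 0.7, 0.1, 0.1])          # attention weights from the previous step

shifted = np.concatenate([[0.0], prev[:-1]])   # previous mass moved one position to the right
lam = 0.5                                      # hypothetical interpolation weight
scores = keys @ q + lam * shifted              # bias toward continuing past the last position
weights = softmax(scores)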
Bidirectional Training
• Background: it is well established that, for latent-variable translation models, alignments improve if both directional models are combined (Koehn et al., 2005)
• Approach: joint training of the two directional models
Supervised Training
Sometimes we can get "gold standard" alignments a priori:
◦ Manual alignments
◦ Pre-trained with a strong alignment model
Train the model to match these strong alignments.
5. Applications
Chatbots
A chatbot is a computer program that conducts a conversation.
Human: what is your job
Enc-dec: i'm a lawyer
Human: what do you do ?
Enc-dec: i'm a doctor .
[Figure: an encoder-decoder with attention reading "what is your job" and generating "i'm a lawyer" token by token]
Natural Language Inference
Other NLP Tasks
Text Summarization: the process of shortening a text document with software to create a summary with the major points of the original document.
Question Answering: automatically producing an answer to a question given a corresponding document.
Semantic Parsing: mapping natural language into a logical form that can be executed on a knowledge base and return an answer.
Syntactic Parsing: the process of analysing a string of symbols, either in natural language or in computer languages, conforming to the rules of a formal grammar.
Image Captioning I
[Figure: an encoder over the image and a decoder generating "a cat on the mat" word by word]
Image Captioning II
Other Computer Vision Tasks with Attention
Visual Question Answering: given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
Video Caption Generation: attempts to generate a complete and natural sentence, enriching the single label as in video classification, to capture the most informative dynamics in videos.
Speech recognition / translation
6. "Attention is all you need"
SLIDES BASED ON HTTPS://RESEARCH.GOOGLEBLOG.COM/2017/08/TRANSFORMER-NOVEL-NEURAL-NETWORK.HTML
Motivation
The sequential nature of RNNs → difficult to take advantage of modern computing devices such as TPUs (Tensor Processing Units).
Transformer
Example: "I arrived at the bank after crossing the river". To translate "bank" correctly, the model must decide whether it means the river bank or a financial institution, which requires relating it to the distant word "river".
Transformer I
[Figure: the Transformer encoder-decoder architecture]
Transformer II
Transformer results
Attention weights
7. Summary
RNNs and Attention
RNNs are used to model sequences.
Attention is used to enhance the modeling of long sequences.
The versatility of these models allows them to be applied to a wide range of applications.
Implementations of Encoder-Decoder
The encoder and decoder can be implemented with recurrent networks (e.g., LSTMs) or with convolutional networks (CNNs).
Attention-based mechanisms
Soft vs Hard: soft attention weights all pixels; hard attention crops the image and forces attention only on the kept part.
Global vs Local: a global approach always attends to all source words; a local one only looks at a subset of source words at a time.
Intra vs External: intra attention is within the encoder's input sentence; external attention is across sentences.
One large encoder-decoder
• Text, speech, image… is it all converging to a single signal paradigm?
• If you know how to build a neural MT system, you may easily learn how to build a speech-to-text recognition system...
• Or you may train them together to achieve zero-shot AI.
Research going on… Interested? marta.ruiz@upc.edu
Q&A?
Quiz
1. Mark all statements that are true:
A. Sequence modeling only refers to language applications
B. The attention mechanism can be applied to an encoder-decoder architecture
C. Neural machine translation systems require recurrent neural networks
D. If we want to have a fixed representation (thought vector), we cannot apply attention-based mechanisms
2. Given the query vector q=[], the key vector 1 k1=[] and the key vector 2 k2=[]:
A. What are the attention weights 1 & 2 when computing the dot product?
B. And when computing the scaled dot product?
C. To which key vector are we giving more attention?
D. What is the advantage of computing the scaled dot product?
