Inside Transformers
Suman Debnath
Principal Developer Advocate, India
Why Transformers?
• RNN and LSTM
  • Machine translation
  • Text generation
  • Next-word prediction
  • More…
  • Sequential processing and learning
• Challenges of recurrent models
  • Long-term dependencies are hard to capture
  • Bidirectional context is not taken into account
  • Context understanding
    « The bat flew past my window » vs. « He hit the baseball with the bat »
Attention Is All You Need
• Overcomes this limitation of RNNs
• SOTA model for several NLP tasks
• Paved the way for revolutionary new architectures such as BERT, GPT-3, and more
• Based entirely on the attention mechanism; does away with recurrence completely
Let's understand how the transformer works
(language translation task)
• Encoder-decoder architecture
• Feed the input sentence (source sentence) to the encoder
• The encoder learns a representation of the input sentence and sends that representation to the decoder
• The decoder receives the representation learned by the encoder as input and generates the output sentence (target sentence)
[Diagram: the encoder takes “I am good”, its representation is passed to the decoder, which outputs “je vais bien”]
The encoder of the transformer
• Stack of N encoders
• The output of one encoder is sent as input to the encoder above it
• Questions?
  • How exactly does the encoder work?
  • How does it generate the representation for the given source sentence (input sentence)?
[Diagram: “I am good” flows through Encoder Layer 1, Encoder Layer 2, …, Encoder Layer N, which produces the representation]
How exactly does the encoder work?
• All the encoder blocks are identical
• Each encoder block consists of two sublayers:
  • Multi-head attention
  • Feedforward network (FFN)
[Diagram: each encoder layer applies multi-head attention followed by an FFN; “I am good” passes through Encoder Layer 1 and Encoder Layer 2 to produce the representation]
Self-attention mechanism
• The pronoun it could refer to either dog or food
• We know that it refers to the dog, not the food
• But how can our model understand that?
• This is where the self-attention mechanism helps
• The model relates the word it to ALL the words in the sentence
“A dog ate the food because it was hungry”
How exactly does this work?
I am good
• The embedding of the word I is: …
• The embedding of the word am is: …
• The embedding of the word good is: …
*the embedding dimension is 512
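A toy sketch of this step (not from the slides; the embedding values here are random placeholders rather than learned vectors):

```python
import numpy as np

# Embed the sentence "I am good" as a 3 x 512 input matrix X.
# A real model looks these rows up in a learned embedding table;
# random values stand in for them here.
rng = np.random.default_rng(42)
d_model = 512
sentence = ["I", "am", "good"]
X = rng.standard_normal((len(sentence), d_model))   # shape (3, 512), one row per word
```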
3 new matrices: {query, key, value}
[Diagram: the input matrix X is multiplied by three weight matrices WQ, WK, and WV (randomly initialized, learned during training) to produce Q, K, and V; the first row of each gives the query, key, and value vectors of the word “I”]
*the dimension of the {query, key, value} vectors (dk) is 64
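A minimal sketch of how Q, K, and V could be obtained from the input matrix X, assuming the 3 x 512 embedding matrix above and randomly initialized weight matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 512, 64
X = rng.standard_normal((3, d_model))    # input matrix for "I am good"

# Weight matrices: randomly initialized here, learned during training in practice.
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # each of shape (3, 64)
# Row 0 of Q, K and V holds the query, key and value vectors of the word "I".
```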
Why are we computing this?
What is the use of query, key, and value matrices?
How is this going to help us?
Self-attention mechanism
• We learned how to compute the query (Q), key (K), and value (V) matrices
• They are obtained from the input matrix, X
• Next: how are they used in the self-attention mechanism?
• REMEMBER: the self-attention mechanism relates each word to all the words in the given sentence
• This understanding helps the model learn a better representation
• Next, we will see how these 3 matrices help to get a better representation
4 Steps
Step 1
(“Dot product” between the query matrix, Q, and the transposed key matrix, Kᵀ)
What exactly does Q·Kᵀ signify?
“how similar they are”
Step 2
(“Divide” the matrix by the square root of the dimension of the key vector)
Why is this needed?
“useful in obtaining stable gradients”
*the dimension of the key vector (dk) is 64
Step 3
(“Normalize” the matrix using the softmax function)
The softmax brings each score into the range 0 to 1, with each row summing to 1
Word “I” is related to:
- itself by 90%
- am by 7%
- good by 3%
Step 4
(the final step in the self-attention mechanism is to compute the “attention matrix”)
The attention matrix (Z) contains the attention values for each word in the sentence
(the sum of the value vectors weighted by the scores)
Self-attention of the word “I” is computed as the sum of the value vectors weighted by the scores
Using the self-attention mechanism, we can understand how a word is related to all the other words in the sentence.
[Recap diagram: Steps 1–4 of the self-attention computation]
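Putting the four steps together, a minimal NumPy sketch of the self-attention computation (shapes match the running example; the values are made up):

```python
import numpy as np

def self_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T                                  # Step 1: dot product of queries and keys
    scores = scores / np.sqrt(d_k)                    # Step 2: divide by sqrt(d_k) for stable gradients
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # Step 3: softmax, each row sums to 1
    return weights @ V                                # Step 4: attention matrix Z

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((3, 64)) for _ in range(3))
Z = self_attention(Q, K, V)                           # shape (3, 64): one attention vector per word
```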
Let’s recall
[Diagram: the encoder stack again; “I am good” flows through Encoder Layer 1 and Encoder Layer 2, each with a multi-head attention sublayer and an FFN sublayer, producing the representation]
Multi-head attention mechanism
• Instead of having a single attention head, we can use multiple attention heads
• Instead of computing a single attention matrix, Z, we can compute multiple attention matrices
• This is especially useful when the meaning of a word is ambiguous, e.g.
“A dog ate the food because it was hungry”
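A sketch of the idea, assuming h = 8 heads, each with its own projection matrices, whose outputs are concatenated and mixed by an output matrix W_O (names and shapes follow the original paper; the values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_model // h                          # 64 dimensions per head
X = rng.standard_normal((3, d_model))       # "I am good"

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for _ in range(h):
    W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)   # one attention matrix Z_i per head

W_O = rng.standard_normal((h * d_k, d_model))
multi_head_output = np.concatenate(heads, axis=-1) @ W_O   # shape (3, 512)
```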
Positional encoding
• In the transformer network, we do not use a recurrence mechanism
• How will it understand the meaning of a sentence if the word order is not retained?
• We must give the transformer some information about word order so that it can understand the sentence
[Diagram: the input embedding and the positional encoding are added element-wise to form the final input]
Positional encoding for the 30th word, first four dimensions: Dim 1 = -1, Dim 2 = -0.25, Dim 3 = 0.91, Dim 4 = -0.25
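The slides do not spell out the formula, but the original paper uses fixed sinusoidal encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A short sketch:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # positions 0 .. max_len-1
    two_i = np.arange(0, d_model, 2)[None, :]      # even dimension indices 0, 2, 4, ...
    angles = pos / np.power(10000, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe[30, :4])    # encoding for position 30, first four dimensions
```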
Let’s recall
[Diagram: the encoder again, now with positional encoding added to the input embeddings of “I am good” before the multi-head attention and FFN sublayers that produce the representation]
Feedforward network
• The feedforward network consists of two dense layers with a ReLU activation in between
• The parameters of the feedforward network are the same across the different positions of the sentence, but different across the encoder blocks
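A minimal sketch of this position-wise feedforward sublayer, assuming d_model = 512 and the inner dimension d_ff = 2048 used in the original paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048

W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

def feed_forward(x):
    # Two dense layers with a ReLU in between, applied identically at every position.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

out = feed_forward(rng.standard_normal((3, d_model)))   # shape (3, 512)
```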
Add and norm component
• It connects the input and output of a sublayer:
  • Connects the input of the multi-head attention sublayer to its output
  • Connects the input of the feedforward sublayer to its output
• A residual connection followed by layer normalization
• Layer normalization promotes faster training by preventing the values in each layer from changing heavily
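A sketch of the residual connection followed by layer normalization; `sublayer` is a hypothetical stand-in for either the multi-head attention or the FFN (the learnable gain and bias of layer norm are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # Residual connection (add the sublayer's input to its output), then layer norm.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 512))
out = add_and_norm(x, lambda h: 0.5 * h)   # toy sublayer in place of attention / FFN
```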
[Diagram (recap): the encoder takes “I am good”, passes its representation to the decoder, which outputs “je vais bien”]
The decoder of the transformer
• Stack of N decoders
• The output of one decoder is sent as input to the decoder above it
• The decoder receives two inputs:
  • The previous decoder’s output
  • The encoder’s representation
• Question?
  • How exactly does the decoder generate the target sentence?
[Diagram: decoding at time steps t=1, t=2, t=3, t=4]
Similarly, at every time step, the decoder appends the newly generated word to its input and predicts the next word.
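A sketch of this word-by-word loop; `decoder_step` is a hypothetical callable standing in for the whole decoder stack plus the linear and softmax layers:

```python
def greedy_decode(encoder_representation, decoder_step, max_len=20):
    """Generate the target sentence one word per time step.

    `decoder_step` (hypothetical) takes the encoder representation and the
    words generated so far, and returns the most likely next word.
    """
    output = ["<sos>"]
    for _ in range(max_len):
        next_word = decoder_step(encoder_representation, output)
        if next_word == "<eos>":
            break
        output.append(next_word)
    return output[1:]   # e.g. ["je", "vais", "bien"]
```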
How exactly does the decoder work?
• All the decoder blocks are identical
• Each decoder block consists of three sublayers:
  • Masked multi-head attention
  • Multi-head attention
  • Feedforward network (FFN)
[Diagram: a decoder block contains a masked multi-head attention sublayer, a multi-head attention sublayer over the encoder representation, and an FFN]
Masked multi-head attention
• During training:
  • We already have the right target sentence
  • So we can feed the whole target sentence as input to the decoder, but with a small modification (see the sketch after the diagram below)
• During testing:
  • The decoder predicts the target sentence word by word at each time step
  • We learned that the decoder takes <sos> as the first input token
  • It keeps predicting the target sentence until the <eos> token is reached
[Diagrams: the training dataset; the masked multi-head attention over the decoder inputs and its attention matrix; the multi-head attention over the encoder representation]
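The “small modification” is a look-ahead mask. A sketch, assuming the mask sets the scores of future positions to -inf before the softmax so that each word can attend only to itself and earlier words:

```python
import numpy as np

def masked_self_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Entries above the diagonal correspond to future words; set them to -inf
    # so their attention weight becomes 0 after the softmax.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 64)) for _ in range(3))   # e.g. "<sos> je vais bien"
Z = masked_self_attention(Q, K, V)
```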
Feedforward network
• The feedforward layer in the decoder works exactly the same as what we learned in the encoder
Add and norm component
• The add and norm component in the decoder works exactly the same as what we learned in the encoder
Linear and softmax layers
• Feed the output obtained from the topmost decoder to the linear and softmax layers
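A sketch of this last step with a made-up vocabulary; the linear layer projects each decoder output vector to vocabulary size, and the softmax turns the logits into probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 6
id_to_word = ["<sos>", "<eos>", "je", "vais", "bien", "bon"]   # toy vocabulary

W_out = rng.standard_normal((d_model, vocab_size))             # linear layer
decoder_output = rng.standard_normal((3, d_model))             # topmost decoder output

logits = decoder_output @ W_out
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = probs / probs.sum(axis=-1, keepdims=True)              # softmax over the vocabulary
next_word = id_to_word[int(probs[-1].argmax())]                # most probable next word
```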
The complete transformer architecture
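For reference, PyTorch ships the same overall layout as a single module (embeddings, positional encoding, and the final linear/softmax layer still have to be added around it). A minimal usage sketch:

```python
import torch
import torch.nn as nn

# Encoder-decoder stack with multi-head attention, add & norm, and FFN sublayers.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       dim_feedforward=2048, batch_first=True)

src = torch.rand(1, 3, 512)   # embedded + positionally encoded "I am good"
tgt = torch.rand(1, 4, 512)   # embedded target tokens generated so far
tgt_mask = model.generate_square_subsequent_mask(4)   # look-ahead mask for the decoder

out = model(src, tgt, tgt_mask=tgt_mask)   # shape (1, 4, 512)
```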
A few great books
• Resources
  • Data Science on AWS
  • Getting Started with Google BERT
Suman Debnath
Developer Advocate, India
/in/suman-d/
