Inside Transformers and BERT
Suman Debnath
Principal Developer Advocate, India
Why Transformers?
• RNN and LSTM
• Machine Translation
• Text Generation
• Next word prediction
• More…
• Sequential processing and learning
• Challenges with recurrent models
• Long-term dependency
• Bidirectional context is not taken into account
• Context understanding, e.g.
"The bat flew past my window" vs. "He hit the baseball with the bat"
• Overcome this limitation of RNNs
• SOTA model for several NLP tasks
• Paved the way for new
revolutionary architectures such as
BERT, GPT-3 and more
• Based entirely on the attention
mechanism and completely gets
rid of recurrence
Attention Is All You Need
Let's understand how the transformer works
(language translation task)
• Encoder-Decoder architecture
• Feed the input sentence (source sentence)
to the encoder
• Encoder learns the representation of the
input sentence and sends the
representation to the decoder
• The decoder receives the representation
learned by the encoder as input and
generates the output sentence (target
sentence).
"I am good" → Encoder → Representation → Decoder → "je vais bien"
The encoder of the transformer
• A stack of N encoders
• The output of one encoder is sent as
input to the encoder above it.
• Questions?
• How exactly does the encoder work?
• How does it generate the representation of the given source (input) sentence?
"I am good" → Encoder Layer 1 → Encoder Layer 2 → … → Encoder Layer N → Representation
How exactly does the encoder work?
• All the encoder blocks are identical
• Each encoder block consists of two sublayers
• Multi-head attention
• Feedforward network (FFN)
"I am good" → [Encoder Layer 1: Multi-head attention → FFN] → [Encoder Layer 2: Multi-head attention → FFN] → Representation
Self-attention mechanism
"A dog ate the food because it was hungry"
• The pronoun it could refer to either the dog or the food
• We know that it refers to the dog, not the food
• But how can our model understand this?
• This is where the self-attention mechanism helps us
• The model relates the word it to ALL the words in the sentence
How exactly does this work?
I am good
• The embedding of the word I is :
• The embedding of the word am is :
• The embedding of the word good is :
*the embedding dimension is 512
3 New Matrices : {query, key, value}
Q K V
WQ WK WV
Weight Matrices
WQ
WK
WV
(randomly initialized,
learned during training)
3 New Matrices : {query, key, value}
Q K V
WQ WK WV
Weight Matrices
WQ
WK
WV
(randomly initialized,
learned during training)
The first row of each matrix is the query, key, and value vector of the word "I"
*the dimension of the {query, key, value} vector(dk) is 64.
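To make this concrete, here is a minimal NumPy sketch of how Q, K, and V are computed from the input matrix X. The sizes (3 words, embedding dimension 512, dk = 64) follow the slides; the random matrices only stand in for learned weights, so the numbers themselves are illustrative.

```python
import numpy as np

np.random.seed(0)

d_model, d_k = 512, 64           # embedding size and query/key/value size (from the slides)
X = np.random.randn(3, d_model)  # input matrix: embeddings of "I", "am", "good"

# weight matrices W_Q, W_K, W_V: randomly initialized, learned during training
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q = X @ W_Q   # query matrix, one row per word
K = X @ W_K   # key matrix
V = X @ W_V   # value matrix
print(Q.shape, K.shape, V.shape)   # (3, 64) (3, 64) (3, 64)
```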
Why are we computing this?
What is the use of query, key, and value matrices?
How is this going to help us?
Self-attention mechanism
• We learned how to compute the query (Q), key (K), and value (V) matrices
• They are obtained from the input matrix, X
• How are they used in the self-attention mechanism?
• REMEMBER: the self-attention mechanism relates each word to all the words in the given sentence
• This understanding helps the model learn a better representation
• Next, we will see how these three matrices help us get a better representation
4 Steps
Step 1
("dot product" between the query matrix, Q, and the transpose of the key matrix, Kᵀ)
What exactly does Q·Kᵀ signify?
“how similar they are”
Step 2
("divide" the matrix Q·Kᵀ by the square root of the dimension of the key vector, √dk)
Why is this needed?
“useful in obtaining stable gradients”
*the dimension of the key vector(dk) is 64.
Step 3
("normalize" the scaled matrix using the softmax function)
We normalize the scores using the softmax function
(bringing each score into the range 0 to 1, with each row summing to 1)
Word “I” is related to:
- itself by 90%
- am by 7%
- good by 3%
Step 4
(the final step in the self-attention mechanism is to compute the "attention matrix")
The attention matrix Z contains the attention values for each word in the sentence:
Z = softmax(Q·Kᵀ / √dk) · V
(the sum of the value vectors weighted by the scores)
Self-attention of the word “I” is computed as the sum of the value vectors weighted by the scores
Using the self-attention mechanism, we can understand how a word is related to all the other words in the sentence.
Recap: Step 1 → Step 2 → Step 3 → Step 4
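Putting the four steps together, here is a minimal NumPy sketch of scaled dot-product self-attention. The toy Q, K, V matrices follow the earlier sketch; the softmax helper is written inline so the block is self-contained.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the last axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T                 # Step 1: dot product Q·Kᵀ ("how similar they are")
    scores = scores / np.sqrt(d_k)   # Step 2: divide by √dk for stable gradients
    weights = softmax(scores)        # Step 3: softmax, each row sums to 1
    Z = weights @ V                  # Step 4: attention matrix, weighted sum of value vectors
    return Z, weights

# example with toy Q, K, V (3 words, d_k = 64)
np.random.seed(0)
Q = np.random.randn(3, 64); K = np.random.randn(3, 64); V = np.random.randn(3, 64)
Z, weights = self_attention(Q, K, V)
print(Z.shape, weights.shape)   # (3, 64) (3, 3)
```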
Let’s recall
"I am good" → [Encoder Layer 1: Multi-head attention → FFN] → [Encoder Layer 2: Multi-head attention → FFN] → Representation
Multi-head attention mechanism
• Instead of having a single attention head, we can use multiple
attention heads
• Instead of computing a single attention matrix, Z, we can compute
multiple attention matrices
• This is especially useful when the meaning of a word in the sentence is ambiguous, e.g.
“A dog ate the food because it was hungry”
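A minimal sketch of the multi-head idea: run several independent attention heads, each with its own projection matrices, concatenate their outputs, and project back with an output matrix. The head count (8) and sizes follow the original paper; the random weights are only illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product attention, as in the four steps above
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, num_heads=8, d_k=64):
    d_model = X.shape[-1]
    heads = []
    for _ in range(num_heads):
        # each head has its own (randomly initialized, learned) projections
        W_Q, W_K, W_V = (np.random.randn(d_model, d_k) for _ in range(3))
        heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))
    W_O = np.random.randn(num_heads * d_k, d_model)   # output projection
    return np.concatenate(heads, axis=-1) @ W_O       # back to (seq_len, d_model)

X = np.random.randn(3, 512)           # embeddings of "I am good"
print(multi_head_attention(X).shape)  # (3, 512)
```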
Positional encoding
• In the transformer network, we don't use a recurrence mechanism
• How will it understand the meaning of the sentence if the word order is not retained?
• We should give the transformer some information about the word order so that it can understand the sentence
Input embedding + Positional encoding = final input to the encoder
Positional encoding
Example: the positional encoding of the 30th word, first 4 dimensions: Dim 1 = -1, Dim 2 = -0.25, Dim 3 = 0.91, Dim 4 = -0.25
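These values come from the sinusoidal formula of the original paper: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal sketch follows; the exact numbers depend on the dimension-indexing convention, so they will not match the slide's example exactly.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]    # (max_len, 1) word positions
    i = np.arange(d_model)[None, :]      # (1, d_model) dimension indices
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])  # even dimensions use sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])  # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=100, d_model=512)
print(pe[30, :4])   # first 4 dimensions of the encoding for the 30th word
```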
Let’s recall
Without positional encoding: "I am good" → Multi-head attention → FFN → Representation
With positional encoding: "I am good" + Positional Encoding → Multi-head attention → FFN → Representation
Feedforward network
• The feedforward network consists of two dense layers with ReLU
activations
• The parameters of the feedforward network are the same over the
different positions of the sentence and different over the encoder
blocks.
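A minimal sketch of the position-wise feedforward sublayer: two dense layers with a ReLU in between, applied with the same parameters to every position. The inner size of 2048 follows the original paper; this is an illustration, not the deck's exact code.

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    # dense -> ReLU -> dense, applied to every position with the same parameters
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)

X = np.random.randn(3, d_model)               # multi-head attention output for "I am good"
print(feed_forward(X, W1, b1, W2, b2).shape)  # (3, 512)
```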
Add and norm component
• It connects the input and output of a
sublayer.
• Connects the input of the multi-head attention
sublayer to its output
• Connects the input of the feedforward
sublayer to its output
• Residual connection followed by Layer
normalization
• Layer normalization promotes faster
training by preventing the values in each
layer from changing heavily
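A minimal sketch of the add and norm component: a residual connection around a sublayer, followed by layer normalization (the learnable gain and bias of layer norm are omitted for brevity; the linear map stands in for any sublayer).

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each position's vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # residual connection followed by layer normalization
    return layer_norm(x + sublayer(x))

X = np.random.randn(3, 512)
W = np.random.randn(512, 512)
out = add_and_norm(X, lambda x: x @ W)    # a simple linear map stands in for a sublayer
print(out.shape, out.mean(axis=-1)[:2])   # (3, 512), per-position means ≈ 0
```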
Transformers and BERT with SageMaker
"I am good" → Encoder → Representation → Decoder → "je vais bien"
BERT (Bidirectional Encoder Representations from Transformers)
"Python is my favorite programming language" → Encoder 1 → Encoder 2 → … → Encoder N → R_Python R_is R_my R_favorite R_programming R_language
(each encoder block: Multi-head attention + FFN)
How BERT works? (Pre-Training Tasks)
• BERT is pre-trained on two tasks:
• Masked Language Model (MLM)
• Next Sentence Prediction (NSP)
• BERT was trained to perform these two tasks purely as a way to force it to develop a sophisticated understanding of language, e.g.
"Kolkata is a beautiful city. I love Kolkata"
• Here's what BERT is supposed to do:
• MLM: predict the masked-out word (correct answer is "city"); see the sketch below
• NSP: was sentence B found immediately after sentence A, or from somewhere else? (correct answer: they are consecutive)
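To see MLM in action, here is a small illustration (not part of the original deck) using the Hugging Face transformers library. The fill-mask pipeline runs BERT's masked-language-model head; the bert-base-uncased checkpoint is downloaded on first run.

```python
from transformers import pipeline

# fill-mask pipeline runs a pre-trained BERT with its MLM head
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

results = fill_mask("Kolkata is a beautiful [MASK]. I love Kolkata.")
for r in results:
    print(f"{r['token_str']:>10s}  {r['score']:.3f}")   # "city" should rank near the top
```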
How BERT works? (Pre-Training Tasks)
INPUT: [CLS] Kolkata is a beautiful [MASK] [SEP] I love Kolkata [SEP]
Token Embeddings: E_[CLS] E_Kolkata E_is E_a E_beautiful E_[MASK] E_[SEP] E_I E_love E_Kolkata E_[SEP]
Segment Embeddings: E_A E_A E_A E_A E_A E_A E_A E_B E_B E_B E_B
Position Embeddings: E_0 E_1 E_2 E_3 E_4 E_5 E_6 E_7 E_8 E_9 E_10
Encoder Layer 1 → Encoder Layer 2 → … → Encoder Layer 12
OUTPUT (Enhanced Embedding): R_[CLS] R_Kolkata R_is R_a R_beautiful R_[MASK] R_[SEP] R_I R_love R_Kolkata R_[SEP]
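A minimal sketch of how these three embeddings are combined: the final input representation of each token is simply the element-wise sum of its token, segment, and position embeddings. The vocabulary here is a toy one; real BERT-base uses a ~30k-token vocabulary and hidden size 768.

```python
import numpy as np

vocab_size, hidden = 100, 768        # toy vocabulary; BERT-base hidden size is 768
max_positions, num_segments = 512, 2

token_emb    = np.random.randn(vocab_size, hidden)
segment_emb  = np.random.randn(num_segments, hidden)
position_emb = np.random.randn(max_positions, hidden)

token_ids    = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 7])  # toy ids for the 11 tokens
segment_ids  = np.array([0]*7 + [1]*4)                       # sentence A vs. sentence B
position_ids = np.arange(len(token_ids))                     # positions 0 .. 10

# BERT's input representation: token + segment + position embedding, per token
X = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[position_ids]
print(X.shape)   # (11, 768)
```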
How BERT works? (Pre-Training & Fine-Tuning)
INPUT: [CLS] Suman loves Kolkata [SEP]
Encoder Layer 1 → Encoder Layer 2 → … → Encoder Layer 12
OUTPUT (Enhanced Embedding): R_[CLS] R_Suman R_loves R_Kolkata R_[SEP]
How BERT works? (Pre-Training & Fine-Tuning)
INPUT: [CLS] Suman loves Kolkata [SEP]
Encoder Layer 1 → Encoder Layer 2 → … → Encoder Layer 12
OUTPUT (Enhanced Embedding): R_[CLS] R_Suman R_loves R_Kolkata R_[SEP]
Sentiment analysis head: FFN + Softmax over R_[CLS] → Positive 0.9, Negative 0.1
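A minimal sketch of this fine-tuning head: take the enhanced embedding of the [CLS] token and pass it through a feedforward layer plus softmax over the two classes. In practice one would typically use a ready-made class such as BertForSequenceClassification; this only illustrates the idea with random (untrained) weights.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

hidden, num_classes = 768, 2          # Positive / Negative
R = np.random.randn(5, hidden)        # enhanced embeddings of [CLS] Suman loves Kolkata [SEP]
R_cls = R[0]                          # the [CLS] representation summarizes the sentence

W, b = np.random.randn(hidden, num_classes), np.zeros(num_classes)  # fine-tuned head
probs = softmax(R_cls @ W + b)
# with a fine-tuned head this would be roughly {Positive: 0.9, Negative: 0.1}
print(dict(zip(["Positive", "Negative"], probs.round(2))))
```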
How BERT works? (Pre-Training & Fine-Tuning)
INPUT: [CLS] Suman loves Kolkata [SEP]
Encoder Layer 1 → Encoder Layer 2 → … → Encoder Layer 12
OUTPUT (Enhanced Embedding): R_[CLS] R_Suman R_loves R_Kolkata R_[SEP]
Named entity recognition head: a classifier over each token's representation (Suman → PERSON, Kolkata → LOCATION)
Sentiment analysis head: FFN + Softmax over R_[CLS] → Positive 0.9, Negative 0.1
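For NER the head is applied per token rather than only to [CLS]: each enhanced token embedding goes through a classifier over the entity labels. A minimal sketch with illustrative labels and random weights (BertForTokenClassification is the usual choice in practice):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

labels = ["O", "PERSON", "LOCATION"]
tokens = ["[CLS]", "Suman", "loves", "Kolkata", "[SEP]"]

hidden = 768
R = np.random.randn(len(tokens), hidden)                            # enhanced embeddings from BERT
W, b = np.random.randn(hidden, len(labels)), np.zeros(len(labels))  # fine-tuned classifier

probs = softmax(R @ W + b)                             # one label distribution per token
for tok, p in zip(tokens, probs):
    # with trained weights: Suman -> PERSON, Kolkata -> LOCATION
    print(f"{tok:>8s} -> {labels[int(p.argmax())]}")
```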
© 2021, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Demo
Few great books
• Resources
• Data Science on AWS
• Getting Started with Google BERT
Suman Debnath
Developer Advocate, India
/in/suman-d/
The decoder of the transformer
• A stack of N decoders
• The output of one decoder is sent as
input to the decoder above it
• Decoder receives two inputs:
• The previous decoder’s output
• The encoder's representation
• Question?
• How exactly does the decoder generate
the target sentence?
At time step t=1, t=2, t=3, t=4, and so on: on every time step, the decoder appends the newly generated word to its input and predicts the next word (see the sketch below).
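A minimal sketch of this autoregressive loop. The transformer_decode argument is a hypothetical stand-in for running the full encoder-decoder and returning the most likely next word; here a canned iterator plays that role.

```python
def greedy_decode(source_sentence, transformer_decode, max_len=50):
    """Generate the target sentence word by word, feeding each prediction back in."""
    target = ["<sos>"]                                            # decoder starts with <sos>
    for _ in range(max_len):
        next_word = transformer_decode(source_sentence, target)  # predict the next word
        if next_word == "<eos>":                                  # stop when <eos> is produced
            break
        target.append(next_word)                                  # append and repeat
    return target[1:]

# toy stand-in that "translates" "I am good" one word per step
canned = iter(["je", "vais", "bien", "<eos>"])
print(greedy_decode("I am good", lambda src, tgt: next(canned)))  # ['je', 'vais', 'bien']
```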
How exactly does the decoder work?
• All the decoder blocks are identical
• Each decoder block consists of three sublayers
• Masked multi-head attention
• Multi-head attention
• Feedforward network (FFN)
Decoder block: Masked multi-head attention → Multi-head attention (attends to the encoder Representation) → FFN
Masked multi-head attention
• During training:
• Since we already have the right target sentence
• We can feed the whole target sentence to the decoder as input, but with a small modification (a mask; see the sketch below)
• During testing (inference):
• The decoder predicts the target sentence word by word, one time step at a time
• The decoder takes <sos> as its first input token
• It keeps predicting the target sentence until the <eos> token is reached
Training Data set
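The "small modification" is a look-ahead (causal) mask: before the softmax, the attention scores of all future positions are set to a very large negative value, so each word can only attend to itself and the words before it. A minimal NumPy sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_self_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # look-ahead mask: positions above the diagonal (future words) are blocked
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores = np.where(mask, -1e9, scores)   # effectively -inf before the softmax
    return softmax(scores) @ V

np.random.seed(0)
Q = np.random.randn(4, 64); K = np.random.randn(4, 64); V = np.random.randn(4, 64)
Z = masked_self_attention(Q, K, V)          # e.g. <sos> je vais bien during training
print(Z.shape)   # (4, 64)
```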
Multi-head attention (in the decoder): its inputs are the attention matrix produced by the masked multi-head attention sublayer and the encoder representation
Transformers and BERT with SageMaker
Feedforward network
• The feedforward layer in the decoder works exactly the same as what
we learned in the encoder
Add and norm component
• The add and norm component in the decoder works exactly the same as
what we learned in the encoder
Linear and softmax layers
• Feed the output obtained from the topmost decoder to the linear and
softmax layers
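A minimal sketch of this final step: project the topmost decoder output to vocabulary size with a linear layer and apply softmax to pick the next word. A tiny toy vocabulary and random weights are used purely for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

vocab = ["<sos>", "<eos>", "je", "vais", "bien"]   # toy target vocabulary
d_model = 512

decoder_output = np.random.randn(4, d_model)       # topmost decoder output, one row per position
W_vocab, b_vocab = np.random.randn(d_model, len(vocab)), np.zeros(len(vocab))

logits = decoder_output @ W_vocab + b_vocab        # linear layer: project to vocabulary size
probs = softmax(logits)                            # softmax: probability of each word
next_words = [vocab[int(i)] for i in probs.argmax(axis=-1)]
print(next_words)                                  # predicted word at each position
```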
Transformers and BERT with SageMaker
The complete transformer architecture