Inside Transformers and BERT
Suman Debnath
Principal Developer Advocate, India
Why Transformers?
• RNN and LSTM
• Machine Translation
• Text Generation
• Next word prediction
• More…
• Sequential processing and learning
• Challenges with recurrent models
• Long-term dependency
• Bidirectional context is not taken into account
• Context understanding, e.g.
"The bat flew past my window" vs. "He hit the baseball with the bat"
• Overcome this limitation of RNNs
• SOTA model for several NLP tasks
• Paved the way for new
revolutionary architectures such as
BERT, GPT-3 and more
• Based entirely on the attention
mechanism and completely gets
rid of recurrence
Attention Is All You Need
Let's understand how the transformer works
(language translation task)
• Encoder-Decoder architecture
• Feed the input sentence (source sentence)
to the encoder
• Encoder learns the representation of the
input sentence and sends the
representation to the decoder
• The decoder receives the representation
learned by the encoder as input and
generates the output sentence (target
sentence).
"I am good" → Encoder → Representation → Decoder → "je vais bien"
The encoder of the transformer
• A stack of N encoders
• The output of one encoder is sent as
input to the encoder above it.
• Questions?
• How exactly does the encoder work?
• How does it generate the representation of the given source (input) sentence?
"I am good" → Encoder Layer 1 → Encoder Layer 2 → … → Encoder Layer N → Representation
How exactly does the encoder work?
• All the encoder blocks are identical
• Each encoder block consists of two sublayers
• Multi-head attention
• Feedforward network (FFN)
"I am good" → [Encoder Layer 1: Multi-head attention → FFN] → [Encoder Layer 2: Multi-head attention → FFN] → Representation
Self-attention mechanism
"A dog ate the food because it was hungry"
• The pronoun it could refer to either the dog or the food
• We know that it refers to the dog, not the food
• But how can our model understand this?
• This is where the self-attention mechanism helps us
• The model relates the word it to ALL the words in the sentence
How exactly does this work?
I am good
• The embedding of the word I is :
• The embedding of the word am is :
• The embedding of the word good is :
*the embedding dimension is 512
3 New Matrices : {query, key, value}
Q K V
WQ WK WV
Weight Matrices
WQ
WK
WV
(randomly initialized,
learned during training)
3 New Matrices : {query, key, value}
Q K V
WQ WK WV
Weight Matrices
WQ
WK
WV
(randomly initialized,
learned during training)
The first row of each matrix is the query, key, and value vector of the word "I"
*the dimension of the {query, key, value} vector(dk) is 64.
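To make this concrete, here is a minimal NumPy sketch of how Q, K, and V are computed from the input matrix X. The sizes (3 words, embedding dimension 512, dk = 64) follow the slides; the random matrices only stand in for learned weights, so the numbers themselves are illustrative.

```python
import numpy as np

np.random.seed(0)

d_model, d_k = 512, 64           # embedding size and query/key/value size (from the slides)
X = np.random.randn(3, d_model)  # input matrix: embeddings of "I", "am", "good"

# weight matrices W_Q, W_K, W_V: randomly initialized, learned during training
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q = X @ W_Q   # query matrix, one row per word
K = X @ W_K   # key matrix
V = X @ W_V   # value matrix
print(Q.shape, K.shape, V.shape)   # (3, 64) (3, 64) (3, 64)
```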
Why are we computing this?
What is the use of query, key, and value matrices?
How is this going to help us?
Self-attention mechanism
• We learned how to compute the query (Q), key (K), and value (V) matrices
• They are obtained from the input matrix, X
• How are they used in the self-attention mechanism?
• REMEMBER: the self-attention mechanism relates each word to all the words in the given sentence
• This understanding helps the model learn a better representation
• Next, we will see how these three matrices help us get a better representation
4 Steps
Step 1
("dot product" between the query matrix, Q, and the transpose of the key matrix, Kᵀ)
What exactly does Q·Kᵀ signify?
“how similar they are”
Step 2
("divide" the matrix Q·Kᵀ by the square root of the dimension of the key vector, √dk)
Why is this needed?
“useful in obtaining stable gradients”
*the dimension of the key vector(dk) is 64.
Step 3
("normalize" the scaled matrix using the softmax function)
We normalize the scores using the softmax function
(bringing each score into the range 0 to 1, with each row summing to 1)
Word “I” is related to:
- itself by 90%
- am by 7%
- good by 3%
Step 4
(the final step in the self-attention mechanism is to compute the "attention matrix")
The attention matrix Z contains the attention values for each word in the sentence:
Z = softmax(Q·Kᵀ / √dk) · V
(the sum of the value vectors weighted by the scores)
Self-attention of the word “I” is computed as the sum of the value vectors weighted by the scores
Using the self-attention mechanism, we can understand how a word is related to all the other words in the sentence.
Recap: Step 1 → Step 2 → Step 3 → Step 4
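Putting the four steps together, here is a minimal NumPy sketch of scaled dot-product self-attention. The toy Q, K, V matrices follow the earlier sketch; the softmax helper is written inline so the block is self-contained.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the last axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T                 # Step 1: dot product Q·Kᵀ ("how similar they are")
    scores = scores / np.sqrt(d_k)   # Step 2: divide by √dk for stable gradients
    weights = softmax(scores)        # Step 3: softmax, each row sums to 1
    Z = weights @ V                  # Step 4: attention matrix, weighted sum of value vectors
    return Z, weights

# example with toy Q, K, V (3 words, d_k = 64)
np.random.seed(0)
Q = np.random.randn(3, 64); K = np.random.randn(3, 64); V = np.random.randn(3, 64)
Z, weights = self_attention(Q, K, V)
print(Z.shape, weights.shape)   # (3, 64) (3, 3)
```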
Let’s recall
"I am good" → [Encoder Layer 1: Multi-head attention → FFN] → [Encoder Layer 2: Multi-head attention → FFN] → Representation
Multi-head attention mechanism
• Instead of having a single attention head, we can use multiple
attention heads
• Instead of computing a single attention matrix, Z, we can compute
multiple attention matrices
• This is especially useful when the meaning of a word in the sentence is ambiguous, e.g.
“A dog ate the food because it was hungry”
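A minimal sketch of the multi-head idea: run several independent attention heads, each with its own projection matrices, concatenate their outputs, and project back with an output matrix. The head count (8) and sizes follow the original paper; the random weights are only illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product attention, as in the four steps above
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, num_heads=8, d_k=64):
    d_model = X.shape[-1]
    heads = []
    for _ in range(num_heads):
        # each head has its own (randomly initialized, learned) projections
        W_Q, W_K, W_V = (np.random.randn(d_model, d_k) for _ in range(3))
        heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))
    W_O = np.random.randn(num_heads * d_k, d_model)   # output projection
    return np.concatenate(heads, axis=-1) @ W_O       # back to (seq_len, d_model)

X = np.random.randn(3, 512)           # embeddings of "I am good"
print(multi_head_attention(X).shape)  # (3, 512)
```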
Positional encoding
• In the transformer network, we don't use a recurrence mechanism
• How will it understand the meaning of the sentence if the word order is not retained?
• We should give the transformer some information about the word order so that it can understand the sentence
Input embedding + Positional encoding = final input to the encoder
Positional encoding
Example: the positional encoding of the 30th word, first 4 dimensions: Dim 1 = -1, Dim 2 = -0.25, Dim 3 = 0.91, Dim 4 = -0.25
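These values come from the sinusoidal formula of the original paper: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal sketch follows; the exact numbers depend on the dimension-indexing convention, so they will not match the slide's example exactly.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]    # (max_len, 1) word positions
    i = np.arange(d_model)[None, :]      # (1, d_model) dimension indices
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])  # even dimensions use sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])  # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=100, d_model=512)
print(pe[30, :4])   # first 4 dimensions of the encoding for the 30th word
```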
Let’s recall
Without positional encoding: "I am good" → Multi-head attention → FFN → Representation
With positional encoding: "I am good" + Positional Encoding → Multi-head attention → FFN → Representation
Feedforward network
• The feedforward network consists of two dense layers with ReLU
activations
• The parameters of the feedforward network are the same over the
different positions of the sentence and different over the encoder
blocks.
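A minimal sketch of the position-wise feedforward sublayer: two dense layers with a ReLU in between, applied with the same parameters to every position. The inner size of 2048 follows the original paper; this is an illustration, not the deck's exact code.

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    # dense -> ReLU -> dense, applied to every position with the same parameters
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)

X = np.random.randn(3, d_model)               # multi-head attention output for "I am good"
print(feed_forward(X, W1, b1, W2, b2).shape)  # (3, 512)
```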
Add and norm component
• It connects the input and output of a
sublayer.
• Connects the input of the multi-head attention
sublayer to its output
• Connects the input of the feedforward
sublayer to its output
• Residual connection followed by Layer
normalization
• Layer normalization promotes faster
training by preventing the values in each
layer from changing heavily
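A minimal sketch of the add and norm component: a residual connection around a sublayer, followed by layer normalization (the learnable gain and bias of layer norm are omitted for brevity; the linear map stands in for any sublayer).

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each position's vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # residual connection followed by layer normalization
    return layer_norm(x + sublayer(x))

X = np.random.randn(3, 512)
W = np.random.randn(512, 512)
out = add_and_norm(X, lambda x: x @ W)    # a simple linear map stands in for a sublayer
print(out.shape, out.mean(axis=-1)[:2])   # (3, 512), per-position means ≈ 0
```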
Transformers and BERT with SageMaker
"I am good" → Encoder → Representation → Decoder → "je vais bien"
BERT (Bidirectional Encoder Representations from Transformers)
"Python is my favorite programming language" → Encoder 1 → Encoder 2 → … → Encoder N → R_Python R_is R_my R_favorite R_programming R_language
(each encoder block: Multi-head attention + FFN)
How BERT works? (Pre-Training Tasks)
• BERT is pre-trained on two tasks:
• Masked Language Model (MLM)
• Next Sentence Prediction (NSP)
• BERT was trained to perform these two tasks purely as a way to force it to develop a sophisticated understanding of language, e.g.
"Kolkata is a beautiful city. I love Kolkata"
• Here's what BERT is supposed to do:
• MLM: predict the masked-out word (correct answer is "city"); see the sketch below
• NSP: was sentence B found immediately after sentence A, or from somewhere else? (correct answer: they are consecutive)
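To see MLM in action, here is a small illustration (not part of the original deck) using the Hugging Face transformers library. The fill-mask pipeline runs BERT's masked-language-model head; the bert-base-uncased checkpoint is downloaded on first run.

```python
from transformers import pipeline

# fill-mask pipeline runs a pre-trained BERT with its MLM head
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

results = fill_mask("Kolkata is a beautiful [MASK]. I love Kolkata.")
for r in results:
    print(f"{r['token_str']:>10s}  {r['score']:.3f}")   # "city" should rank near the top
```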
How BERT works? (Pre-Training Tasks)
INPUT: [CLS] Kolkata is a beautiful [MASK] [SEP] I love Kolkata [SEP]
Token Embeddings: E_[CLS] E_Kolkata E_is E_a E_beautiful E_[MASK] E_[SEP] E_I E_love E_Kolkata E_[SEP]
Segment Embeddings: E_A E_A E_A E_A E_A E_A E_A E_B E_B E_B E_B
Position Embeddings: E_0 E_1 E_2 E_3 E_4 E_5 E_6 E_7 E_8 E_9 E_10
Encoder Layer 1 → Encoder Layer 2 → … → Encoder Layer 12
OUTPUT (Enhanced Embedding): R_[CLS] R_Kolkata R_is R_a R_beautiful R_[MASK] R_[SEP] R_I R_love R_Kolkata R_[SEP]
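A minimal sketch of how these three embeddings are combined: the final input representation of each token is simply the element-wise sum of its token, segment, and position embeddings. The vocabulary here is a toy one; real BERT-base uses a ~30k-token vocabulary and hidden size 768.

```python
import numpy as np

vocab_size, hidden = 100, 768        # toy vocabulary; BERT-base hidden size is 768
max_positions, num_segments = 512, 2

token_emb    = np.random.randn(vocab_size, hidden)
segment_emb  = np.random.randn(num_segments, hidden)
position_emb = np.random.randn(max_positions, hidden)

token_ids    = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 7])  # toy ids for the 11 tokens
segment_ids  = np.array([0]*7 + [1]*4)                       # sentence A vs. sentence B
position_ids = np.arange(len(token_ids))                     # positions 0 .. 10

# BERT's input representation: token + segment + position embedding, per token
X = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[position_ids]
print(X.shape)   # (11, 768)
```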
How BERT works? (Pre-Training & Fine-Tuning)
INPUT: [CLS] Suman loves Kolkata [SEP]
Encoder Layer 1 → Encoder Layer 2 → … → Encoder Layer 12
OUTPUT (Enhanced Embedding): R_[CLS] R_Suman R_loves R_Kolkata R_[SEP]
How BERT works? (Pre-Training & Fine-Tuning)
INPUT: [CLS] Suman loves Kolkata [SEP]
Encoder Layer 1 → Encoder Layer 2 → … → Encoder Layer 12
OUTPUT (Enhanced Embedding): R_[CLS] R_Suman R_loves R_Kolkata R_[SEP]
Sentiment analysis head: FFN + Softmax over R_[CLS] → Positive 0.9, Negative 0.1
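A minimal sketch of this fine-tuning head: take the enhanced embedding of the [CLS] token and pass it through a feedforward layer plus softmax over the two classes. In practice one would typically use a ready-made class such as BertForSequenceClassification; this only illustrates the idea with random (untrained) weights.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

hidden, num_classes = 768, 2          # Positive / Negative
R = np.random.randn(5, hidden)        # enhanced embeddings of [CLS] Suman loves Kolkata [SEP]
R_cls = R[0]                          # the [CLS] representation summarizes the sentence

W, b = np.random.randn(hidden, num_classes), np.zeros(num_classes)  # fine-tuned head
probs = softmax(R_cls @ W + b)
# with a fine-tuned head this would be roughly {Positive: 0.9, Negative: 0.1}
print(dict(zip(["Positive", "Negative"], probs.round(2))))
```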
How BERT works? (Pre-Training & Fine-Tuning)
INPUT: [CLS] Suman loves Kolkata [SEP]
Encoder Layer 1 → Encoder Layer 2 → … → Encoder Layer 12
OUTPUT (Enhanced Embedding): R_[CLS] R_Suman R_loves R_Kolkata R_[SEP]
Named entity recognition head: a classifier over each token's representation (Suman → PERSON, Kolkata → LOCATION)
Sentiment analysis head: FFN + Softmax over R_[CLS] → Positive 0.9, Negative 0.1
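For NER the head is applied per token rather than only to [CLS]: each enhanced token embedding goes through a classifier over the entity labels. A minimal sketch with illustrative labels and random weights (BertForTokenClassification is the usual choice in practice):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

labels = ["O", "PERSON", "LOCATION"]
tokens = ["[CLS]", "Suman", "loves", "Kolkata", "[SEP]"]

hidden = 768
R = np.random.randn(len(tokens), hidden)                            # enhanced embeddings from BERT
W, b = np.random.randn(hidden, len(labels)), np.zeros(len(labels))  # fine-tuned classifier

probs = softmax(R @ W + b)                             # one label distribution per token
for tok, p in zip(tokens, probs):
    # with trained weights: Suman -> PERSON, Kolkata -> LOCATION
    print(f"{tok:>8s} -> {labels[int(p.argmax())]}")
```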
© 2021, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Demo
Few great books
• Resources
• Data Science on AWS
• Getting Started with Google BERT
Suman Debnath
Developer Advocate, India
/in/suman-d/
The decoder of the transformer
• A stack of N decoders
• The output of one decoder is sent as
input to the decoder above it
• Decoder receives two inputs:
• The previous decoder’s output
• The encoder's representation
• Question?
• How exactly does the decoder generate
the target sentence?
At time step t=1, t=2, t=3, t=4, and so on: on every time step, the decoder appends the newly generated word to its input and predicts the next word (see the sketch below).
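A minimal sketch of this autoregressive loop. The transformer_decode argument is a hypothetical stand-in for running the full encoder-decoder and returning the most likely next word; here a canned iterator plays that role.

```python
def greedy_decode(source_sentence, transformer_decode, max_len=50):
    """Generate the target sentence word by word, feeding each prediction back in."""
    target = ["<sos>"]                                            # decoder starts with <sos>
    for _ in range(max_len):
        next_word = transformer_decode(source_sentence, target)  # predict the next word
        if next_word == "<eos>":                                  # stop when <eos> is produced
            break
        target.append(next_word)                                  # append and repeat
    return target[1:]

# toy stand-in that "translates" "I am good" one word per step
canned = iter(["je", "vais", "bien", "<eos>"])
print(greedy_decode("I am good", lambda src, tgt: next(canned)))  # ['je', 'vais', 'bien']
```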
How exactly does the decoder work?
• All the decoder blocks are identical
• Each decoder block consists of three sublayers
• Masked multi-head attention
• Multi-head attention
• Feedforward network (FFN)
Decoder block: Masked multi-head attention → Multi-head attention (attends to the encoder Representation) → FFN
Masked multi-head attention
• During training:
• Since we already have the right target sentence
• We can feed the whole target sentence to the decoder as input, but with a small modification (a mask; see the sketch below)
• During testing (inference):
• The decoder predicts the target sentence word by word, one time step at a time
• The decoder takes <sos> as its first input token
• It keeps predicting the target sentence until the <eos> token is reached
Training Data set
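The "small modification" is a look-ahead (causal) mask: before the softmax, the attention scores of all future positions are set to a very large negative value, so each word can only attend to itself and the words before it. A minimal NumPy sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_self_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # look-ahead mask: positions above the diagonal (future words) are blocked
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores = np.where(mask, -1e9, scores)   # effectively -inf before the softmax
    return softmax(scores) @ V

np.random.seed(0)
Q = np.random.randn(4, 64); K = np.random.randn(4, 64); V = np.random.randn(4, 64)
Z = masked_self_attention(Q, K, V)          # e.g. <sos> je vais bien during training
print(Z.shape)   # (4, 64)
```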
Multi-head attention (in the decoder): its inputs are the attention matrix produced by the masked multi-head attention sublayer and the encoder representation
Transformers and BERT with SageMaker
Feedforward network
• The feedforward layer in the decoder works exactly the same as what
we learned in the encoder
Add and norm component
• The add and norm component in the decoder works exactly the same as
what we learned in the encoder
Linear and softmax layers
• Feed the output obtained from the topmost decoder to the linear and
softmax layers
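A minimal sketch of this final step: project the topmost decoder output to vocabulary size with a linear layer and apply softmax to pick the next word. A tiny toy vocabulary and random weights are used purely for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

vocab = ["<sos>", "<eos>", "je", "vais", "bien"]   # toy target vocabulary
d_model = 512

decoder_output = np.random.randn(4, d_model)       # topmost decoder output, one row per position
W_vocab, b_vocab = np.random.randn(d_model, len(vocab)), np.zeros(len(vocab))

logits = decoder_output @ W_vocab + b_vocab        # linear layer: project to vocabulary size
probs = softmax(logits)                            # softmax: probability of each word
next_words = [vocab[int(i)] for i in probs.argmax(axis=-1)]
print(next_words)                                  # predicted word at each position
```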
Transformers and BERT with SageMaker
The complete transformer architecture