Transformers
Introduction to Transformers
LLMs are built out of transformers
Transformer: a specific kind of network architecture, like a
fancier feedforward network, but based on attention
[Figure: title page of the paper "Attention Is All You Need" (Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin; Google Brain, Google Research, University of Toronto, 2017).]
A very approximate timeline
1990 Static Word Embeddings
2003 Neural Language Model
2008 Multi-Task Learning
2015 Attention
2017 Transformer
2018 Contextual Word Embeddings and Pretraining
2019 Prompting
Transformers
Attention
Instead of starting with the big picture
[Figure: a transformer language model. Input tokens x1 … x5 ("So long and thanks for") pass through the input encoding E, then up through stacked transformer blocks, and into a language modeling head (unembedding layer U plus softmax) that produces logits and predicts the next token at each position ("long and thanks for … all").]
Let's consider the embeddings for an individual word from a particular layer
Problem with static embeddings (word2vec)
They are static! The embedding for a word doesn't reflect how its
meaning changes in context.
The chicken didn't cross the road because it was too tired
What is the meaning represented in the static embedding for "it"?
Contextual Embeddings
• Intuition: a representation of meaning of a word
should be different in different contexts!
• Contextual Embedding: each word has a different
vector that expresses different meanings
depending on the surrounding words
• How to compute contextual embeddings?
• Attention
Contextual Embeddings
The chicken didn't cross the road because it
What should be the properties of "it"?
The chicken didn't cross the road because it was too tired
The chicken didn't cross the road because it was too wide
At this point in the sentence, it's probably referring to either the chicken or the road
Intuition of attention
Build up the contextual embedding from a word by
selectively integrating information from all the
neighboring words
We say that a word "attends to" some neighboring
words more than others
Intuition of attention:
[Figure: two layers of representations for "The chicken didn't cross the road because it was too tired". The layer k+1 representation of "it" is computed from a self-attention distribution over the columns corresponding to the input tokens at layer k.]
Attention definition
A mechanism for helping compute the embedding for
a token by selectively attending to and integrating
information from surrounding tokens (at the previous
layer).
More formally: a method for doing a weighted sum of
vectors.
Attention is left-to-right
[Figure: a causal self-attention layer. Each output a1 … a5 attends only to the current and preceding inputs x1 … x5, never to the right.]
Simplified version of attention: a sum of prior words
weighted by their similarity with the current word
Given a sequence of token embeddings:
x1 x2 x3 x4 x5 x6 x7 xi
Produce: ai = a weighted sum of x1 through x7 (and xi)
Weighted by their similarity to xi
How do we compare words to other words? Since our representations for words are vectors, we'll make use of our old friend the dot product that we used for measuring similarity in Chapter 6. We'll call the result of this comparison between words i and j a score:

Version 1:  score(xi, xj) = xi · xj

The result of a dot product is a scalar value ranging from −∞ to ∞; the larger the value, the more similar the vectors being compared. Continuing with our example, the first step in computing a3 would be to compute three scores: x3·x1, x3·x2, and x3·x3. Then, to make effective use of these scores, we normalize them with a softmax to create a vector of weights αij that indicates the proportional relevance of each input to the input element i that is the current focus of attention:

αij = softmax(score(xi, xj))  ∀ j ≤ i
    = exp(score(xi, xj)) / Σ_{k=1..i} exp(score(xi, xk))  ∀ j ≤ i

Finally, the output ai is a sum of the inputs seen so far, each weighted by its α value:

ai = Σ_{j≤i} αij xj
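Not part of the original slides: a minimal NumPy sketch of this "Version 1" attention, under the assumption that X stacks the token embeddings row-wise; the function name is made up for illustration.

```python
import numpy as np

def simplified_attention(X):
    """Version-1 attention: a_i is a sum of x_1..x_i, weighted by the
    softmax of their dot-product similarity to the current token x_i."""
    N, d = X.shape
    A = np.zeros_like(X)
    for i in range(N):
        scores = X[: i + 1] @ X[i]                  # score(x_i, x_j) = x_i . x_j  for j <= i
        weights = np.exp(scores - scores.max())     # softmax over the prefix (stable form)
        weights /= weights.sum()
        A[i] = weights @ X[: i + 1]                 # a_i = sum_{j<=i} alpha_ij x_j
    return A

# toy example: 5 tokens, 4-dimensional embeddings
X = np.random.randn(5, 4)
print(simplified_attention(X).shape)  # (5, 4)
```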
Intuition of attention:
[Figure: the same attention-distribution figure, now with the layer-k tokens of "The chicken didn't cross the road because it was too tired" labeled x1 x2 x3 x4 x5 x6 x7 … xi; the layer k+1 representation of the current token xi is built from a self-attention distribution over the columns corresponding to the input tokens.]
An Actual Attention Head: slightly more complicated
High-level idea: instead of using vectors (like xi and x4)
directly, we'll represent 3 separate roles each vector xi plays:
• query: As the current element being compared to the
preceding inputs.
• key: as a preceding input that is being compared to the
current element to determine a similarity
• value: a value of a preceding element that gets weighted
and summed
[Figure: the query/key/value intuition for "The chicken didn't cross the road because it was too tired". The current token xi issues a query; each of the preceding tokens x1 … x7 supplies a key (compared against the query) and a value (weighted by the resulting self-attention distribution and summed) to build the layer k+1 representation from layer k.]
An Actual Attention Head: slightly more complicated
We'll use matrices to project each vector xi into a representation of its role as query, key, value:
• query: WQ
• key: WK
• value: WV
To capture these three different roles, transformers introduce weight matrices WQ, WK, and WV. These weights project each input vector xi into a representation of its role as a key, query, or value:

qi = xi WQ ;  ki = xi WK ;  vi = xi WV
An Actual Attention Head: slightly more complicated
Given these 3 representations of xi:
To compute the similarity of the current element xi with some prior element xj, we'll use the dot product between qi and kj.
And instead of summing up the xj, we'll sum up the vj.
The result of a dot product can be an arbitrarily large (positive or negative) value, and exponentiating large values can lead to numerical issues and loss of gradients during training. To avoid this, we scale the dot product by a factor related to the size of the key vectors, √dk.
Final equations for one attention head
We compute the output ai by summing the values of the prior elements, each weighted by the similarity of its key to the query from the current element:

qi = xi WQ ;  kj = xj WK ;  vj = xj WV

score(xi, xj) = (qi · kj) / √dk

αij = softmax(score(xi, xj))  ∀ j ≤ i

ai = Σ_{j≤i} αij vj
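A small NumPy sketch of one attention head following these equations (not from the lecture; helper names and toy dimensions are assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X, WQ, WK, WV):
    """One causal attention head, computed token by token as in the slide."""
    N, d = X.shape
    dk = WK.shape[1]
    outputs = []
    for i in range(N):
        q = X[i] @ WQ                        # query for the current token
        K = X[: i + 1] @ WK                  # keys for tokens 1..i
        V = X[: i + 1] @ WV                  # values for tokens 1..i
        scores = (K @ q) / np.sqrt(dk)       # score(x_i, x_j) = q_i . k_j / sqrt(d_k)
        alpha = softmax(scores)              # weights over j <= i
        outputs.append(alpha @ V)            # a_i = sum_{j<=i} alpha_ij v_j
    return np.stack(outputs)

# toy shapes: d = 8, d_k = d_v = 4, N = 5 tokens
d, dk, dv, N = 8, 4, 4, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d))
A = attention_head(X, rng.normal(size=(d, dk)),
                   rng.normal(size=(d, dk)), rng.normal(size=(d, dv)))
print(A.shape)  # (5, 4)
```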
Calculating the value of a3
[Figure: computing a3, the output of self-attention for the third token.]
1. Generate key, query, and value vectors for x1, x2, x3 (multiply each by Wk, Wq, Wv).
2. Compare x3's query with the keys for x1, x2, and x3.
3. Divide each score by √dk.
4. Turn the scores into weights α3,1, α3,2, α3,3 via softmax.
5. Weigh each value vector by its α.
6. Sum the weighted value vectors: the result is a3, the output of self-attention.
Actual Attention: slightly more complicated
• Instead of one attention head, we'll have lots of them!
• Intuition: each head might be attending to the context for different purposes
• Different linguistic relationships or patterns in the context
Each of the h heads has its own query, key, and value matrices:

qi^c = xi WQ^c ;  kj^c = xj WK^c ;  vj^c = xj WV^c   ∀ c, 1 ≤ c ≤ h

score^c(xi, xj) = (qi^c · kj^c) / √dk

αij^c = softmax(score^c(xi, xj))  ∀ j ≤ i

headi^c = Σ_{j≤i} αij^c vj^c

ai = (headi^1 ⊕ headi^2 ⊕ … ⊕ headi^h) WO     (⊕ = concatenation)

MultiHeadAttention(xi, [x1, ··· , xN]) = ai
Multi-head attention
[Figure: multi-head attention over the context xi-3 … xi. Each of the 8 heads has its own WK, WQ, WV (shapes [d × dk], [d × dk], [d × dv]) and attends differently to the context, producing a [1 × dv] output. The head outputs are concatenated into a [1 × hdv] vector and projected back down to [1 × d] by WO (shape [hdv × d]), giving ai.]
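A hedged sketch of this step that reuses the attention_head() function from the earlier sketch; the head count and dimensions are illustrative only:

```python
import numpy as np

def multi_head_attention(X, heads, WO):
    """Run h independent attention heads, concatenate their outputs, project with WO.
    `heads` is a list of (WQ, WK, WV) tuples; attention_head() is the earlier sketch."""
    head_outputs = [attention_head(X, WQ, WK, WV) for (WQ, WK, WV) in heads]
    concat = np.concatenate(head_outputs, axis=-1)   # shape [N, h*dv]
    return concat @ WO                               # project back down to [N, d]

# toy setup: d = 8, h = 2 heads, dk = dv = 4, so h*dv = 8
rng = np.random.default_rng(1)
d, h, dk, dv, N = 8, 2, 4, 4, 5
X = rng.normal(size=(N, d))
heads = [(rng.normal(size=(d, dk)), rng.normal(size=(d, dk)), rng.normal(size=(d, dv)))
         for _ in range(h)]
WO = rng.normal(size=(h * dv, d))
print(multi_head_attention(X, heads, WO).shape)  # (5, 8)
```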
Summary
Attention is a method for enriching the representation of a token by
incorporating contextual information
The result: the embedding for each word will be different in different
contexts!
Contextual embeddings: a representation of word meaning in its
context.
We'll see in the next lecture that attention can also be viewed as a
way to move information from one token to another.
Transformers
Attention
Transformers
The Transformer Block
Reminder: transformer language model
[Figure: the overview again — input tokens x1 … x5 ("So long and thanks for"), input encoding E, stacked transformer blocks, and a language modeling head (U + softmax) producing logits and the next tokens ("long and thanks for … all").]
The residual stream: each token gets passed up and
modified
[Figure: one transformer block on the residual stream. The input xi is layer-normed, passed through multi-head attention, added back to the stream, layer-normed again, passed through a feedforward layer, and added back to produce hi; the neighboring streams for xi-1 and xi+1 are processed the same way.]
We'll need nonlinearities, so a feedforward layer
[Figure: the same transformer block, highlighting the feedforward layer.]
The hidden layer of the feedforward network is larger than the model dimensionality d (for example, in the original transformer model, d = 512 and dff = 2048):

FFN(xi) = ReLU(xi W1 + b1) W2 + b2

Layer Norm: at two stages in the transformer block we normalize the vector (Ba et al., 2016). This process is called layer norm (short for layer normalization).
Layer norm: the vector xi is normalized twice
[Figure: the transformer block, highlighting the two layer norm steps: one before multi-head attention and one before the feedforward layer.]
Layer Norm
Layer norm is a variation of the z-score from statistics, applied to a single vector in a hidden layer. The input to layer norm is a single vector of dimensionality d, and the output is that vector normalized, again of dimensionality d. The first step in layer normalization is to calculate the mean, μ, and standard deviation, σ, over the elements of the vector to be normalized. Given an embedding vector x of dimensionality d, these values are calculated as follows:

μ = (1/d) Σ_{i=1..d} xi

σ = √( (1/d) Σ_{i=1..d} (xi − μ)² )

Given these values, the vector components are normalized by subtracting the mean from each and dividing by the standard deviation. The result of this computation is a new vector with zero mean and a standard deviation of one:

x̂ = (x − μ) / σ

Finally, in the standard implementation of layer normalization, two learnable parameters, γ and β, representing gain and offset values, are introduced:

LayerNorm(x) = γ (x − μ)/σ + β
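As a sketch, layer norm in NumPy for a single d-dimensional vector (the small eps added to σ is a common implementation detail, not part of the equations above):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer norm over one vector x: z-score its components, then apply
    the learnable gain (gamma) and offset (beta)."""
    mu = x.mean()                      # mean over the d components
    sigma = x.std()                    # standard deviation over the d components
    x_hat = (x - mu) / (sigma + eps)   # zero mean, unit std (eps avoids divide-by-zero)
    return gamma * x_hat + beta

d = 8
x = np.random.randn(d) * 3 + 5
out = layer_norm(x, gamma=np.ones(d), beta=np.zeros(d))
print(out.mean().round(6), out.std().round(3))  # ~0.0, ~1.0
```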
Putting together a single transformer block
Putting it all together: the function computed by a transformer block can be expressed by breaking it down with one equation for each component computation, using t (of shape [1 × d]) to stand for transformer and superscripts to demarcate each computation inside the block:

t_i^1 = LayerNorm(x_i)
t_i^2 = MultiHeadAttention(t_i^1, [x_1^1, ··· , x_N^1])
t_i^3 = t_i^2 + x_i
t_i^4 = LayerNorm(t_i^3)
t_i^5 = FFN(t_i^4)
h_i  = t_i^5 + t_i^3

Notice that the only component that takes as input information from other tokens (other residual streams) is multi-head attention.
[Figure: the transformer block diagram again.]
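A sketch of one transformer block following these equations; it reuses layer_norm() and multi_head_attention() from the earlier sketches, and the way parameters are packaged is an assumption:

```python
import numpy as np

def ffn(X, W1, b1, W2, b2):
    """Position-wise feedforward layer: FFN(x) = ReLU(x W1 + b1) W2 + b2."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

def transformer_block(X, heads, WO, ln1, ln2, ffn_params):
    """One block over the whole sequence (one row of X per token).
    ln1 and ln2 are (gain, offset) pairs; heads/WO as in multi_head_attention()."""
    g1, o1 = ln1
    g2, o2 = ln2
    T1 = np.stack([layer_norm(x, g1, o1) for x in X])   # t1_i = LayerNorm(x_i)
    T2 = multi_head_attention(T1, heads, WO)            # t2_i = MultiHeadAttention(t1_i, ...)
    T3 = T2 + X                                         # t3_i = t2_i + x_i   (residual)
    T4 = np.stack([layer_norm(t, g2, o2) for t in T3])  # t4_i = LayerNorm(t3_i)
    T5 = ffn(T4, *ffn_params)                           # t5_i = FFN(t4_i)
    return T5 + T3                                      # h_i  = t5_i + t3_i  (residual)
```

Wired to the toy setups above (d = 8, two heads, a hidden feedforward width of, say, 32), this returns an [N × d] matrix of hi vectors, one per residual stream.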
A transformer is a stack of these blocks
so all the vectors are of the same dimensionality d
[Figure: two transformer blocks stacked. The outputs h of Block 1 become the inputs x of Block 2; each block has the same layer norm / multi-head attention / feedforward structure with residual connections.]
Residual streams and attention
Notice that all parts of the transformer block apply to 1 residual stream (1
token).
Except attention, which takes information from other tokens
Elhage et al. (2021) show that we can view attention heads as literally moving
information from the residual stream of a neighboring token into the current
stream.
[Figure: an attention head moving information from Token A's residual stream into Token B's residual stream.]
Transformers
The Transformer Block
Transformers
Parallelizing Attention
Computation
Parallelizing computation using X
For attention/transformer block we've been computing a single
output at a single time step i in a single residual stream.
But we can pack the N tokens of the input sequence into a single
matrix X of size [N × d].
Each row of X is the embedding of one token of the input.
X can have 1K-32K rows, each of the dimensionality of the
embedding d (the model dimension)
Parallelizing attention: let's first see this for a single attention head, then turn to multiple heads, and then add in the rest of the components in the transformer block. For one head we multiply X by the query, key, and value matrices WQ of shape [d × dk], WK of shape [d × dk], and WV of shape [d × dv], to produce matrices Q of shape [N × dk], K of shape [N × dk], and V of shape [N × dv], containing all the query, key, and value vectors:

Q = X WQ ;  K = X WK ;  V = X WV

Given these matrices we can compute all the requisite query-key comparisons simultaneously by multiplying Q and Kᵀ in a single matrix multiplication.
QKᵀ: now we can do a single matrix multiply to combine Q and Kᵀ
[Figure: the N × N QKᵀ matrix; cell (i, j) contains qi·kj, so every query is compared against every key.]
Parallelizing attention
• Scale the scores, take the softmax, and then multiply the result by V, resulting in a matrix of shape N × d
• An attention vector for each input token
Once we have the QKᵀ matrix, we can very efficiently scale these scores, take the softmax, and multiply the result by V, giving an embedding representation for each token in the input. We've reduced the entire self-attention step for an entire sequence of N tokens for one head to the following computation:

A = softmax( mask( QKᵀ / √dk ) ) V
Masking out the future
• What is this mask function? QKᵀ has a score for each query dotted with every key, including those that follow the query.
• Guessing the next word is pretty simple if you already know it!
The self-attention computation as we've described it has a problem: the calculation in QKᵀ results in a score for each query against every key, including those that follow the query. This is inappropriate in the setting of language modeling: guessing the next word is pretty simple if you already know it! To fix this, the elements in the upper-triangular portion of the matrix are zeroed out (set to −∞), thus eliminating any knowledge of words that follow in the sequence.
Masking out the future
Add –∞ to cells in upper triangle
The softmax will turn it to 0
This is done in practice by adding a mask matrix M in which Mij = −∞ for all j > i (i.e. the upper-triangular portion) and Mij = 0 otherwise; after the softmax, the −∞ cells become 0.
[Figure: the masked N × N QKᵀ matrix — qi·kj on and below the diagonal, −∞ above it.]
Another point: attention is quadratic in length
The QKᵀ matrix has N × N entries, so the cost of self-attention grows quadratically with the length N of the input sequence.
[Figure: the masked N × N QKᵀ matrix again.]
Attention again
[Figure: the full parallelized attention dataflow for a 4-token input. X ([N × d], one row per input token) is multiplied by WQ ([d × dk]), WK ([d × dk]), and WV ([d × dv]) to give Q ([N × dk]), K ([N × dk]), and V ([N × dv]). QKᵀ ([N × N]) contains every qi·kj; the mask sets the upper triangle to −∞; scaling, softmax, and multiplication by V give A ([N × dv]), one attention output a1 … a4 per token.]
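The same dataflow as the figure, written as a vectorized NumPy sketch (the function name and toy shapes are assumptions, not the book's code):

```python
import numpy as np

def parallel_attention_head(X, WQ, WK, WV):
    """Whole-sequence causal attention for one head:
    A = softmax( mask( Q K^T / sqrt(d_k) ) ) V, using matrix multiplies."""
    N = X.shape[0]
    dk = WK.shape[1]
    Q, K, V = X @ WQ, X @ WK, X @ WV                  # [N,dk], [N,dk], [N,dv]
    scores = Q @ K.T / np.sqrt(dk)                    # [N,N]: every q_i . k_j, scaled
    mask = np.triu(np.full((N, N), -np.inf), k=1)     # -inf above the diagonal, 0 elsewhere
    scores = scores + mask                            # future positions get -inf
    scores -= scores.max(axis=-1, keepdims=True)      # stabilize the row-wise softmax
    weights = np.exp(scores)                          # masked (-inf) cells become 0 here
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                # [N,dv]: one attention output per row

# toy check against the shapes in the figure
rng = np.random.default_rng(2)
N, d, dk, dv = 4, 8, 4, 4
X = rng.normal(size=(N, d))
A = parallel_attention_head(X, rng.normal(size=(d, dk)),
                            rng.normal(size=(d, dk)), rng.normal(size=(d, dv)))
print(A.shape)  # (4, 4)
```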
Parallelizing Multi-head Attention
Each head i has its own projections; the heads are computed in parallel, concatenated, and projected by WO to give the self-attention output A of shape [N × d]:

Qi = X WQi ;  Ki = X WKi ;  Vi = X WVi

headi = SelfAttention(Qi, Ki, Vi) = softmax( Qi Kiᵀ / √dk ) Vi

MultiHeadAttention(X) = (head1 ⊕ head2 ⊕ … ⊕ headh) WO
Parallelizing Multi-head Attention
Putting it all together with the parallel input matrix X: the function computed in parallel by an entire transformer block layer over the entire N input tokens can be expressed as:

O = LayerNorm(X + MultiHeadAttention(X))
H = LayerNorm(O + FFN(O))

Or we can break it down with one equation for each component computation, using T (of shape [N × d]) to stand for transformer and superscripts to demarcate each computation inside the block:

T1 = MultiHeadAttention(X)
T2 = X + T1
T3 = LayerNorm(T2)
T4 = FFN(T3)
T5 = T4 + T3
H  = LayerNorm(T5)
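A sketch of this whole-sequence formulation; here mha and ffn stand in for any functions mapping [N × d] to [N × d] (for instance, the earlier multi-head attention and feedforward sketches):

```python
import numpy as np

def layer_norm_rows(X, gamma, beta, eps=1e-5):
    """Layer norm applied independently to each row (token) of X."""
    mu = X.mean(axis=-1, keepdims=True)
    sigma = X.std(axis=-1, keepdims=True)
    return gamma * (X - mu) / (sigma + eps) + beta

def transformer_block_parallel(X, mha, ffn, ln1, ln2):
    """O = LayerNorm(X + MultiHeadAttention(X));  H = LayerNorm(O + FFN(O))."""
    O = layer_norm_rows(X + mha(X), *ln1)
    H = layer_norm_rows(O + ffn(O), *ln2)
    return H
```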
Transformers
Parallelizing Attention
Computation
Transformers
Input and output: Position
embeddings and the Language
Model Head
Token and Position Embeddings
The matrix X (of shape [N × d]) has an embedding for
each word in the context.
This embedding is created by adding two distinct embeddings for each input:
• token embedding
• positional embedding
Token Embeddings
Embedding matrix E has shape [|V | × d ].
• One row for each of the |V | tokens in the vocabulary.
• Each word is a row vector of d dimensions
Given: string "Thanks for all the"
1. Tokenize with BPE and convert into vocab indices
w = [5,4000,10532,2224]
2. Select the corresponding rows from E, each row an embedding
• (row 5, row 4000, row 10532, row 2224).
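A tiny illustration of the lookup (the vocabulary size and model dimension here are made up):

```python
import numpy as np

V_size, d = 50_000, 8                 # hypothetical vocabulary size and model dimension
E = np.random.randn(V_size, d)        # embedding matrix: one row per vocabulary token
w = [5, 4000, 10532, 2224]            # BPE token ids for "Thanks for all the" (as on the slide)
token_embeddings = E[w]               # select rows 5, 4000, 10532, 2224 -> shape [4, d]
print(token_embeddings.shape)         # (4, 8)
```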
Position Embeddings
There are many methods, but we'll just describe the simplest: absolute
position.
Goal: learn a position embedding matrix Epos of shape [N × d], one d-dimensional row per position 1 … N.
Start with randomly initialized embeddings
As with word embeddings, these position embeddings are learned along
with other parameters during training.
Each x is just the sum of word and position embeddings
[Figure: composite embeddings X for "Janet will back the bill". The word embedding of each token is added to the position embedding for its index (1–5), and the resulting composite embeddings feed the transformer block.]
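A sketch of building the composite input X, assuming a learned absolute position embedding matrix with one row per position (all sizes illustrative):

```python
import numpy as np

N_ctx, V_size, d = 1024, 50_000, 8     # hypothetical context length, vocab size, model dim
E = np.random.randn(V_size, d)         # token embedding matrix
E_pos = np.random.randn(N_ctx, d)      # learned absolute position embeddings, one row per position

w = [5, 4000, 10532, 2224]             # token ids for the input
positions = np.arange(len(w))          # 0, 1, 2, 3
X = E[w] + E_pos[positions]            # composite embeddings: word + position, shape [4, d]
print(X.shape)                         # (4, 8)
```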
Language modeling head
[Figure: the language modeling head on top of transformer block layer L. The top-layer outputs hL1, hL2, … hLN (each 1 × d) correspond to tokens w1, w2, … wN. For the last token, hLN is multiplied by the unembedding layer = Eᵀ (shape d × |V|) to give logits u1 … u|V| (1 × |V|); a softmax over the vocabulary V turns them into word probabilities y1 … y|V| (1 × |V|).]
The Language Model Head takes hLN and outputs a distribution over the vocabulary V.
Language modeling head
[Figure: the same language modeling head, zoomed in on the last token wN.]
Unembedding layer: a linear layer that projects from hLN (shape [1 × d]) to the logit vector (shape [1 × |V|]).
Why "unembedding"? Because it is tied to Eᵀ: with weight tying, we use the same weights for two different matrices.
The unembedding layer maps from an embedding to a 1 × |V| vector of logits.
Language modeling head
[Figure: the language modeling head once more.]
Logits: the score vector u, with one score for each of the |V| possible words in the vocabulary V. Shape 1 × |V|.
Softmax turns the logits into probabilities over the vocabulary. Shape 1 × |V|.
This linear layer can be learned, but more commonly we tie this matrix to (the transpose of) the embedding matrix E. Recall that in weight tying, we use the same weights for two different matrices in the model. Thus at the input stage of the transformer the embedding matrix (of shape [|V| × d]) is used to map from a one-hot vector over the vocabulary (of shape [1 × |V|]) to an embedding (of shape [1 × d]). And then in the language model head, Eᵀ, the transpose of the embedding matrix (of shape [d × |V|]), is used to map back from an embedding (shape [1 × d]) to a vector over the vocabulary (shape [1 × |V|]). In the learning process, E will be optimized to be good at doing both of these mappings. We therefore sometimes call the transpose Eᵀ the unembedding layer because it is performing this reverse mapping.
A softmax layer turns the logits u into the probabilities y over the vocabulary:

u = hLN Eᵀ
y = softmax(u)
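A minimal sketch of the head, assuming a tied embedding matrix E and a top-layer vector hLN (toy sizes; greedy selection is used here just for illustration, sampling is another option):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                   # stability
    e = np.exp(z)
    return e / e.sum()

V_size, d = 50_000, 8                 # hypothetical sizes
E = np.random.randn(V_size, d)        # tied embedding matrix
h_L_N = np.random.randn(d)            # top-layer output for the last token, shape [1 x d]

u = h_L_N @ E.T                       # logits: u = hLN E^T, shape [1 x |V|]
y = softmax(u)                        # word probabilities over the vocabulary
next_token = int(np.argmax(y))        # pick the most probable next token
print(u.shape, y.sum().round(3), next_token)
```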
The final transformer
model
[Figure: the complete model. The input token wi is mapped by the input encoding E (token + position) to x1_i, passed through layers 1 … L (each: layer norm, attention, layer norm, feedforward, with h^k_i = x^{k+1}_i), and the top-layer output h^L_i goes to the language modeling head: unembedding layer U produces logits u1 … u|V|, a softmax gives token probabilities y1 … y|V|, and the token wi+1 to generate at position i+1 is sampled from them.]
Transformers
Input and output: Position
embeddings and the Language
Model Head