Transformers
Introduction to Transformers
LLMs are built out of transformers
Transformer: a specific kind of network architecture, like a
fancier feedforward network, but based on attention
[Figure: title page of the paper "Attention Is All You Need" (Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin; Google Brain, Google Research, University of Toronto, 2017).]
A very approximate timeline
1990 Static Word Embeddings
2003 Neural Language Model
2008 Multi-Task Learning
2015 Attention
2017 Transformer
2018 Contextual Word Embeddings and Pretraining
2019 Prompting
Transformers
Attention
Instead of starting with the big picture
[Figure: a transformer language model. Input tokens x1 … x5 ("So long and thanks for") pass through the input encoding E, then up through stacked transformer blocks, and into a language modeling head (unembedding layer U plus softmax) that produces logits and predicts the next token at each position ("long and thanks for … all").]
Let's consider the embeddings for an individual word from a particular layer
Problem with static embeddings (word2vec)
They are static! The embedding for a word doesn't reflect how its
meaning changes in context.
The chicken didn't cross the road because it was too tired
What is the meaning represented in the static embedding for "it"?
Contextual Embeddings
• Intuition: a representation of meaning of a word
should be different in different contexts!
• Contextual Embedding: each word has a different
vector that expresses different meanings
depending on the surrounding words
• How to compute contextual embeddings?
• Attention
Contextual Embeddings
The chicken didn't cross the road because it
What should be the properties of "it"?
The chicken didn't cross the road because it was too tired
The chicken didn't cross the road because it was too wide
At this point in the sentence, it's probably referring to either the chicken or the road
Intuition of attention
Build up the contextual embedding from a word by
selectively integrating information from all the
neighboring words
We say that a word "attends to" some neighboring
words more than others
Intuition of attention:
[Figure: two layers of representations for "The chicken didn't cross the road because it was too tired". The layer k+1 representation of "it" is computed from a self-attention distribution over the columns corresponding to the input tokens at layer k.]
Attention definition
A mechanism for helping compute the embedding for
a token by selectively attending to and integrating
information from surrounding tokens (at the previous
layer).
More formally: a method for doing a weighted sum of
vectors.
Attention is left-to-right
[Figure: a causal self-attention layer. Each output a1 … a5 attends only to the current and preceding inputs x1 … x5, never to the right.]
Simplified version of attention: a sum of prior words
weighted by their similarity with the current word
Given a sequence of token embeddings:
x1 x2 x3 x4 x5 x6 x7 xi
Produce: ai = a weighted sum of x1 through x7 (and xi)
Weighted by their similarity to xi
How do we compare words to other words? Since our representations for words are vectors, we'll make use of our old friend the dot product that we used for measuring similarity in Chapter 6. We'll call the result of this comparison between words i and j a score:

Version 1:  score(xi, xj) = xi · xj

The result of a dot product is a scalar value ranging from −∞ to ∞; the larger the value, the more similar the vectors being compared. Continuing with our example, the first step in computing a3 would be to compute three scores: x3·x1, x3·x2, and x3·x3. Then, to make effective use of these scores, we normalize them with a softmax to create a vector of weights αij that indicates the proportional relevance of each input to the input element i that is the current focus of attention:

αij = softmax(score(xi, xj))  ∀ j ≤ i
    = exp(score(xi, xj)) / Σ_{k=1..i} exp(score(xi, xk))  ∀ j ≤ i

Finally, the output ai is a sum of the inputs seen so far, each weighted by its α value:

ai = Σ_{j≤i} αij xj
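Not part of the original slides: a minimal NumPy sketch of this "Version 1" attention, under the assumption that X stacks the token embeddings row-wise; the function name is made up for illustration.

```python
import numpy as np

def simplified_attention(X):
    """Version-1 attention: a_i is a sum of x_1..x_i, weighted by the
    softmax of their dot-product similarity to the current token x_i."""
    N, d = X.shape
    A = np.zeros_like(X)
    for i in range(N):
        scores = X[: i + 1] @ X[i]                  # score(x_i, x_j) = x_i . x_j  for j <= i
        weights = np.exp(scores - scores.max())     # softmax over the prefix (stable form)
        weights /= weights.sum()
        A[i] = weights @ X[: i + 1]                 # a_i = sum_{j<=i} alpha_ij x_j
    return A

# toy example: 5 tokens, 4-dimensional embeddings
X = np.random.randn(5, 4)
print(simplified_attention(X).shape)  # (5, 4)
```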
Intuition of attention:
[Figure: the same attention-distribution figure, now with the layer-k tokens of "The chicken didn't cross the road because it was too tired" labeled x1 x2 x3 x4 x5 x6 x7 … xi; the layer k+1 representation of the current token xi is built from a self-attention distribution over the columns corresponding to the input tokens.]
An Actual Attention Head: slightly more complicated
High-level idea: instead of using vectors (like xi and x4)
directly, we'll represent 3 separate roles each vector xi plays:
• query: As the current element being compared to the
preceding inputs.
• key: as a preceding input that is being compared to the
current element to determine a similarity
• value: a value of a preceding element that gets weighted
and summed
[Figure: the query/key/value intuition for "The chicken didn't cross the road because it was too tired". The current token xi issues a query; each of the preceding tokens x1 … x7 supplies a key (compared against the query) and a value (weighted by the resulting self-attention distribution and summed) to build the layer k+1 representation from layer k.]
An Actual Attention Head: slightly more complicated
We'll use matrices to project each vector xi into a representation of its role as query, key, value:
• query: WQ
• key: WK
• value: WV
To capture these three different roles, transformers introduce weight matrices WQ, WK, and WV. These weights project each input vector xi into a representation of its role as a key, query, or value:

qi = xi WQ ;  ki = xi WK ;  vi = xi WV
An Actual Attention Head: slightly more complicated
Given these 3 representations of xi:
To compute the similarity of the current element xi with some prior element xj, we'll use the dot product between qi and kj.
And instead of summing up the xj, we'll sum up the vj.
The result of a dot product can be an arbitrarily large (positive or negative) value, and exponentiating large values can lead to numerical issues and loss of gradients during training. To avoid this, we scale the dot product by a factor related to the size of the key vectors, √dk.
Final equations for one attention head
We compute the output ai by summing the values of the prior elements, each weighted by the similarity of its key to the query from the current element:

qi = xi WQ ;  kj = xj WK ;  vj = xj WV

score(xi, xj) = (qi · kj) / √dk

αij = softmax(score(xi, xj))  ∀ j ≤ i

ai = Σ_{j≤i} αij vj
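A small NumPy sketch of one attention head following these equations (not from the lecture; helper names and toy dimensions are assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X, WQ, WK, WV):
    """One causal attention head, computed token by token as in the slide."""
    N, d = X.shape
    dk = WK.shape[1]
    outputs = []
    for i in range(N):
        q = X[i] @ WQ                        # query for the current token
        K = X[: i + 1] @ WK                  # keys for tokens 1..i
        V = X[: i + 1] @ WV                  # values for tokens 1..i
        scores = (K @ q) / np.sqrt(dk)       # score(x_i, x_j) = q_i . k_j / sqrt(d_k)
        alpha = softmax(scores)              # weights over j <= i
        outputs.append(alpha @ V)            # a_i = sum_{j<=i} alpha_ij v_j
    return np.stack(outputs)

# toy shapes: d = 8, d_k = d_v = 4, N = 5 tokens
d, dk, dv, N = 8, 4, 4, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d))
A = attention_head(X, rng.normal(size=(d, dk)),
                   rng.normal(size=(d, dk)), rng.normal(size=(d, dv)))
print(A.shape)  # (5, 4)
```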
Calculating the value of a3
[Figure: computing a3, the output of self-attention for the third token.]
1. Generate key, query, and value vectors for x1, x2, x3 (multiply each by Wk, Wq, Wv).
2. Compare x3's query with the keys for x1, x2, and x3.
3. Divide each score by √dk.
4. Turn the scores into weights α3,1, α3,2, α3,3 via softmax.
5. Weigh each value vector by its α.
6. Sum the weighted value vectors: the result is a3, the output of self-attention.
Actual Attention: slightly more complicated
• Instead of one attention head, we'll have lots of them!
• Intuition: each head might be attending to the context for different purposes
• Different linguistic relationships or patterns in the context
Each of the h heads has its own query, key, and value matrices:

qi^c = xi WQ^c ;  kj^c = xj WK^c ;  vj^c = xj WV^c   ∀ c, 1 ≤ c ≤ h

score^c(xi, xj) = (qi^c · kj^c) / √dk

αij^c = softmax(score^c(xi, xj))  ∀ j ≤ i

headi^c = Σ_{j≤i} αij^c vj^c

ai = (headi^1 ⊕ headi^2 ⊕ … ⊕ headi^h) WO     (⊕ = concatenation)

MultiHeadAttention(xi, [x1, ··· , xN]) = ai
Multi-head attention
[Figure: multi-head attention over the context xi-3 … xi. Each of the 8 heads has its own WK, WQ, WV (shapes [d × dk], [d × dk], [d × dv]) and attends differently to the context, producing a [1 × dv] output. The head outputs are concatenated into a [1 × hdv] vector and projected back down to [1 × d] by WO (shape [hdv × d]), giving ai.]
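A hedged sketch of this step that reuses the attention_head() function from the earlier sketch; the head count and dimensions are illustrative only:

```python
import numpy as np

def multi_head_attention(X, heads, WO):
    """Run h independent attention heads, concatenate their outputs, project with WO.
    `heads` is a list of (WQ, WK, WV) tuples; attention_head() is the earlier sketch."""
    head_outputs = [attention_head(X, WQ, WK, WV) for (WQ, WK, WV) in heads]
    concat = np.concatenate(head_outputs, axis=-1)   # shape [N, h*dv]
    return concat @ WO                               # project back down to [N, d]

# toy setup: d = 8, h = 2 heads, dk = dv = 4, so h*dv = 8
rng = np.random.default_rng(1)
d, h, dk, dv, N = 8, 2, 4, 4, 5
X = rng.normal(size=(N, d))
heads = [(rng.normal(size=(d, dk)), rng.normal(size=(d, dk)), rng.normal(size=(d, dv)))
         for _ in range(h)]
WO = rng.normal(size=(h * dv, d))
print(multi_head_attention(X, heads, WO).shape)  # (5, 8)
```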
Summary
Attention is a method for enriching the representation of a token by
incorporating contextual information
The result: the embedding for each word will be different in different
contexts!
Contextual embeddings: a representation of word meaning in its
context.
We'll see in the next lecture that attention can also be viewed as a
way to move information from one token to another.
Transformers
Attention
Transformers
The Transformer Block
Reminder: transformer language model
[Figure: the overview again — input tokens x1 … x5 ("So long and thanks for"), input encoding E, stacked transformer blocks, and a language modeling head (U + softmax) producing logits and the next tokens ("long and thanks for … all").]
The residual stream: each token gets passed up and
modified
[Figure: one transformer block on the residual stream. The input xi is layer-normed, passed through multi-head attention, added back to the stream, layer-normed again, passed through a feedforward layer, and added back to produce hi; the neighboring streams for xi-1 and xi+1 are processed the same way.]
We'll need nonlinearities, so a feedforward layer
[Figure: the same transformer block, highlighting the feedforward layer.]
The hidden layer of the feedforward network is larger than the model dimensionality d (for example, in the original transformer model, d = 512 and dff = 2048):

FFN(xi) = ReLU(xi W1 + b1) W2 + b2

Layer Norm: at two stages in the transformer block we normalize the vector (Ba et al., 2016). This process is called layer norm (short for layer normalization).
Layer norm: the vector xi is normalized twice
[Figure: the transformer block, highlighting the two layer norm steps: one before multi-head attention and one before the feedforward layer.]
Layer Norm
Layer norm is a variation of the z-score from statistics, applied to a single vector in a hidden layer. The input to layer norm is a single vector of dimensionality d, and the output is that vector normalized, again of dimensionality d. The first step in layer normalization is to calculate the mean, μ, and standard deviation, σ, over the elements of the vector to be normalized. Given an embedding vector x of dimensionality d, these values are calculated as follows:

μ = (1/d) Σ_{i=1..d} xi

σ = √( (1/d) Σ_{i=1..d} (xi − μ)² )

Given these values, the vector components are normalized by subtracting the mean from each and dividing by the standard deviation. The result of this computation is a new vector with zero mean and a standard deviation of one:

x̂ = (x − μ) / σ

Finally, in the standard implementation of layer normalization, two learnable parameters, γ and β, representing gain and offset values, are introduced:

LayerNorm(x) = γ (x − μ)/σ + β
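As a sketch, layer norm in NumPy for a single d-dimensional vector (the small eps added to σ is a common implementation detail, not part of the equations above):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer norm over one vector x: z-score its components, then apply
    the learnable gain (gamma) and offset (beta)."""
    mu = x.mean()                      # mean over the d components
    sigma = x.std()                    # standard deviation over the d components
    x_hat = (x - mu) / (sigma + eps)   # zero mean, unit std (eps avoids divide-by-zero)
    return gamma * x_hat + beta

d = 8
x = np.random.randn(d) * 3 + 5
out = layer_norm(x, gamma=np.ones(d), beta=np.zeros(d))
print(out.mean().round(6), out.std().round(3))  # ~0.0, ~1.0
```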
Putting together a single transformer block
Putting it all together: the function computed by a transformer block can be expressed by breaking it down with one equation for each component computation, using t (of shape [1 × d]) to stand for transformer and superscripts to demarcate each computation inside the block:

t_i^1 = LayerNorm(x_i)
t_i^2 = MultiHeadAttention(t_i^1, [x_1^1, ··· , x_N^1])
t_i^3 = t_i^2 + x_i
t_i^4 = LayerNorm(t_i^3)
t_i^5 = FFN(t_i^4)
h_i  = t_i^5 + t_i^3

Notice that the only component that takes as input information from other tokens (other residual streams) is multi-head attention.
[Figure: the transformer block diagram again.]
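A sketch of one transformer block following these equations; it reuses layer_norm() and multi_head_attention() from the earlier sketches, and the way parameters are packaged is an assumption:

```python
import numpy as np

def ffn(X, W1, b1, W2, b2):
    """Position-wise feedforward layer: FFN(x) = ReLU(x W1 + b1) W2 + b2."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

def transformer_block(X, heads, WO, ln1, ln2, ffn_params):
    """One block over the whole sequence (one row of X per token).
    ln1 and ln2 are (gain, offset) pairs; heads/WO as in multi_head_attention()."""
    g1, o1 = ln1
    g2, o2 = ln2
    T1 = np.stack([layer_norm(x, g1, o1) for x in X])   # t1_i = LayerNorm(x_i)
    T2 = multi_head_attention(T1, heads, WO)            # t2_i = MultiHeadAttention(t1_i, ...)
    T3 = T2 + X                                         # t3_i = t2_i + x_i   (residual)
    T4 = np.stack([layer_norm(t, g2, o2) for t in T3])  # t4_i = LayerNorm(t3_i)
    T5 = ffn(T4, *ffn_params)                           # t5_i = FFN(t4_i)
    return T5 + T3                                      # h_i  = t5_i + t3_i  (residual)
```

Wired to the toy setups above (d = 8, two heads, a hidden feedforward width of, say, 32), this returns an [N × d] matrix of hi vectors, one per residual stream.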
A transformer is a stack of these blocks
so all the vectors are of the same dimensionality d
[Figure: two transformer blocks stacked. The outputs h of Block 1 become the inputs x of Block 2; each block has the same layer norm / multi-head attention / feedforward structure with residual connections.]
Residual streams and attention
Notice that all parts of the transformer block apply to 1 residual stream (1
token).
Except attention, which takes information from other tokens
Elhage et al. (2021) show that we can view attention heads as literally moving
information from the residual stream of a neighboring token into the current
stream.
[Figure: an attention head moving information from Token A's residual stream into Token B's residual stream.]
Transformers
The Transformer Block
Transformers
Parallelizing Attention
Computation
Parallelizing computation using X
For attention/transformer block we've been computing a single
output at a single time step i in a single residual stream.
But we can pack the N tokens of the input sequence into a single
matrix X of size [N × d].
Each row of X is the embedding of one token of the input.
X can have 1K-32K rows, each of the dimensionality of the
embedding d (the model dimension)
Parallelizing attention: let's first see this for a single attention head, then turn to multiple heads, and then add in the rest of the components in the transformer block. For one head we multiply X by the query, key, and value matrices WQ of shape [d × dk], WK of shape [d × dk], and WV of shape [d × dv], to produce matrices Q of shape [N × dk], K of shape [N × dk], and V of shape [N × dv], containing all the query, key, and value vectors:

Q = X WQ ;  K = X WK ;  V = X WV

Given these matrices we can compute all the requisite query-key comparisons simultaneously by multiplying Q and Kᵀ in a single matrix multiplication.
QKᵀ: now we can do a single matrix multiply to combine Q and Kᵀ
[Figure: the N × N QKᵀ matrix; cell (i, j) contains qi·kj, so every query is compared against every key.]
Parallelizing attention
• Scale the scores, take the softmax, and then multiply the result by V, resulting in a matrix of shape N × d
• An attention vector for each input token
Once we have the QKᵀ matrix, we can very efficiently scale these scores, take the softmax, and multiply the result by V, giving an embedding representation for each token in the input. We've reduced the entire self-attention step for an entire sequence of N tokens for one head to the following computation:

A = softmax( mask( QKᵀ / √dk ) ) V
Masking out the future
• What is this mask function? QKᵀ has a score for each query dotted with every key, including those that follow the query.
• Guessing the next word is pretty simple if you already know it!
The self-attention computation as we've described it has a problem: the calculation in QKᵀ results in a score for each query against every key, including those that follow the query. This is inappropriate in the setting of language modeling: guessing the next word is pretty simple if you already know it! To fix this, the elements in the upper-triangular portion of the matrix are zeroed out (set to −∞), thus eliminating any knowledge of words that follow in the sequence.
Masking out the future
Add –∞ to cells in upper triangle
The softmax will turn it to 0
This is done in practice by adding a mask matrix M in which Mij = −∞ for all j > i (i.e. the upper-triangular portion) and Mij = 0 otherwise; after the softmax, the −∞ cells become 0.
[Figure: the masked N × N QKᵀ matrix — qi·kj on and below the diagonal, −∞ above it.]
Another point: attention is quadratic in length
The QKᵀ matrix has N × N entries, so the cost of self-attention grows quadratically with the length N of the input sequence.
[Figure: the masked N × N QKᵀ matrix again.]
Attention again
[Figure: the full parallelized attention dataflow for a 4-token input. X ([N × d], one row per input token) is multiplied by WQ ([d × dk]), WK ([d × dk]), and WV ([d × dv]) to give Q ([N × dk]), K ([N × dk]), and V ([N × dv]). QKᵀ ([N × N]) contains every qi·kj; the mask sets the upper triangle to −∞; scaling, softmax, and multiplication by V give A ([N × dv]), one attention output a1 … a4 per token.]
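The same dataflow as the figure, written as a vectorized NumPy sketch (the function name and toy shapes are assumptions, not the book's code):

```python
import numpy as np

def parallel_attention_head(X, WQ, WK, WV):
    """Whole-sequence causal attention for one head:
    A = softmax( mask( Q K^T / sqrt(d_k) ) ) V, using matrix multiplies."""
    N = X.shape[0]
    dk = WK.shape[1]
    Q, K, V = X @ WQ, X @ WK, X @ WV                  # [N,dk], [N,dk], [N,dv]
    scores = Q @ K.T / np.sqrt(dk)                    # [N,N]: every q_i . k_j, scaled
    mask = np.triu(np.full((N, N), -np.inf), k=1)     # -inf above the diagonal, 0 elsewhere
    scores = scores + mask                            # future positions get -inf
    scores -= scores.max(axis=-1, keepdims=True)      # stabilize the row-wise softmax
    weights = np.exp(scores)                          # masked (-inf) cells become 0 here
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                # [N,dv]: one attention output per row

# toy check against the shapes in the figure
rng = np.random.default_rng(2)
N, d, dk, dv = 4, 8, 4, 4
X = rng.normal(size=(N, d))
A = parallel_attention_head(X, rng.normal(size=(d, dk)),
                            rng.normal(size=(d, dk)), rng.normal(size=(d, dv)))
print(A.shape)  # (4, 4)
```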
Parallelizing Multi-head Attention
Each head i has its own projections; the heads are computed in parallel, concatenated, and projected by WO to give the self-attention output A of shape [N × d]:

Qi = X WQi ;  Ki = X WKi ;  Vi = X WVi

headi = SelfAttention(Qi, Ki, Vi) = softmax( Qi Kiᵀ / √dk ) Vi

MultiHeadAttention(X) = (head1 ⊕ head2 ⊕ … ⊕ headh) WO
Parallelizing Multi-head Attention
Putting it all together with the parallel input matrix X: the function computed in parallel by an entire transformer block layer over the entire N input tokens can be expressed as:

O = LayerNorm(X + MultiHeadAttention(X))
H = LayerNorm(O + FFN(O))

Or we can break it down with one equation for each component computation, using T (of shape [N × d]) to stand for transformer and superscripts to demarcate each computation inside the block:

T1 = MultiHeadAttention(X)
T2 = X + T1
T3 = LayerNorm(T2)
T4 = FFN(T3)
T5 = T4 + T3
H  = LayerNorm(T5)
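A sketch of this whole-sequence formulation; here mha and ffn stand in for any functions mapping [N × d] to [N × d] (for instance, the earlier multi-head attention and feedforward sketches):

```python
import numpy as np

def layer_norm_rows(X, gamma, beta, eps=1e-5):
    """Layer norm applied independently to each row (token) of X."""
    mu = X.mean(axis=-1, keepdims=True)
    sigma = X.std(axis=-1, keepdims=True)
    return gamma * (X - mu) / (sigma + eps) + beta

def transformer_block_parallel(X, mha, ffn, ln1, ln2):
    """O = LayerNorm(X + MultiHeadAttention(X));  H = LayerNorm(O + FFN(O))."""
    O = layer_norm_rows(X + mha(X), *ln1)
    H = layer_norm_rows(O + ffn(O), *ln2)
    return H
```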
Transformers
Parallelizing Attention
Computation
Transformers
Input and output: Position
embeddings and the Language
Model Head
Token and Position Embeddings
The matrix X (of shape [N × d]) has an embedding for
each word in the context.
This embedding is created by adding two distinct embeddings for each input:
• token embedding
• positional embedding
Token Embeddings
Embedding matrix E has shape [|V | × d ].
• One row for each of the |V | tokens in the vocabulary.
• Each word is a row vector of d dimensions
Given: string "Thanks for all the"
1. Tokenize with BPE and convert into vocab indices
w = [5,4000,10532,2224]
2. Select the corresponding rows from E, each row an embedding
• (row 5, row 4000, row 10532, row 2224).
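A tiny illustration of the lookup (the vocabulary size and model dimension here are made up):

```python
import numpy as np

V_size, d = 50_000, 8                 # hypothetical vocabulary size and model dimension
E = np.random.randn(V_size, d)        # embedding matrix: one row per vocabulary token
w = [5, 4000, 10532, 2224]            # BPE token ids for "Thanks for all the" (as on the slide)
token_embeddings = E[w]               # select rows 5, 4000, 10532, 2224 -> shape [4, d]
print(token_embeddings.shape)         # (4, 8)
```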
Position Embeddings
There are many methods, but we'll just describe the simplest: absolute
position.
Goal: learn a position embedding matrix Epos of shape [N × d], one d-dimensional row per position 1 … N.
Start with randomly initialized embeddings
As with word embeddings, these position embeddings are learned along
with other parameters during training.
Each x is just the sum of word and position embeddings
[Figure: composite embeddings X for "Janet will back the bill". The word embedding of each token is added to the position embedding for its index (1–5), and the resulting composite embeddings feed the transformer block.]
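A sketch of building the composite input X, assuming a learned absolute position embedding matrix with one row per position (all sizes illustrative):

```python
import numpy as np

N_ctx, V_size, d = 1024, 50_000, 8     # hypothetical context length, vocab size, model dim
E = np.random.randn(V_size, d)         # token embedding matrix
E_pos = np.random.randn(N_ctx, d)      # learned absolute position embeddings, one row per position

w = [5, 4000, 10532, 2224]             # token ids for the input
positions = np.arange(len(w))          # 0, 1, 2, 3
X = E[w] + E_pos[positions]            # composite embeddings: word + position, shape [4, d]
print(X.shape)                         # (4, 8)
```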
Language modeling head
[Figure: the language modeling head on top of transformer block layer L. The top-layer outputs hL1, hL2, … hLN (each 1 × d) correspond to tokens w1, w2, … wN. For the last token, hLN is multiplied by the unembedding layer = Eᵀ (shape d × |V|) to give logits u1 … u|V| (1 × |V|); a softmax over the vocabulary V turns them into word probabilities y1 … y|V| (1 × |V|).]
The Language Model Head takes hLN and outputs a distribution over the vocabulary V.
Language modeling head
[Figure: the same language modeling head, zoomed in on the last token wN.]
Unembedding layer: a linear layer that projects from hLN (shape [1 × d]) to the logit vector (shape [1 × |V|]).
Why "unembedding"? Because it is tied to Eᵀ: with weight tying, we use the same weights for two different matrices.
The unembedding layer maps from an embedding to a 1 × |V| vector of logits.
Language modeling head
[Figure: the language modeling head once more.]
Logits: the score vector u, with one score for each of the |V| possible words in the vocabulary V. Shape 1 × |V|.
Softmax turns the logits into probabilities over the vocabulary. Shape 1 × |V|.
This linear layer can be learned, but more commonly we tie this matrix to (the transpose of) the embedding matrix E. Recall that in weight tying, we use the same weights for two different matrices in the model. Thus at the input stage of the transformer the embedding matrix (of shape [|V| × d]) is used to map from a one-hot vector over the vocabulary (of shape [1 × |V|]) to an embedding (of shape [1 × d]). And then in the language model head, Eᵀ, the transpose of the embedding matrix (of shape [d × |V|]), is used to map back from an embedding (shape [1 × d]) to a vector over the vocabulary (shape [1 × |V|]). In the learning process, E will be optimized to be good at doing both of these mappings. We therefore sometimes call the transpose Eᵀ the unembedding layer because it is performing this reverse mapping.
A softmax layer turns the logits u into the probabilities y over the vocabulary:

u = hLN Eᵀ
y = softmax(u)
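A minimal sketch of the head, assuming a tied embedding matrix E and a top-layer vector hLN (toy sizes; greedy selection is used here just for illustration, sampling is another option):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                   # stability
    e = np.exp(z)
    return e / e.sum()

V_size, d = 50_000, 8                 # hypothetical sizes
E = np.random.randn(V_size, d)        # tied embedding matrix
h_L_N = np.random.randn(d)            # top-layer output for the last token, shape [1 x d]

u = h_L_N @ E.T                       # logits: u = hLN E^T, shape [1 x |V|]
y = softmax(u)                        # word probabilities over the vocabulary
next_token = int(np.argmax(y))        # pick the most probable next token
print(u.shape, y.sum().round(3), next_token)
```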
The final transformer
model
[Figure: the complete model. The input token wi is mapped by the input encoding E (token + position) to x1_i, passed through layers 1 … L (each: layer norm, attention, layer norm, feedforward, with h^k_i = x^{k+1}_i), and the top-layer output h^L_i goes to the language modeling head: unembedding layer U produces logits u1 … u|V|, a softmax gives token probabilities y1 … y|V|, and the token wi+1 to generate at position i+1 is sampled from them.]
Transformers
Input and output: Position
embeddings and the Language
Model Head