Representation Capacity
CSE665: Large Language Models
Outline
Section 1: N-Gram and MLP Models
Section 2: RNNs and LSTM Models
Section 3: Transformers
Section 4: Esoteric “Transformer” Architectures
Section 5: Towards Natural Language Understanding
2
N-Gram and MLP Models
3
Fill in the blank!
Q: Please pass the salt and pepper to ___
1) me
2) coffee
3) yes
4) refrigerator
In the first place, what are language models?
Language models assign probabilities to sequences of words
Example: Find the probability of a word w given some sample text history h
w = “the”, h = “its water is so transparent that”
Chain rule of probability: decompose the joint probability into conditional probabilities of each word given the previous words
P(“its water is so transparent that the”) = P(“its”) x P(“water” | “its”) x P(“is” | “its water”) x … x
P(“the” | “its water is so transparent that”)
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
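In general notation (a standard restatement of the chain rule, not taken from the slide):

```latex
P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
```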
bi-gram
Approximates the conditional probability of a word given its full history by using only
the conditional probability of the preceding word
P(“its water is so transparent that the”) ≈ P(“its”) x P(“water” | “its”) x P(“is” |
“water”) x … x P(“the” | “that”)
P(“the” | “its water is so transparent that”) ≈ P(“the” | “that”)
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
bi-gram
Approximates the conditional probability of a word given its full history by using only
the conditional probability of the preceding word
We assume we can predict the probability of a future unit without
looking far into the past (a Markov assumption)
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
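Written out in standard notation (not from the slide), the bigram (first-order Markov) approximation is:

```latex
P(w_i \mid w_1, \ldots, w_{i-1}) \;\approx\; P(w_i \mid w_{i-1})
```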
n-gram
Approximates the conditional probability of a word given its full history by using only the
conditional probability of the preceding (n-1) words
To estimate these probabilities, we use maximum likelihood estimation:
1) Getting counts of the n-grams from a given corpus
2) Normalizing the counts so they lie between 0 and 1
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
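A minimal Python sketch of these two steps for the bigram case (the toy corpus and names are illustrative): the MLE estimate is P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1}).

```python
from collections import Counter

corpus = "its water is so transparent that the water is clear".split()

# Step 1: get counts of unigrams and bigrams from the corpus
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

# Step 2: normalize bigram counts by the count of the preceding word
def bigram_prob(prev_word, word):
    """MLE estimate of P(word | prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("water", "is"))   # 1.0 in this toy corpus
print(bigram_prob("water", "the"))  # 0.0 -- the sparse-data problem discussed next
```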
Issues with “count-based” approximations
Language is a creative exercise: many different permutations of
words can express the same meaning.
The probability of an n-gram absent from the corpus is zero unless this is
explicitly dealt with (the sparse-data problem).
Large amount of memory required - for a language with a
vocabulary of V words and an n-gram language model, we would
need to store on the order of V^n values (e.g., V = 50,000 and n = 3 already gives ~1.25 x 10^14)
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
A neural probabilistic language model
1) Take into consideration words of similar meaning
2) Should be able to take into consideration “longer” contexts
without incurring large memory costs
Ex: “The cat is walking in the bedroom” should be similar to “A dog
was running in a room”
TL;DR: Instead of storing counts for all the permutations, let's learn an embedding of each token in the
vocab given the context.
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
A neural probabilistic language model
Main ideas:
1) Associating each word in the vocabulary with a word feature
vector
2) Expressing the joint probability function of word sequences in
terms of the word feature vectors
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
A neural probabilistic language model
Ex: “The cat is walking in the bedroom” should be similar to “A dog
was running in a room”
“the” should be similar to “a”
“bedroom” should be similar to “room”
“running” should be similar to “walking”
Hence, we should be able to generalize from “The cat is walking in
the bedroom” to “A dog was running in a room”, as similar words are
expected to have similar feature vectors
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
Neural network architecture proposed by Bengio et al., 2003
Embedding Layer
Hidden Layer
Probability Layer
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
A neural probabilistic language model
A neural probabilistic language model
Embedding Layer
A mapping from a word to a vector that describes it
Represented by a |V| x m matrix, where |V| represents the size of the
vocabulary and m represents the size of each feature vector (30 - 100 in Bengio et
al., 2003)
Embeddings here are trained via the task at hand (predicting the
next word given a context)
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
Hidden Layer
Transforms the input sequence of feature vectors and captures
contextual information
In the paper, a multi-layer perceptron was used, with hyperbolic
tangent (tanh) activations when a hidden layer is included
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
Probability Layer
Produces a probability distribution over the words in the vocabulary
through the use of the softmax function
The output is a |V|-dimensional vector, where the i-th entry represents P(w_t = i | context)
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
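A minimal PyTorch sketch of these three layers (the dimensions and names below are illustrative assumptions; the original paper also allows optional direct connections from the embeddings to the output, omitted here):

```python
import torch
import torch.nn as nn

class NPLM(nn.Module):
    """Sketch in the spirit of Bengio et al. (2003): embed the (n-1) context
    words, pass the concatenation through a tanh hidden layer, then softmax."""
    def __init__(self, vocab_size, emb_dim=60, context_size=3, hidden_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)                # embedding layer
        self.hidden = nn.Linear(context_size * emb_dim, hidden_dim)   # hidden layer
        self.out = nn.Linear(hidden_dim, vocab_size)                  # probability layer

    def forward(self, context_ids):                          # (batch, context_size)
        e = self.embed(context_ids).flatten(start_dim=1)     # concatenate feature vectors
        h = torch.tanh(self.hidden(e))                       # tanh hidden units
        return torch.log_softmax(self.out(h), dim=-1)        # log P(w_t | context)

model = NPLM(vocab_size=10_000)
log_probs = model(torch.randint(0, 10_000, (8, 3)))          # a batch of 8 contexts
print(log_probs.shape)                                       # torch.Size([8, 10000])
```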
Effectiveness of the model
Test perplexity improvement of about 24% compared to n-gram models
Able to take advantage of longer contexts (going from 2-gram to 4-gram contexts
benefitted the MLP approach, but not the n-gram approach)
Inclusion of the hyperbolic tangent functions as hidden units
improved the perplexity of a given model
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
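For reference, perplexity (the metric behind the 24% figure above) is the inverse probability of the test set normalized by the number of words; lower is better (standard definition, not stated on the slide):

```latex
\mathrm{PP}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
= \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1})\right)
```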
Limitations of N-gram models
Not able to grasp relations longer than the window size (learning a 6-word
relation with a 5-gram neural model is not possible)
Cannot model “memory” in the network (an n-gram model only has
the context of the preceding (n-1) words).
The man went to the bank to deposit a check.
The children played by the river bank.
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
Need for Sequential modelling
In an N-gram model, each word has only one embedding; it does not take
context (memory) into account (see the sketch below).
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
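A tiny sketch of the problem (the vocabulary and dimensions are made up): a static embedding table returns the identical vector for “bank” in both sentences from the previous slide, so the representation cannot reflect context.

```python
import torch
import torch.nn as nn

vocab = {"bank": 0, "river": 1, "deposit": 2}
embed = nn.Embedding(len(vocab), 8)              # one fixed vector per word

# "bank" gets the same embedding whether it means a financial
# institution or a river bank -- context is not taken into account.
v_deposit = embed(torch.tensor(vocab["bank"]))   # ...went to the bank to deposit a check
v_river   = embed(torch.tensor(vocab["bank"]))   # ...played by the river bank
print(torch.equal(v_deposit, v_river))           # True
```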
Outline
Section 1: N-Gram and MLP Models
Section 2: RNNs and LSTM Models
Section 3: Transformers
Section 4: Esoteric “Transformer” Architectures
Section 5: Towards Natural Language Understanding
22
RNNs and LSTMs (SKIP)
23
Transformer
24
Agenda: Transformer
- What is a transformer?
- Encoder and decoder.
- Self attention.
- Probing
- Attention heads;
- Feedforward layers.
25
What is a transformer?
- Motivation
26
Source: Attention is all you need NIPS ‘17.
What is a transformer?
- Encoder (left) and the decoder
(right)
- Q2: What is the connection
between encoder and decoder?
- Q3: Which transformer
components are the following
models using?
- BERT (masked language
modeling) uses which
component?
- T5 (seq2seq)?
- GPT (text generation)?
27
Source: Attention is all you need NIPS ‘17.
Encoder
Components in encoder:
- Multi-head attention
- FF
- Add & Norm
28
Source: Attention is all you need NIPS ‘17.
Primarily for optimization; we skip them in class.
Recall attention
29
Source: NUS CS4248 Natural Language Processing
Self-attention in Transformer
- Q: Query,
- K: Key
- V: Values
- Motivation?
- Similar to dot-product attention.
- Scaling factor 1/√d_k for stable gradients.
30
Source: Attention is all you need NIPS ‘17.
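A minimal PyTorch sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V (the tensor shapes below are illustrative assumptions):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)      # query-key similarities
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)                # attention distribution
    return weights @ V, weights

Q = K = V = torch.randn(1, 5, 64)                          # self-attention: same source
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)                               # (1, 5, 64) (1, 5, 5)
```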
Walk through self attention -- Step 1
- We first transform the input
X into Q, K and V.
31
Source: https://jalammar.github.io/illustrated-transformer/
Walk through self attention -- Step 2
- Then perform the self
attention between Q, K and
V.
32
Source: https://jalammar.github.io/illustrated-transformer/
Walk through self attention -- Step 2 Example
33
Source: https://jalammar.github.io/illustrated-transformer/
Walk through self attention -- Step 3 -- Multihead
34
Source: https://jalammar.github.io/illustrated-transformer/
Why do we need
multi-head attention
anyway?
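A minimal sketch of multi-head self-attention (d_model = 512 with 8 heads matches the paper's base configuration; the rest is illustrative): each head projects the input into its own Q, K, V subspace, attends independently, and the concatenated head outputs are projected back.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)   # all heads' projections packed together
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # output projection after concatenation

    def forward(self, x):                         # x: (batch, seq, d_model)
        b, t, _ = x.shape
        # Project, then split into heads: (batch, heads, seq, d_k)
        q, k, v = (W(x).view(b, t, self.h, self.d_k).transpose(1, 2)
                   for W in (self.W_q, self.W_k, self.W_v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)      # each head attends separately
        out = torch.softmax(scores, dim=-1) @ v                     # (batch, heads, seq, d_k)
        out = out.transpose(1, 2).reshape(b, t, self.h * self.d_k)  # concatenate heads
        return self.W_o(out)

mha = MultiHeadSelfAttention()
print(mha(torch.randn(2, 10, 512)).shape)         # torch.Size([2, 10, 512])
```

One common answer to the question above is that separate heads can specialize in different relations (e.g., positional vs. syntactic patterns), which the probing reading later examines.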
Encoder as “memory” for decoder
Encoder: MultiHead(Q, K, V)
- Q: Are Q, K, V the same?
Decoder: MultiHead(Q, K, V)
35
Encoder as “memory” for decoder
Encoder: MultiHead(Q, K, V)
- Q: Are Q, K, V the same?
- Yes!
Decoder: MultiHead(Q, K, V)
- Q: Are Q, K, V the same?
36
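A single-head, unprojected sketch of why the encoder output acts as “memory” (shapes are illustrative assumptions): in the decoder's cross-attention, the queries come from the decoder states while the keys and values come from the encoder output.

```python
import math
import torch

d_model = 512
enc_out = torch.randn(1, 7, d_model)   # encoder output: the "memory" (7 source tokens)
dec_h   = torch.randn(1, 3, d_model)   # decoder states so far (3 target tokens)

# Cross-attention: Q from the decoder, K and V from the encoder.
scores  = dec_h @ enc_out.transpose(-2, -1) / math.sqrt(d_model)
weights = torch.softmax(scores, dim=-1)   # (1, 3, 7): each target position attends over the source
context = weights @ enc_out               # (1, 3, 512)
print(context.shape)
```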
Encoder as “memory” for decoder
37
Source: NUS CS4248 Natural Language Processing
Masking for the decoder
38
Source: NUS CS4248 Natural Language Processing
Reading 1: Probing attention heads
39
Source: Revealing the Dark Secrets of BERT