Representation Capacity
CSE665: Large Language Models
Outline
Section 1: N-Gram and MLP Models
Section 2: RNNs and LSTM Models
Section 3: Transformers
Section 4: Esoteric “Transformer” Architectures
Section 5: Towards Natural Language Understanding
2
N-Gram and MLP Models
3
Fill in the blank!
Q: Please pass the salt and pepper to ___
1) me
2) coffee
3) yes
4) refrigerator
In the first place, what are language models?
Language models assign probabilities to sequences of words
Example: Find the probability of a word w given some sample text history h
w = “the”, h = “its water is so transparent that”
Chain rule of probability: decompose the joint probability into conditional probabilities of each word given the previous words
P(“its water is so transparent that the”) = P(“its”) x P(“water” | “its”) x P(“is” | “its water”) x … x
P(“the” | “its water is so transparent that”)
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
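In general notation (a standard restatement of the chain rule, not taken from the slide):

```latex
P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
```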
bi-gram
Approximates the conditional probability of a word given its full history by using only
the conditional probability of the preceding word
P(“its water is so transparent that the”) ≈ P(“its”) x P(“water” | “its”) x P(“is” |
“water”) x … x P(“the” | “that”)
P(“the” | “its water is so transparent that”) ≈ P(“the” | “that”)
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
bi-gram
Approximates the conditional probability of a word given its full history by using only
the conditional probability of the preceding word
We assume we can predict the probability of a future unit without
looking far into the past (a Markov assumption)
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
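Written out in standard notation (not from the slide), the bigram (first-order Markov) approximation is:

```latex
P(w_i \mid w_1, \ldots, w_{i-1}) \;\approx\; P(w_i \mid w_{i-1})
```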
n-gram
Approximates the conditional probability of a word given its full history by using only the
conditional probability of the preceding (n-1) words
To estimate these probabilities, we use maximum likelihood estimation:
1) Getting counts of the n-grams from a given corpus
2) Normalizing the counts so they lie between 0 and 1
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
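A minimal Python sketch of these two steps for the bigram case (the toy corpus and names are illustrative): the MLE estimate is P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1}).

```python
from collections import Counter

corpus = "its water is so transparent that the water is clear".split()

# Step 1: get counts of unigrams and bigrams from the corpus
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

# Step 2: normalize bigram counts by the count of the preceding word
def bigram_prob(prev_word, word):
    """MLE estimate of P(word | prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("water", "is"))   # 1.0 in this toy corpus
print(bigram_prob("water", "the"))  # 0.0 -- the sparse-data problem discussed next
```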
Issues with “count-based” approximations
Language is a creative exercise: many different permutations of
words can express the same meaning.
The probability of an n-gram absent from the corpus is zero unless this is
explicitly dealt with (the sparse-data problem).
Large amount of memory required - for a language with a
vocabulary of V words and an n-gram language model, we would
need to store on the order of V^n values (e.g., V = 50,000 and n = 3 already gives ~1.25 x 10^14)
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
A neural probabilistic language model
1) Take into consideration words of similar meaning
2) Should be able to take into consideration “longer” contexts
without incurring large memory costs
Ex: “The cat is walking in the bedroom” should be similar to “A dog
was running in a room”
TL;DR: Instead of storing counts for all the permutations, let's learn an embedding of each token in the
vocab given the context.
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
A neural probabilistic language model
Main ideas:
1) Associating each word in the vocabulary with a word feature
vector
2) Expressing the joint probability function of word sequences in
terms of the word feature vectors
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
A neural probabilistic language model
Ex: “The cat is walking in the bedroom” should be similar to “A dog
was running in a room”
“the” should be similar to “a”
“bedroom” should be similar to “room”
“running” should be similar to “walking”
Hence, we should be able to generalize from “The cat is walking in
the bedroom” to “A dog was running in a room”, as similar words are
expected to have similar feature vectors
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
Neural network architecture proposed by Bengio et al., 2003
Embedding Layer
Hidden Layer
Probability Layer
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
A neural probabilistic language model
A neural probabilistic language model
Embedding Layer
A mapping from a word to a vector that describes it
Represented by a |V| x m matrix, where |V| represents the size of the
vocabulary and m represents the size of each feature vector (30 - 100 in Bengio et
al., 2003)
Embeddings here are trained via the task at hand (predicting the
next word given a context)
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
Hidden Layer
Transforms the input sequence of feature vectors and captures
contextual information
In the paper, a multi-layer perceptron was used, with hyperbolic
tangent (tanh) activations when a hidden layer is included
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
Probability Layer
Produces a probability distribution over the words in the vocabulary
through the use of the softmax function
The output is a |V|-dimensional vector, where the i-th entry represents P(w_t = i | context)
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
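A minimal PyTorch sketch of these three layers (the dimensions and names below are illustrative assumptions; the original paper also allows optional direct connections from the embeddings to the output, omitted here):

```python
import torch
import torch.nn as nn

class NPLM(nn.Module):
    """Sketch in the spirit of Bengio et al. (2003): embed the (n-1) context
    words, pass the concatenation through a tanh hidden layer, then softmax."""
    def __init__(self, vocab_size, emb_dim=60, context_size=3, hidden_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)                # embedding layer
        self.hidden = nn.Linear(context_size * emb_dim, hidden_dim)   # hidden layer
        self.out = nn.Linear(hidden_dim, vocab_size)                  # probability layer

    def forward(self, context_ids):                          # (batch, context_size)
        e = self.embed(context_ids).flatten(start_dim=1)     # concatenate feature vectors
        h = torch.tanh(self.hidden(e))                       # tanh hidden units
        return torch.log_softmax(self.out(h), dim=-1)        # log P(w_t | context)

model = NPLM(vocab_size=10_000)
log_probs = model(torch.randint(0, 10_000, (8, 3)))          # a batch of 8 contexts
print(log_probs.shape)                                       # torch.Size([8, 10000])
```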
Effectiveness of the model
Test perplexity improvement of about 24% compared to n-gram models
Able to take advantage of longer contexts (going from 2-gram to 4-gram contexts
benefitted the MLP approach, but not the n-gram approach)
Inclusion of the hyperbolic tangent functions as hidden units
improved the perplexity of a given model
[Bengio03]: Bengio et al. (2003) “A Neural Probabilistic Language Model”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
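For reference, perplexity (the metric behind the 24% figure above) is the inverse probability of the test set normalized by the number of words; lower is better (standard definition, not stated on the slide):

```latex
\mathrm{PP}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
= \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1})\right)
```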
Limitations of N-gram models
Not able to grasp relations longer than the window size (learning a 6-word
relation with a 5-gram neural model is not possible)
Cannot model “memory” in the network (an n-gram model only has
the context of the preceding (n-1) words).
The man went to the bank to deposit a check.
The children played by the river bank.
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
Need for Sequential modelling
In an N-gram model, each word has only one embedding; it does not take
context (memory) into account (see the sketch below).
[Kaduri22]: Kaduri (2022) “From N-grams to CodeX”
[Jurafsky23]: Jurafsky et al. (2023) “Speech and Language Processing”
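A tiny sketch of the problem (the vocabulary and dimensions are made up): a static embedding table returns the identical vector for “bank” in both sentences from the previous slide, so the representation cannot reflect context.

```python
import torch
import torch.nn as nn

vocab = {"bank": 0, "river": 1, "deposit": 2}
embed = nn.Embedding(len(vocab), 8)              # one fixed vector per word

# "bank" gets the same embedding whether it means a financial
# institution or a river bank -- context is not taken into account.
v_deposit = embed(torch.tensor(vocab["bank"]))   # ...went to the bank to deposit a check
v_river   = embed(torch.tensor(vocab["bank"]))   # ...played by the river bank
print(torch.equal(v_deposit, v_river))           # True
```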
Outline
Section 1: N-Gram and MLP Models
Section 2: RNNs and LSTM Models
Section 3: Transformers
Section 4: Esoteric “Transformer” Architectures
Section 5: Towards Natural Language Understanding
22
RNNs and LSTMs (SKIP)
23
Transformer
24
Agenda: Transformer
- What is a transformer?
- Encoder and decoder.
- Self attention.
- Probing
- Attention heads;
- Feedforward layers.
25
What is a transformer?
- Motivation
26
Source: Attention is all you need NIPS ‘17.
What is a transformer?
- Encoder (left) and the decoder
(right)
- Q2: What is the connection
between encoder and decoder?
- Q3: Which transformer
components are the following
models using?
- BERT (masked language
modeling) uses which
component?
- T5 (seq2seq)?
- GPT (text generation)?
27
Source: Attention is all you need NIPS ‘17.
Encoder
Components in encoder:
- Multi-head attention
- FF
- Add & Norm
28
Source: Attention is all you need NIPS ‘17.
Primarily for optimization; we skip them in class.
Recall attention
29
Source: NUS CS4248 Natural Language Processing
Self-attention in Transformer
- Q: Query,
- K: Key
- V: Values
- Motivation?
- Similar to dot-product attention.
- Scaling factor 1/√d_k for stable gradients.
30
Source: Attention is all you need NIPS ‘17.
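A minimal PyTorch sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V (the tensor shapes below are illustrative assumptions):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)      # query-key similarities
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)                # attention distribution
    return weights @ V, weights

Q = K = V = torch.randn(1, 5, 64)                          # self-attention: same source
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)                               # (1, 5, 64) (1, 5, 5)
```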
Walk through self attention -- Step 1
- We first transform the input
X into Q, K and V.
31
Source: https://jalammar.github.io/illustrated-transformer/
Walk through self attention -- Step 2
- Then perform the self
attention between Q, K and
V.
32
Source: https://jalammar.github.io/illustrated-transformer/
Walk through self attention -- Step 2 Example
33
Source: https://jalammar.github.io/illustrated-transformer/
Walk through self attention -- Step 3 -- Multihead
34
Source: https://jalammar.github.io/illustrated-transformer/
Why do we need
multi-head attention
anyway?
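A minimal sketch of multi-head self-attention (d_model = 512 with 8 heads matches the paper's base configuration; the rest is illustrative): each head projects the input into its own Q, K, V subspace, attends independently, and the concatenated head outputs are projected back.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)   # all heads' projections packed together
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # output projection after concatenation

    def forward(self, x):                         # x: (batch, seq, d_model)
        b, t, _ = x.shape
        # Project, then split into heads: (batch, heads, seq, d_k)
        q, k, v = (W(x).view(b, t, self.h, self.d_k).transpose(1, 2)
                   for W in (self.W_q, self.W_k, self.W_v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)      # each head attends separately
        out = torch.softmax(scores, dim=-1) @ v                     # (batch, heads, seq, d_k)
        out = out.transpose(1, 2).reshape(b, t, self.h * self.d_k)  # concatenate heads
        return self.W_o(out)

mha = MultiHeadSelfAttention()
print(mha(torch.randn(2, 10, 512)).shape)         # torch.Size([2, 10, 512])
```

One common answer to the question above is that separate heads can specialize in different relations (e.g., positional vs. syntactic patterns), which the probing reading later examines.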
Encoder as “memory” for decoder
Encoder: MultiHead(Q, K, V)
- Q: Are Q, K, V the same?
Decoder: MultiHead(Q, K, V)
35
Encoder as “memory” for decoder
Encoder: MultiHead(Q, K, V)
- Q: Are Q, K, V the same?
- Yes!
Decoder: MultiHead(Q, K, V)
- Q: Are Q, K, V the same?
36
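A single-head, unprojected sketch of why the encoder output acts as “memory” (shapes are illustrative assumptions): in the decoder's cross-attention, the queries come from the decoder states while the keys and values come from the encoder output.

```python
import math
import torch

d_model = 512
enc_out = torch.randn(1, 7, d_model)   # encoder output: the "memory" (7 source tokens)
dec_h   = torch.randn(1, 3, d_model)   # decoder states so far (3 target tokens)

# Cross-attention: Q from the decoder, K and V from the encoder.
scores  = dec_h @ enc_out.transpose(-2, -1) / math.sqrt(d_model)
weights = torch.softmax(scores, dim=-1)   # (1, 3, 7): each target position attends over the source
context = weights @ enc_out               # (1, 3, 512)
print(context.shape)
```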
Encoder as “memory” for decoder
37
Source: NUS CS4248 Natural Language Processing
Masking for the decoder
38
Source: NUS CS4248 Natural Language Processing
Reading 1: Probing attention heads
39
Source: Revealing the Dark Secrets of BERT