Interpretation of Pretrained Language Models
Chenyan Xiong
11-667
Disclaimer
No one really understands why language models work
Very limited theory and very limited empirical observation, especially at large scale
This lecture shares:
• Observations of, not causal explanations for, the behavior of LLMs
• Early attempts to interpret their abilities
• Useful intuitions and interesting thought experiments
Outline
What is captured in BERT?
Why pretrained models generalize?
What does in-context learning do?
Outline
What is captured in BERT?
• Attention patterns
• Probing capabilities captured in representations
Why pretrained models generalize?
What does in-context learning do?
BERT Attention Patterns
Restating the Transformer’s attention mechanism:
The new representation of position $i$ is the attention-weighted combination of the other positions’ values
• Higher $\alpha_{ij}$ → bigger contribution of position $j$ to position $i$
Attention from $i$ to $j$: $\alpha_{ij} = \dfrac{\exp(q_i \cdot k_j / \sqrt{d_k})}{\sum_t \exp(q_i \cdot k_t / \sqrt{d_k})}$
New representation of $i$: $o_i = \sum_j \alpha_{ij} v_j$
[1] Clark Et al. “What Does BERT Look At? An Analysis of BERT’s Attention.” BlackBoxNLP 2019
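To make the restated mechanism concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. It is an illustration of the formulas above, not BERT's actual multi-head implementation; the random inputs and the reuse of one matrix as Q, K, and V are placeholders.

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention over one sequence.

    Q, K: (seq_len, d_k); V: (seq_len, d_v).
    Returns the attention weights alpha (seq_len, seq_len) and outputs o (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # q_i . k_j / sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)    # softmax over j
    o = alpha @ V                                 # o_i = sum_j alpha_ij v_j
    return alpha, o

# toy example: 4 tokens, an 8-dimensional head
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
alpha, o = attention(x, x, x)   # using x as Q, K, V for illustration only
print(alpha.sum(axis=-1))       # each row sums to 1
```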
BERT Attention Patterns: Stats
Average Entropy of 𝛼𝑖𝑗
[1] Clark Et al. “What Does BERT Look At? An Analysis of BERT’s Attention.” BlackBoxNLP 2019
Figure 1: Entropy of BERT Attention Distributions [1]
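Figure 1 plots the entropy of each head's attention distributions. A small sketch of how such a statistic can be computed, assuming `alpha` holds one head's attention weights as in the previous sketch; the exact averaging over positions and heads in [1] may differ.

```python
import numpy as np

def mean_attention_entropy(alpha, eps=1e-12):
    """Average entropy (in nats) of the attention distributions in alpha.

    alpha: (seq_len, seq_len) attention weights for one head; each row is a
    distribution over the attended positions.
    """
    entropy_per_position = -(alpha * np.log(alpha + eps)).sum(axis=-1)
    return entropy_per_position.mean()

# e.g., reusing `alpha` from the previous sketch:
# print(mean_attention_entropy(alpha))
```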
BERT Attention Patterns: Stats
[1] Clark Et al. “What Does BERT Look At? An Analysis of BERT’s Attention.” BlackBoxNLP 2019
Figure 1: Entropy of BERT Attention Distributions [1]
High-entropy heads in lower layers:
• A bag-of-words-like mechanism
BERT Attention Patterns: Stats
[1] Clark Et al. “What Does BERT Look At? An Analysis of BERT’s Attention.” BlackBoxNLP 2019
Figure 1: Entropy of BERT Attention Distributions [1]
Lower entropy in middle layers:
• Starting to form certain patterns?
BERT Attention Patterns: Stats
[1] Clark Et al. “What Does BERT Look At? An Analysis of BERT’s Attention.” BlackBoxNLP 2019
Figure 1: Entropy of BERT Attention Distributions [1]
Rising entropy in deep layers:
• More global information?
BERT Attention Patterns: Common Patterns
Common Pattern 1: Broad attention
• Neural networks are hard to interpret
• Many signals are mixed together; hard to tell them apart
Figure 2: Attend Broadly (Left→Right) [1]
[1] Clark Et al. “What Does BERT Look At? An Analysis of BERT’s Attention.” BlackBoxNLP 2019
BERT Attention Patterns: Common Patterns
Figure 3: Attend to Next (Left→Right) [1]
Common Pattern 2: Attend to next token
• A reversed-RNN-style pattern
• A positional relation learned during pretraining
[1] Clark Et al. “What Does BERT Look At? An Analysis of BERT’s Attention.” BlackBoxNLP 2019
BERT Attention Patterns: Common Patterns
Figure 4: Attend to [SEP] and punctuations (Left→Right) [1]
Common Pattern 3: Attend to [SEP] and “.”
• Centralizes attention on specific tokens
• Its effect is unclear:
• Some consider it a no-op (“do nothing”) operation
• Some consider it an information hub
• Maybe a mix of both, at different heads
[1] Clark Et al. “What Does BERT Look At? An Analysis of BERT’s Attention.” BlackBoxNLP 2019
BERT Attention Patterns: Linguistic Examples
Figure 5: Objects Attend to their Verbs (Left→Right) [1]
[1] Clark Et al. “What Does BERT Look At? An Analysis of BERT’s Attention.” BlackBoxNLP 2019
BERT Attention Patterns: Linguistic Examples
Figure 6: Noun Modifiers Attend to their Noun (Left→Right) [1]
[1] Clark Et al. “What Does BERT Look At? An Analysis of BERT’s Attention.” BlackBoxNLP 2019
BERT Attention Patterns: Summaries
Many language phenomena are captured somewhere in the pretrained parameters
• Some attention heads correspond to linguistic relations
• Most of this is captured during pretraining and may not change much in fine-tuning
BERT Attention Patterns: Summaries
Many language phenomena are captured somewhere in the pretrained parameters
• Some attention heads correspond to linguistic relations
• Most of this is captured during pretraining and may not change much in fine-tuning
Practical Implications:
• Attention weights reflect the importance perceived by language models
• An effective way to gather feedback from LLMs (handy in later lectures)
Outline
What is captured in BERT?
• Attention patterns
• Probing capabilities captured in representations
Why pretrained models generalize?
What does in-context learning do?
Probing Pretraining Representations
Probing what is stored in the representations of pretrained models
[2] Tenney, Ian, et al. "What do you learn from context? probing for sentence structure in
contextualized word representations." ICLR 2019
Figure 7: Edge Probing Technique [2]
Probing Pretraining Representations
[2] Tenney, Ian, et al. "What do you learn from context? probing for sentence structure in
contextualized word representations." ICLR 2019
Figure 7: Edge Probing Technique [2]
Representations are used as static features
Mixing representations from layers:
$\mathbf{h}_t^{\text{mix}} = \sum_l w_l \, \mathbf{h}_t^{l}, \quad w_l = \mathrm{softmax}(a_l)$
• Weighted combination over layers $l$
• The combination weights $a_l$ are trained per task, together with the classification layer
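A hedged PyTorch sketch of this probing setup: frozen layer representations, a learned scalar mix, and a small MLP probe. The shapes, the single-token lookup, and the MLP sizes are illustrative simplifications of the edge-probing setup in [2], which pools over spans rather than single tokens.

```python
import torch
import torch.nn as nn

class ScalarMixProbe(nn.Module):
    """Sketch of an edge-probing-style classifier on frozen layer representations.

    h_all: (num_layers, seq_len, dim) hidden states from a frozen pretrained model.
    The probe learns per-layer logits a_l (softmax-normalized to w_l) and a small
    MLP on top of the mixed representation of a target token.
    """
    def __init__(self, num_layers, dim, num_classes):
        super().__init__()
        self.a = nn.Parameter(torch.zeros(num_layers))   # per-layer logits a_l
        self.mlp = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, num_classes))

    def forward(self, h_all, token_index):
        w = torch.softmax(self.a, dim=0)                  # w_l = softmax(a_l)
        h_mix = (w[:, None, None] * h_all).sum(dim=0)     # h_t^mix = sum_l w_l h_t^l
        return self.mlp(h_mix[token_index])               # simple classification head

# usage sketch with random stand-ins for frozen BERT-base features: 13 layers, 32 tokens, 768 dims
h_all = torch.randn(13, 32, 768)
probe = ScalarMixProbe(num_layers=13, dim=768, num_classes=5)
logits = probe(h_all, token_index=3)
print(logits.shape)  # torch.Size([5])
```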
Probing Pretraining Representations
Mixing representations from layers:
$\mathbf{h}_t^{\text{mix}} = \sum_l w_l \, \mathbf{h}_t^{l}, \quad w_l = \mathrm{softmax}(a_l)$
• Weighted combination over layers $l$
• The combination weights $a_l$ are trained per task, together with the classification layer
Simple classification to target labels:
If the representations perform well
• as static features,
• with a simple MLP classifier,
• on a language task,
then they encode the information required by that task.
[2] Tenney, Ian, et al. "What do you learn from context? probing for sentence structure in contextualized word representations." ICLR 2019
Figure 7: Edge Probing Technique [2]
Probing Pretraining Representations
[2] Tenney, Ian, et al. "What do you learn from context? probing for sentence structure in
contextualized word representations." ICLR 2019
Figure 7: Edge Probing Technique [2]
Mixing representations from layers:
$\mathbf{h}_t^{\text{mix}} = \sum_l w_l \, \mathbf{h}_t^{l}, \quad w_l = \mathrm{softmax}(a_l)$
Center-of-Gravity:
$E[l] = \sum_l l \cdot w_l$
• The expected layer conveying the information needed by the probing task
• Larger → the information sits at higher layers
Probing Pretraining Representations
[2] Tenney, Ian, et al. "What do you learn from context? probing for sentence structure in
contextualized word representations." ICLR 2019
Figure 7: Edge Probing Technique [2]
Mixing representations from layers:
$\mathbf{h}_t^{\text{mix}} = \sum_l w_l \, \mathbf{h}_t^{l}, \quad w_l = \mathrm{softmax}(a_l)$
Center-of-Gravity:
$E[l] = \sum_l l \cdot w_l$
• The expected layer conveying the information needed by the probing task
Expected Layer:
$\Delta_l = \mathrm{ProbeAcc}(0{:}l) - \mathrm{ProbeAcc}(0{:}l-1)$
$E[\Delta_l] = \dfrac{\sum_l l \cdot \Delta_l}{\sum_l \Delta_l}$
• $\Delta_l$: the benefit of adding layer $l$
• $E[\Delta_l]$: the expected layer at which the probing task is solved
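A small NumPy sketch of both statistics, with made-up mixing weights and cumulative probing scores; treating layer 0's Δ as its raw score (a baseline of 0) is an assumption for illustration.

```python
import numpy as np

def center_of_gravity(w):
    """E[l] = sum_l l * w_l, with layers indexed 0..L."""
    layers = np.arange(len(w))
    return float((layers * w).sum())

def expected_layer(probe_acc):
    """probe_acc[l] = probing accuracy using layers 0..l (cumulative scores).

    Delta_l is the gain from adding layer l; E[Delta_l] is the Delta-weighted
    average layer index.
    """
    probe_acc = np.asarray(probe_acc, dtype=float)
    delta = np.diff(probe_acc, prepend=0.0)   # Delta_l = acc(0:l) - acc(0:l-1); Delta_0 = acc(0:0)
    layers = np.arange(len(probe_acc))
    return float((layers * delta).sum() / delta.sum())

# illustrative numbers, not from the paper
w = np.array([0.05, 0.10, 0.20, 0.30, 0.20, 0.10, 0.05])   # softmax(a_l)
acc = [0.60, 0.70, 0.80, 0.86, 0.90, 0.91, 0.915]
print(center_of_gravity(w), expected_layer(acc))
```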
Probing Pretraining Representations: Probing Tasks
| Task | Description | Type |
|---|---|---|
| Part-of-Speech | Is the token a verb, noun, adjective, etc.? | Syntactic |
| Constituent Labeling | Is the span a noun phrase, verb phrase, etc.? | Syntactic |
| Dependency Labeling | Label the functional relationship between tokens, e.g., subject-object | Syntactic |
| Named Entity Labeling | Classify the entity type of a span, e.g., person, location, etc. | Syntactic/Semantic |
| Semantic Role Labeling | Label the predicate-argument structure of a sentence | Semantic |
| Coreference | Determine the reference of mentions to entities | Semantic |
| Semantic Proto-Role | Classify the detailed role of predicate-argument pairs | Semantic |
| Relation Classification | Predict real-world relations between entities | Semantic/Knowledge |
Table 1: Example Language Tasks to Probe BERT [2]
[2] Tenney, Ian, et al. "What do you learn from context? probing for sentence structure in
contextualized word representations." ICLR 2019
Probing Pretraining Representations: Probing Results
[2] Tenney, Ian, Dipanjan Das, and Ellie Pavlick. "BERT Rediscovers the Classical NLP Pipeline."
ACL. 2019.
| Probing Task | GPT-1 (base) | BERT (base) | BERT (Large) |
|---|---|---|---|
| Part-of-Speech | 95.0 | 96.7 | 96.9 |
| Constituent Labeling | 84.6 | 86.7 | 87.0 |
| Dependency Labeling | 94.1 | 85.1 | 95.4 |
| Named Entity Labeling | 92.5 | 96.2 | 96.5 |
| Semantic Role Labeling | 89.7 | 91.3 | 92.3 |
| Coreference | 86.3 | 90.2 | 91.4 |
| Semantic Proto-Role | 83.1 | 86.1 | 85.8 |
| Relation Classification | 81.0 | 82.0 | 82.4 |
| Macro Average | 88.3 | 89.3 | 91.0 |
Table 2: Overall Probing Results [2]
All very good numbers:
• The pretrained representations convey syntactic and semantic information
Probing Pretraining Representations: Across Layers
Figure 8: Edge Probing Results of BERT Large [3]. Rows list the probing tasks (Part-of-Speech, Constituent Labeling, Dependency Labeling, Named Entity Labeling, Semantic Role Labeling, Coreference, Semantic Proto-Role, Relation Classification); the x-axis is layer $l$; the Expected Layer and Center-of-Gravity are marked for each task.
[3] Tenney, Ian, Dipanjan Das, and Ellie Pavlick. "BERT Rediscovers the Classical NLP Pipeline." ACL. 2019.
Mixing representations from layers:
$\mathbf{h}_t^{\text{mix}} = \sum_l w_l \, \mathbf{h}_t^{l}, \quad w_l = \mathrm{softmax}(a_l)$
Center-of-Gravity:
$E[l] = \sum_l l \cdot w_l$
• The expected layer conveying the information needed by the probing task
Expected Layer:
$\Delta_l = \mathrm{ProbeAcc}(0{:}l) - \mathrm{ProbeAcc}(0{:}l-1), \quad E[\Delta_l] = \dfrac{\sum_l l \cdot \Delta_l}{\sum_l \Delta_l}$
• $\Delta_l$: the benefit of adding layer $l$
• $E[\Delta_l]$: the expected layer at which the probing task is solved
Probing Pretraining Representations: Across Layers
Figure 8: Edge Probing Results of BERT Large [3] (the probing tasks on the rows; Expected Layer and Center-of-Gravity over layer $l$).
[3] Tenney, Ian, Dipanjan Das, and Ellie Pavlick. "BERT Rediscovers the Classical NLP Pipeline." ACL. 2019.
Different tasks are tackled at different layers
• Syntactic tasks at lower layers
• Semantic/Knowledge tasks at higher ones
Probing Pretraining Representations: Across Training Steps
[4] Liu, et al. "Probing Across Time: What Does RoBERTa Know and When?."
EMNLP 2021.
Figure 9: Linguistics Task Probing
at RoBERTa Pretraining Steps [4].
Example Linguistic Tasks:
• Part-of-Speech
• Named Entity Labeling
• Syntactic Chunking
Probing Pretraining Representations: Across Training Steps
[4] Liu, et al. "Probing Across Time: What Does RoBERTa Know and When?."
EMNLP 2021.
Figure 10: Factual/Common Sense Task Probing
at RoBERTa Pretraining Steps [4].
Example Factual/Commonsense Tasks:
• SQuAD
• ConceptNet
• Google Relation Extraction
Probing Pretraining Representations: Across Training Steps
[4] Liu, et al. "Probing Across Time: What Does RoBERTa Know and When?."
EMNLP 2021.
Figure 11: Reasoning Task Probing
at RoBERTa Pretraining Steps [4].
Example Reasoning Tasks:
• Taxonomy Conjunction
• Multi-Hop Composition
• Object Comparison
Probing Pretraining Representations: Across Training Steps
• Tasks of different conceptual difficulty are captured at different rates
• Emergent improvements
• Certain tasks require a certain scale
Figure 11: Probing at Pretraining steps in Linguistic (left), Factual/Commonsense (middle), and Reasoning (right) tasks [4]
[4] Liu, et al. "Probing Across Time: What Does RoBERTa Know and When?."
EMNLP 2021.
Probing Pretraining Representations: Summary
From an observational point of view:
• Some attention patterns are intuitive
• Pretrained representations convey strong language information
• Different tasks are captured at different layers and different steps
• And the conceptual difficulty of tasks aligns with where & when they are captured
Probing Pretraining Representations: Summary
From an observational point of view:
• Some attention patterns are intuitive
• Pretrained representations convey strong language information
• Different tasks are captured at different layers and different steps
• And the conceptual difficulty of tasks aligns with where & when they are captured
It is tempting to think language models build up language semantics from the ground up:
Syntactic → Semantic → Factual → Reasoning → General Intelligence
• Like a classic NLP pipeline
• Like how human brains learn natural language
Probing Pretraining Representations: Summary
From an observational point of view:
• Some attention patterns are intuitive
• Pretrained representations convey strong language information
• Different tasks are captured at different layers and different steps
• And the conceptual difficulty of tasks aligns with where & when they are captured
Practical implications:
• Efficient inference by only using what is needed: early exit, sparsity, distillation, etc.
It is tempting to think language models build up language semantics from the ground up:
Syntactic → Semantic → Factual → Reasoning → General Intelligence
• Like a classic NLP pipeline
• Like how human brains learn natural language
But:
• Classic NLP tasks are not really solved ground up; the best systems are often more direct and straightforward
• We really do not know how human brains work, perhaps even less than we know about how LLMs work
Outline
What is captured in BERT?
Why pretrained models generalize?
• Loss landscapes
• Implicit bias of language models
What does in-context learning do?
Understanding Generalization Ability: Overview
Why do pretrained models generalize to so many fine-tuning tasks?
• Even on tasks with sufficient supervised labels
Why do larger models and longer pretraining improve generalization?
• In statistical machine learning, a more complex model plus exhaustive training is a recipe for overfitting
• But these are exactly the core advantages of pretrained models
Visualization of Loss Landscape
Plot the loss function around a model parameter $\theta$
• Challenge: $\theta$ is extremely high-dimensional
Approximation: plot the loss landscape of $\theta$ towards two other parameter settings $\theta_1$ and $\theta_2$ [5]
$f(\alpha, \beta) = \mathrm{loss}\big(\theta + \alpha(\theta_1 - \theta) + \beta(\theta_2 - \theta)\big)$
• A 2D plot along the axes $\alpha$ and $\beta$ of the linear interpolation
[5] Li, et al. "Visualizing the loss landscape of neural nets.“
NeurIPS 2018.
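A minimal sketch of evaluating this 2D slice, assuming a placeholder `loss_fn` that maps a flat parameter vector to a scalar; note that [5] additionally uses filter-normalized random directions, which this sketch omits.

```python
import numpy as np

def landscape_grid(loss_fn, theta, theta1, theta2, half_range=1.0, steps=21):
    """Evaluate f(a, b) = loss(theta + a*(theta1 - theta) + b*(theta2 - theta)) on a grid.

    loss_fn: callable from a flat parameter vector to a scalar loss (placeholder).
    theta, theta1, theta2: flat numpy parameter vectors of the same shape.
    Returns (alphas, betas, Z) with Z[i, j] = f(alphas[i], betas[j]).
    """
    d1, d2 = theta1 - theta, theta2 - theta
    alphas = np.linspace(-half_range, half_range, steps)
    betas = np.linspace(-half_range, half_range, steps)
    Z = np.empty((steps, steps))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            Z[i, j] = loss_fn(theta + a * d1 + b * d2)
    return alphas, betas, Z

# toy usage: a quadratic "loss" around a random optimum
rng = np.random.default_rng(0)
opt = rng.normal(size=10)
loss_fn = lambda p: float(((p - opt) ** 2).sum())
theta, theta1, theta2 = rng.normal(size=(3, 10))
alphas, betas, Z = landscape_grid(loss_fn, theta, theta1, theta2)
print(Z.shape)  # (21, 21); plot with, e.g., matplotlib's contourf
```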
Visualization of Loss Landscape
Plot the loss function around a model parameter $\theta$
• Challenge: $\theta$ is extremely high-dimensional
Approximation: plot the loss landscape of $\theta$ towards two other parameter settings $\theta_1$ and $\theta_2$ [5]
$f(\alpha, \beta) = \mathrm{loss}\big(\theta + \alpha(\theta_1 - \theta) + \beta(\theta_2 - \theta)\big)$
• A 2D plot along the axes $\alpha$ and $\beta$ of the linear interpolation
[5] Li, et al. “Visualizing the loss landscape of neural nets.” NeurIPS 2018.
Figure 12: A sharp loss landscape and a smooth loss landscape [5]
Visualization of Loss Landscape: BERT
BERT landscape in finetuning [6]
$f(\alpha, \beta) = \mathrm{loss}\big(\theta + \alpha(\theta_1 - \theta) + \beta(\theta_2 - \theta)\big)$
• $\theta$: the starting parameters of fine-tuning, either pretrained or randomly initialized
• $\theta_1$: the fine-tuned parameters for this task
• $\theta_2$: the fine-tuned parameters for another task, so that the direction is meaningful
[6] Hao, Yaru, et al. "Visualizing and Understanding the Effectiveness of BERT."
EMNLP 2019.
Visualization of Loss Landscape: BERT
BERT landscape in finetuning [6]
$f(\alpha, \beta) = \mathrm{loss}\big(\theta + \alpha(\theta_1 - \theta) + \beta(\theta_2 - \theta)\big)$
• $\theta$: the starting parameters of fine-tuning, either pretrained or randomly initialized
• $\theta_1$: the fine-tuned parameters for this task
• $\theta_2$: the fine-tuned parameters for another task, so that the direction is meaningful
[6] Hao, et al. "Visualizing and Understanding the Effectiveness of BERT."
EMNLP 2019.
Figure 13: Loss landscape of finetuning MNLI from random or pretrained BERT [6]
Visualization of Loss Landscape: BERT
BERT landscape in finetuning [6]
$f(\alpha, \beta) = \mathrm{loss}\big(\theta + \alpha(\theta_1 - \theta) + \beta(\theta_2 - \theta)\big)$
• $\theta$: the starting parameters of fine-tuning, either pretrained or randomly initialized
• $\theta_1$: the fine-tuned parameters for this task
• $\theta_2$: the fine-tuned parameters for another task, so that the direction is meaningful
[6] Hao, et al. "Visualizing and Understanding the Effectiveness of BERT."
EMNLP 2019.
Figure 13: Loss landscape of fine-tuning MNLI from random (left) or pretrained (right) BERT [6]
Visualization of Loss Landscape: BERT
Plot the optimization path: project the checkpoints $\theta'$ at different steps onto the loss landscape
[6] Hao, Yaru, et al. "Visualizing and Understanding the Effectiveness of BERT."
EMNLP 2019.
Figure 14: Optimization Trajectory when finetuning MNLI from random (left) and pretrained (right) BERT [6]
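One way to obtain such a projection is to solve for the least-squares coordinates (α, β) of each checkpoint in the plane spanned by θ₁ − θ and θ₂ − θ; the sketch below makes that assumption and is not necessarily the exact procedure in [6].

```python
import numpy as np

def project_checkpoint(theta_ckpt, theta, theta1, theta2):
    """Return (alpha, beta) such that theta + alpha*(theta1 - theta) + beta*(theta2 - theta)
    is the least-squares approximation of theta_ckpt within that 2D plane."""
    D = np.stack([theta1 - theta, theta2 - theta], axis=1)   # (num_params, 2) basis
    coords, *_ = np.linalg.lstsq(D, theta_ckpt - theta, rcond=None)
    return coords  # [alpha, beta]

# toy usage with random vectors standing in for flattened model parameters
rng = np.random.default_rng(1)
theta, theta1, theta2 = rng.normal(size=(3, 1000))
ckpt = theta + 0.3 * (theta1 - theta) + 0.7 * (theta2 - theta) + 0.01 * rng.normal(size=1000)
print(project_checkpoint(ckpt, theta, theta1, theta2))  # approximately [0.3, 0.7]
```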
Outline
What is captured in BERT?
Why pretrained models generalize?
• Loss landscapes
• Implicit bias of language models
What does in-context learning do?
Inductive Bias of Language Models: Pretraining Longer
[7] Liu, Hong, et al. "Same Pre-training Loss, Better Downstream: Implicit Bias Matters for
Language Models." ICML 2023.
Figure 15: Probing Performances versus Pretraining Loss of a 25M Parameter BERT [7]
Inductive Bias of Language Models: Pretraining Longer
[7] Liu, Hong, et al. "Same Pre-training Loss, Better Downstream: Implicit Bias Matters for
Language Models." ICML 2023.
Figure 15: Probing Performances versus Pretraining Loss of a 25M Parameter BERT [7]
Signs of overfitting and unstable learning,
yet smoothly improving downstream generalization
Inductive Bias of Language Models: Pretraining Longer
[7] Liu, Hong, et al. "Same Pre-training Loss, Better Downstream: Implicit Bias Matters for
Language Models." ICML 2023.
Figure 15: Probing Performances versus Pretraining Loss of a 25M Parameter BERT [7]
Trace of the loss Hessian: a reflection of loss flatness
Same pretraining loss, but a flatter loss shape
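The Hessian trace can be estimated without forming the Hessian, for example with Hutchinson's estimator and Hessian-vector products; here is a PyTorch sketch on a toy quadratic loss (the exact measurement setup in [7] may differ).

```python
import torch

def hessian_trace(loss, params, num_samples=10):
    """Hutchinson estimate of tr(H) for the Hessian of `loss` w.r.t. `params`.

    loss: scalar tensor with a retained graph; params: tensors with requires_grad=True.
    E[v^T H v] = tr(H) for Rademacher v, estimated by averaging over samples.
    """
    grads = torch.autograd.grad(loss, params, create_graph=True)
    trace_est = 0.0
    for _ in range(num_samples):
        vs = [torch.sign(torch.randn_like(p)) for p in params]        # Rademacher +-1 vectors
        gv = sum((g * v).sum() for g, v in zip(grads, vs))
        hvs = torch.autograd.grad(gv, params, retain_graph=True)      # Hessian-vector products H v
        trace_est += sum((h * v).sum().item() for h, v in zip(hvs, vs))
    return trace_est / num_samples

# toy check: loss = 0.5 * ||W x||^2, whose exact Hessian trace is W.shape[0] * ||x||^2
W = torch.randn(5, 5, requires_grad=True)
x = torch.randn(5)
loss = 0.5 * (W @ x).pow(2).sum()
print(hessian_trace(loss, [W]), (W.shape[0] * x.pow(2).sum()).item())  # estimate vs. exact
```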
Inductive Bias of Language Models: Larger Models
Figure 16: Illustration of Optimization Trajectory [7]
[7] Liu, Hong, et al. "Same Pre-training Loss, Better Downstream: Implicit Bias Matters for
Language Models." ICML 2023.
Inductive Bias of Language Models: Larger Models
Figure 16: Illustration of Optimization Trajectory [7]
[7] Liu, Hong, et al. "Same Pre-training Loss, Better Downstream: Implicit Bias Matters for
Language Models." ICML 2023.
Figure 16 illustrates the trajectories of a small model and a large model.
Larger models can reach flatter optima:
1. Larger Transformers have a bigger solution space
2. That space covers the solutions of smaller Transformers
3. The optimizer keeps seeking flatter optima, even after reaching the same loss
Why Pretrained Models Generalize: Summary
Many observations suggest pretrained models lead to flatter optima:
• A better starting point
• A better loss shape
• Pretraining longer and using larger Transformers lead to more flatness
Why Pretrained Models Generalize: Summary
Many observations suggest pretrained models lead to flatter optima:
• A better starting point
• A better loss shape
• Pretraining longer and using larger Transformers lead to more flatness
Why does flatness matter?
• Much empirical evidence connects it to generalization ability
• Intuitively, flatter optima are more robust to data variation/noise
• Theoretically, it has been argued to lead to simpler network solutions
• Hochreiter, S. and Schmidhuber, J. Flat Minima. Neural Computation 1997
Why Pretrained Models Generalize: Summary
Many observations suggest pretrained models lead to flatter optima:
• A better starting point
• A better loss shape
• Pretraining longer and using larger Transformers lead to more flatness
Why does flatness matter?
• Much empirical evidence connects it to generalization ability
• Intuitively, flatter optima are more robust to data variation/noise
• Theoretically, it has been argued to lead to simpler network solutions
• Hochreiter, S. and Schmidhuber, J. Flat Minima. Neural Computation 1997
Why do pretrained models prefer flatter optima?
• An inductive bias of the optimizer, the architecture, the pretraining loss, or their combination?
• Much more research is required
Outline
What is captured in BERT?
Why pretrained models generalize?
What does in-context learning do?
• Semantic Prior or Input-Label Mapping
• Connection with Gradient Descent
In-Context Learning Interpretation: Observations
Two sources of information:
• Semantic knowledge captured in LLM
• In-context training signals (input-label mapping)
[8] Wei, et al. "Larger language models do in-context learning differently." arXiv 2023.
Figure 17: Regular In-Context Learning [8]
In-Context Learning Interpretation: Observations
Two sources of information:
• Semantic knowledge captured in LLM
• In-context training signals (input-label mapping)
Which one works? Mixed observations:
• Random in-context labels work
→ Existing semantic knowledge
• Order of in-context data matters
→ In-context training signals
[8] Wei, et al. "Larger language models do in-context learning differently." arXiv 2023.
Figure 17: Regular In-Context Learning [8]
In-Context Learning Interpretation: Random Label Test
Randomly flip X% of the binary in-context labels
• The more flips (higher X), the more a model must rely on existing knowledge to make the correct (original) prediction
Behavior of models as X% grows:
• Models that are affected less rely more on internal knowledge
• Models that are affected more learn more from the in-context examples
[8] Wei, et al. "Larger language models do in-context learning differently." arXiv 2023.
Figure 18: Flipped-Label In-Context Learning [8]
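A sketch of how such flipped-label prompts can be constructed for a binary sentiment-style task; the texts, label words, and prompt format here are made up for illustration and are not the templates used in [8].

```python
import random

def flipped_label_prompt(examples, query, flip_fraction=0.5, seed=0):
    """Build an in-context prompt where a fraction of binary labels is flipped.

    examples: list of (text, label) pairs with labels in {"positive", "negative"}.
    Higher flip_fraction means the in-context input-label mapping disagrees more
    with the model's semantic prior.
    """
    rng = random.Random(seed)
    flip = {"positive": "negative", "negative": "positive"}
    lines = []
    for text, label in examples:
        shown = flip[label] if rng.random() < flip_fraction else label
        lines.append(f"Review: {text}\nSentiment: {shown}")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

demo = [("A delightful movie.", "positive"), ("Terrible acting.", "negative")]
print(flipped_label_prompt(demo, "I loved every minute.", flip_fraction=1.0))
```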
In-Context Learning Interpretation: Random Label Test
Randomly flip X% of the binary in-context labels
• The more flips (higher X), the more a model must rely on existing knowledge to make the correct (original) prediction
Behavior of models as X% grows:
• Models that are affected less rely more on internal knowledge
• Models that are affected more learn more from the in-context examples
Question:
• Do larger LMs care more, or less, about bigger X?
[8] Wei, et al. "Larger language models do in-context learning differently." arXiv 2023.
Figure 18: Flipped-Label In-Context Learning [8]
In-Context Learning Interpretation: Random Label Test
Larger models perform better with 0% flipped labels
• But are much more sensitive to label flips
[8] Wei, et al. "Larger language models do in-context learning differently." arXiv 2023.
Figure 19: PaLM and GPT in Flipped-Label In-Context Learning,
binary classification with 16 examples per class [8]
In-Context Learning Interpretation: Random Label Test
Larger models perform better with 0% flipped labels
• But they are much more sensitive to label flips
The strongest models can even over-correct
• With merely 32 in-context examples
There must be some learning in in-context learning
• Especially in larger LMs
[8] Wei, et al. "Larger language models do in-context learning differently." arXiv 2023.
Figure 19: PaLM and GPT in Flipped-Label In-Context Learning,
binary classification with 16 examples per class [8]
In-Context Learning Interpretation: No Semantic Test
Figure 20: In-Context Learning with Semantically-Unrelated
Label Terms [8]
[8] Wei, et al. "Larger language models do in-context learning differently." arXiv 2023.
Use semantically-unrelated label terms
• E.g., foo / bar instead of positive / negative
• Models have to learn more from the in-context examples
Behavior of models with unrelated labels:
• Models that still perform well learn more in-context
• Models that are hurt more rely more on existing knowledge
In-Context Learning Interpretation: No Semantic Test
[8] Wei, et al. "Larger language models do in-context learning differently." arXiv 2023.
Figure 21: In-Context Learning Accuracy with Semantically-
Unrelated Labels versus Related Labels [8]
Larger models work better with unrelated labels
• They learn in-context label mappings better
Smaller models are hurt more by unrelated labels
• They rely more on their prior knowledge
In-Context Learning Interpretation: No Semantic Test
Figure 22: In-Context Learning with Different Number of
Semantically-Unrelated Labels [8]
[8] Wei, et al. "Larger language models do in-context learning differently." arXiv 2023.
Larger models better leverage in-context examples
• The advantage is more pronounced with more examples
Not much better than random with only two examples
• Confirms that the unrelated labels are not aligned with existing semantic knowledge
In-Context Learning Interpretation: Observations
Smaller LMs rely more on existing knowledge and are less effective at learning in-context
• Less sensitive to flipped labels
• Hard for them to capture semantically-unrelated input-label mappings
• Random labels are unlikely to change the output of small LMs
Larger LMs are more effective at learning from in-context examples
• They can reverse their semantic prior to predict flipped labels
• They can learn semantically-unrelated label mappings
• They better utilize more in-context examples
In-Context Learning Interpretation: Observations
Smaller LMs rely more on existing knowledge and are less effective at learning in-context
• Less sensitive to flipped labels
• Hard for them to capture semantically-unrelated input-label mappings
• Random labels are unlikely to change the output of small LMs
Larger LMs are more effective at learning from in-context examples
• They can reverse their semantic prior to predict flipped labels
• They can learn semantically-unrelated label mappings
• They better utilize more in-context examples
Why? How can LLMs learn from in-context examples?
Outline
What is captured in BERT?
Why pretrained models generalize?
What does in-context learning do?
• Semantic Prior or Input-Label Mapping
• Connection with Gradient Descent
Learning in In-Context Learning: Gradient Construction
One can manually construct a Transformer (𝑇𝐹GD) that performs gradient descent through in-context learning
• Its prediction given in-context examples (𝑋𝑘, 𝑌𝑘) equals that of a reference model after performing SGD on (𝑋𝑘, 𝑌𝑘)
• Its prediction change from adding a new (𝑥, 𝑦) is similar to the reference model’s change after one SGD step on (𝑥, 𝑦)
[9] Oswald, et al. “Transformers Learn In-Context by Gradient Descent." ICML 2023.
Learning in In-Context Learning: Gradient Construction
One can manually construct a Transformer (𝑇𝐹GD) that performs gradient descent through in-context learning
• Its prediction given in-context examples (𝑋𝑘, 𝑌𝑘) equals that of a reference model after performing SGD on (𝑋𝑘, 𝑌𝑘)
• Its prediction change from adding a new (𝑥, 𝑦) is similar to the reference model’s change after one SGD step on (𝑥, 𝑦)
Currently this construction works under the following conditions [9]:
• Linear self-attention, no softmax
• The reference model is a simple regression model, such as linear regression
• Linear self-attention can be stacked with MLPs but nothing more, i.e., no layer norm, etc.
[9] Oswald, et al. “Transformers Learn In-Context by Gradient Descent." ICML 2023.
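A tiny numeric sketch of the core identity under these conditions, in the spirit of [9]: with hand-picked (not learned) projections where keys/queries read the inputs and values read the scaled targets, a single linear-attention readout at the query token equals the prediction of a linear regression model after one gradient step from zero. The dimensions and learning rate are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eta = 4, 16, 0.5                       # input dim, # in-context examples, learning rate

w_true = rng.normal(size=d)                  # task sampled for in-context learning
X = rng.normal(size=(n, d))                  # in-context inputs x_j
y = X @ w_true                               # in-context targets y_j
x_q = rng.normal(size=d)                     # query input

# Reference model: linear regression w, one gradient step from w = 0 on
# L(w) = 1/(2n) * sum_j (w.x_j - y_j)^2  =>  w <- (eta/n) * sum_j y_j x_j
w_after_one_step = (eta / n) * (y @ X)
pred_gd = w_after_one_step @ x_q

# Linear self-attention view at the query token: no softmax, keys/queries read
# the x part of each token, values read the y part scaled by eta/n.
attn_scores = X @ x_q                         # q . k_j = x_q . x_j
pred_attention = (eta / n) * attn_scores @ y  # sum_j (x_q . x_j) * (eta/n) * y_j

print(pred_gd, pred_attention)                # identical by construction
```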
Learning in In-Context Learning: Gradient Construction
Detailed mathematical construction can be found in Oswald et al. 2023 [9].
Intuitively:
• Self-attention is a high-capacity function and can approximate many mathematical operations
• The reference model (the one that performs SGD) is a simple linear regression model
• Lots of non-linearities are removed to facilitate the construction
[9] Oswald, et al. “Transformers Learn In-Context by Gradient Descent." ICML 2023.
Learning in In-Context Learning: Gradient Construction
Detailed mathematical construction can be found in Oswald et al. 2023 [9].
Intuitively:
• Self-attention is a high-capacity function and can approximate many mathematical operations
• The reference model (the one that performs SGD) is a simple linear regression model
• Lots of non-linearities are removed to facilitate the construction
A very toy setup, but a good thought process and a starting point for understanding complicated LLMs
• Similar assumptions are often made in current deep learning theory research
The gradient descent Transformer 𝑇𝐹GD learns in-context by gradient descent, by construction
[9] Oswald, et al. “Transformers Learn In-Context by Gradient Descent." ICML 2023.
Learning in In-Context Learning: Trained Transformer
𝑇𝐹GD is constructed but not learned
• A constructed measurement target
One can train the toy Transformer 𝑇𝐹Train in the same in-context learning setup
• E.g., to perform a linear regression task with in-context examples
[9] Oswald, et al. “Transformers Learn In-Context by Gradient Descent." ICML 2023.
Learning in In-Context Learning: Comparison
𝑇𝐹GD is constructed but not learned
• A constructed measurement target
One can train the toy Transformer 𝑇𝐹Train in the same in-context learning setup
• E.g., to perform a linear regression task with in-context examples
[9] Oswald, et al. “Transformers Learn In-Context by Gradient Descent." ICML 2023.
Figure 23: Comparison of constructed 𝑇𝐹GD and Trained 𝑇𝐹Train. [9]
The trained Transformer matches the constructed gradient descent Transformer
• Nearly identical in:
• prediction L2 difference
• model sensitivity cosine similarity
• model sensitivity L2 difference
Learning in In-Context Learning: Comparison
𝑇𝐹GD is constructed but not learned
• A constructed measurement target
One can train the toy Transformer 𝑇𝐹Train in the same in-context learning setup
• E.g., to perform a linear regression task with in-context examples
[9] Oswald, et al. “Transformers Learn In-Context by Gradient Descent." ICML 2023.
Figure 23: Comparison of constructed 𝑇𝐹GD and Trained 𝑇𝐹Train. [9]
The trained Transformer matches the constructed gradient descent Transformer
• Nearly identical in:
• prediction L2 difference
• model sensitivity cosine similarity
• model sensitivity L2 difference
Transformers (with strong assumptions
and simplifications) learn in-context by
gradient descent (of a linear regression
model)
Learning in In-Context Learning: Multi-Layer Transformer
Comparing the constructed and learned Transformers in a multi-layer setting
Figure 24: Two-layer 𝑇𝐹GD versus 𝑇𝐹Train. [9] Figure 25: Five-layer 𝑇𝐹GD versus 𝑇𝐹Train. [9]
Learning in In-Context Learning: Multi-Layer Transformer
Comparing the constructed and learned Transformers in a multi-layer setting:
• The learned Transformer outperforms the constructed 𝑇𝐹GD
• An upgraded 𝑇𝐹GD with a manually tuned data transformation matches it better
• The divergence increases with deeper networks (still only five layers)
• But the similarity between in-context learning and gradient descent remains remarkable
Figure 24: Two-layer 𝑇𝐹GD versus 𝑇𝐹Train. [9] Figure 25: Five-layer 𝑇𝐹GD versus 𝑇𝐹Train. [9]
Learning in In-Context Learning: Theory versus Empirical
Empirical observations:
• Larger Transformers learn in-context better
• More in-context examples help larger models more
• Smaller Transformers rely more on existing semantic priors
Theory:
• Transformers perform one gradient step per layer
• And per in-context example
• Smaller models have fewer gradient steps built in
Assumptions:
• Linear-attention + MLP Transformer
• Simple regression reference model
• Shallow networks
In-Context Learning Interpretation: Summary
Various solid empirical evidence shows that:
• Larger Transformers do learn in-context
• In-context learning ability correlates with model scale
Theoretical connections are built between in-context learning and gradient descent
• Good intuitions
• One way to make sense of in-context learning
In-Context Learning Interpretation: Discussion
Like much learning theory that is not yet finished:
• This interpretation is more for our understanding and inspiration
• Strong assumptions are introduced to make the theory work
Personal views:
• In-context learning is different from SGD and is more powerful in some scenarios
• Connecting with existing, well-known techniques is a good starting point
• Eventually researchers will develop new theoretical frameworks to explain the amazing capabilities of LLMs
Outline
What is captured in BERT?
• Attention patterns
• Probing capabilities captured in representations
Why pretrained models generalize?
• Loss landscapes
• Implicit bias of language models
What does in-context learning do?
• Semantic Prior or Input-Label Mapping
• Connection with Gradient Descent
Quiz: Why does the order of in-context examples matter?
References: BERTology
• Clark, Kevin, et al. "What does bert look at? an analysis of bert's attention." arXiv preprint arXiv:1906.04341
(2019).
• Tenney, Ian, Dipanjan Das, and Ellie Pavlick. "BERT rediscovers the classical NLP pipeline." arXiv preprint
arXiv:1905.05950 (2019).
• Htut, Phu Mon, et al. "Do attention heads in BERT track syntactic dependencies?." arXiv preprint
arXiv:1911.12246 (2019).
• Liu, Leo Z., et al. "Probing across time: What does RoBERTa know and when?." arXiv preprint arXiv:2104.07885
(2021).
• Tenney, Ian, et al. "What do you learn from context? probing for sentence structure in contextualized word
representations." arXiv preprint arXiv:1905.06316 (2019).
• Rogers, Anna, Olga Kovaleva, and Anna Rumshisky. "A primer in BERTology: What we know about how BERT
works." Transactions of the Association for Computational Linguistics 8 (2021): 842-866.
• Carlini, Nicholas, et al. "Extracting Training Data from Large Language Models." USENIX Security Symposium. Vol.
6. 2021.
• Carlini, Nicholas, et al. "Quantifying memorization across neural language models." arXiv preprint
arXiv:2202.07646 (2022).
• Izacard, Gautier, and Edouard Grave. "Distilling knowledge from reader to retriever for question answering."
arXiv preprint arXiv:2012.04584 (2020).
References: Optimization
• Erhan, Dumitru, et al. "The difficulty of training deep architectures and the effect of unsupervised pre-training."
Artificial Intelligence and Statistics. PMLR, 2009.
• Li, Hao, et al. "Visualizing the loss landscape of neural nets." Advances in neural information processing systems
31 (2018).
• Hao, Yaru, et al. "Visualizing and understanding the effectiveness of BERT." arXiv preprint arXiv:1908.05620
(2019).
• Liu, Hong, et al. "Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models." arXiv
preprint arXiv:2210.14199 (2022).
• Chiang, Ping-yeh, et al. "Loss Landscapes are All You Need: Neural Network Generalization Can Be Explained
Without the Implicit Bias of Gradient Descent." The Eleventh International Conference on Learning
Representations. 2023.
Probing Pretraining Representations: Across Layers
Mixing representations from multiple layers:
$\mathbf{h}_t^{\text{mix}} = \sum_l s_l \, \mathbf{h}_t^{l}, \quad s_l = \mathrm{softmax}(\alpha_l)$
Definition: Center-of-Gravity
$E[l] = \sum_l l \cdot s_l$
• The expected layer conveying the information needed by the probing task
• Larger Center-of-Gravity → the needed information is captured at higher layers
Definition: Expected Layer
$\Delta_l = \mathrm{ProbingScore}(0{:}l) - \mathrm{ProbingScore}(0{:}l-1)$
$E[\Delta_l] = \dfrac{\sum_l l \cdot \Delta_l}{\sum_l \Delta_l}$
• $\Delta_l$: the benefit of adding layer $l$ to the mix
• $E[\Delta_l]$: the expected layer at which the probing task is resolved
[3] Tenney, Ian, Dipanjan Das, and Ellie Pavlick. "BERT Rediscovers the Classical NLP Pipeline."
ACL. 2019.
In-Context Learning Interpretation: Summary
Various solid empirical evidence shows that:
• Larger Transformers do learn in-context
• In-context learning ability correlates with model scale
Theoretical connections are built between in-context learning and gradient descent
• Good intuitions
• One way to make sense of in-context learning
• Unfortunately, very strong assumptions are introduced to establish the connection