Building Blocks of Modern LLMs 2:
Pretraining Tasks
Chenyan Xiong
11-667
08/31/2023
Pretraining Tasks
Pretraining and Language Modeling
Pre-training: An unsupervised learning phase before traditional supervised learning
• Original goal: provide better initialization points for supervised training
Language modeling: Predict a part of a given language piece (target) using the rest (context)
• A classic task in NLP and related fields for modeling how humans use natural language
Pretraining and Language Modeling
Why language modeling as pretraining task?
• Infinite data, far more than current computing systems can consume
• Beyond the trillions of web pages already processed
• Much more continues to be discovered
• Language, a main carrier of human knowledge
• We learn, communicate, and invent through language
• Other modalities often centered around language
• Not all tasks need language, but one could debate whether such tasks constitute "human intelligence"
• Many real-world applications are centered around language
• Search, machine translation, question answering, writing assistance, etc.
Autoregressive Language Modeling
Classic language modeling: Given previous words, predict the next word
• Let $X = \{x_1, \dots, x_t, \dots, x_n\}$ be a text sequence of $n$ tokens; the standard language modeling objective is to maximize the likelihood:
$$L_{\text{lm}}(X) = \sum_t \log p(x_t \mid x_{t-k:t-1}; \Theta)$$
• Where:
• $x_t$: the $t$-th token, the prediction target
• $x_{t-k:t-1}$: the previous $k$ tokens (the context), with $k$ the context window size
• $\Theta$: the language model parameters
Autoregressive: predicting the next word given the previous words
• Follows the natural flow of language, though it can also be done in reverse
[Figure: the Language Model ($\Theta$) takes the context $x_{t-k:t-1}$ = "The Steelers enjoy a large, widespread fanbase nicknamed Steeler" and predicts the next token $x_t$ = "Nation".]
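To make the connection explicit (an added note, not on the original slide): with a full context window ($k = t-1$), summing the per-token log-probabilities is exactly the chain-rule factorization of the sequence probability, so maximizing $L_{\text{lm}}$ maximizes the likelihood of the whole sequence:

```latex
% Chain-rule view of the autoregressive objective (added clarification).
\log p(X; \Theta)
  = \log \prod_{t=1}^{n} p(x_t \mid x_{1:t-1}; \Theta)
  = \sum_{t=1}^{n} \log p(x_t \mid x_{1:t-1}; \Theta)
  = L_{\text{lm}}(X)
```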
Autoregressive Language Modeling
The language model can be implemented in many ways
• Discrete, n-gram frequency based:
$$p(x_t \mid x_{t-k:t-1}) = \frac{\text{count}(x_{t-k:t-1}, x_t)}{\text{count}(x_{t-k:t-1})}$$
• Continuous neural network models:
$$p(x_t \mid x_{t-k:t-1}; \Theta) = f(x_t \mid x_{t-k:t-1}; \Theta)$$
• $f(\cdot\,; \Theta)$: a neural network, e.g., a feedforward network, CNN, RNN, or
• Transformer Decoder:
[Figure: Transformer decoder as $f(\cdot\,; \Theta)$ — Input: <s> A B C D E F G H; Target (shifted by one position): A B C D E F G H </s>]
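Below is a minimal PyTorch-style sketch (added for illustration, not from the slides) of how the shifted input/target pairs above become the autoregressive training loss. The tiny sizes, the random integer "tokens", and the use of a causally masked encoder stack as a stand-in decoder-only model are all assumptions of the sketch:

```python
import torch
import torch.nn.functional as F

# Illustrative next-token prediction loss with a (decoder-only) Transformer.
# Model sizes and the random "token ids" below are made up for the example.
vocab_size, d_model, seq_len, batch = 100, 32, 8, 2

embed = torch.nn.Embedding(vocab_size, d_model)
layer = torch.nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
decoder = torch.nn.TransformerEncoder(layer, num_layers=2)  # + causal mask = decoder-only
lm_head = torch.nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (batch, seq_len))    # e.g. <s> A B C ... H
inputs, targets = tokens[:, :-1], tokens[:, 1:]            # shift targets by one position

# Additive causal mask: position t may only attend to positions <= t (unidirectional).
L = inputs.size(1)
causal_mask = torch.full((L, L), float("-inf")).triu(diagonal=1)

hidden = decoder(embed(inputs), mask=causal_mask)          # (batch, L, d_model)
logits = lm_head(hidden)                                   # (batch, L, vocab)

# Sum over t of -log p(x_t | x_<t): a training signal at every token position.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
```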
Autoregressive Language Modeling
Advantages of autoregressive language modeling:
• Intuitive, follows the natural flow of human language
• Aligns with many natural language generation style tasks
• Training signals at every token position in the sequence
Constraints:
• Best suited for decoder-style, i.e., unidirectional, networks → restricts model flexibility
Auto-Encoder Language Modeling
Learn to reconstruct language from a learned hidden representation
• Given the text sequence $X = \{x_1, \dots, x_t, \dots, x_n\}$, the auto-encoder maximizes the reconstruction likelihood:
$$L_{\text{AE}}(X) = \sum_t \log p(x_t \mid x_{t-k:t-1}; \Theta_{\text{dec}}, \boldsymbol{z}), \qquad \boldsymbol{z} = f(X; \Theta_{\text{enc}})$$
• Where:
• $\Theta_{\text{dec}}$: language decoder parameters
• $\Theta_{\text{enc}}$: language encoder parameters
• $\boldsymbol{z}$: the hidden representation. Many viable formulations; in this class it is a neural embedding.
[Figure: auto-encoder example — the Language Encoder ($\Theta_{\text{enc}}$) maps $X$ = "The Steelers enjoy a large, widespread fanbase nicknamed Steeler Nation" into the embedding $\boldsymbol{z}$; the Language Decoder ($\Theta_{\text{dec}}$) reconstructs the sequence from $\boldsymbol{z}$, e.g., predicting $x_t$ = "Nation" after "The Steelers enjoy a large, widespread fanbase nicknamed Steeler".]
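A minimal sketch of the auto-encoder objective, assuming (purely for illustration) GRU encoder/decoder networks and a single-vector bottleneck $\boldsymbol{z}$; real systems may use Transformers and richer formulations of $\boldsymbol{z}$:

```python
import torch
import torch.nn.functional as F

# Illustrative auto-encoder LM with an embedding bottleneck z.
# The GRU encoder/decoder and all sizes are assumptions of this sketch.
vocab_size, d_model, seq_len, batch = 100, 32, 10, 2

embed = torch.nn.Embedding(vocab_size, d_model)
encoder = torch.nn.GRU(d_model, d_model, batch_first=True)
decoder = torch.nn.GRU(d_model, d_model, batch_first=True)
lm_head = torch.nn.Linear(d_model, vocab_size)

x = torch.randint(0, vocab_size, (batch, seq_len))         # input sequence X

# Encoder: compress the whole sequence into one embedding z (the bottleneck).
_, z = encoder(embed(x))                                   # z: (1, batch, d_model)

# Decoder: reconstruct X token by token, conditioned on z as its initial state.
dec_inputs, targets = x[:, :-1], x[:, 1:]                  # teacher forcing
hidden, _ = decoder(embed(dec_inputs), z)
logits = lm_head(hidden)

# Reconstruction loss L_AE: recover the original tokens from z (and previous tokens).
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
```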
Auto-Encoder Language Modeling
The encoder and decoder can be various types of neural networks
• RNN, CNN, Transformers
• The signature is the information bottleneck 𝒛 between encoder and decoder
• Advantage of Auto-Encoder language modeling
• Explicit learning towards the sequence embedding 𝒛
• Allows various operations to convey prior knowledge to 𝒛 for generation, especially for vision-like modalities
• Aligns with language representation tasks that need sequence level embeddings
Early experiments with decoder and auto-encoder
Evaluation set up:
• Task: IMDB sentiment classification
• Given the text of a review from IMDB, classify whether positive or negative
[1] Dai, Andrew M., and Quoc V. Le. "Semi-supervised sequence learning." NeurIPS 2015.
Table 1: Examples of IMDB sentiment classification task [1]
Early experiments with decoder and auto-encoder
Evaluation set up:
• Task: IMDB sentiment classification
• Pretraining: language modeling on 8 million IMDB movie reviews
• Neural network: LSTMs
• Auto-Encoder: discard decoder, fine-tune encoder
• Decoder: fine-tune decoder
One of the earliest explorations of language model pretraining, in 2015 [1]
[1] Dai, Andrew M., and Quoc V. Le. "Semi-supervised sequence learning." NeurIPS 2015.
Method Test Error Rate↓
LSTM (No Pretraining, Finetune Only) 13.5%
Auto-Regressive LSTM Decoder (Pretrain→Finetune) 7.64%
Auto-Encoder LSTM Encoder (Pretrain→Finetune) 7.24%
Auto-Encoder LSTM Encoder (Pretrain + Finetune, Multi-Task) 14.7%
Table 2: Results on IMDB sentiment classification task [1]
Early experiments with decoder and auto-encoder
Observations from Dai and Le [1]:
• Pretraining helps significantly, as a better initialization
• Not only on accuracy but also on stability, and generalization ability
• Decoder LSTM as a representation model is slightly worse than encoder LSTM
• Mixing pretraining and supervised learning hurts.
• It is pre-training.
[1] Dai, Andrew M., and Quoc V. Le. "Semi-supervised sequence learning." NeurIPS 2015.
GPT-1: Pretraining + Transformer Decoder
GPT-1 combines unsupervised pretraining and Transformer network
• Auto-regressive language modeling
• Transformer decoder
Another significant difference: Scale
• Much bigger network
• Transformers are easier to train than LSTMs
• More data
• BooksCorpus, ~1 billion words.
[2] Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).
GPT-1: Experimental Setup
Evaluation Task: GLUE benchmark
• A set of language classification tasks
• Most informative task is Multi-Genre Natural Language Inference (MNLI)
• Given a pair of statements, predict whether one entails, contradicts, or is neutral to the other
Premise | Hypothesis | Label
Conceptually cream skimming has two basic dimensions - product and geography. | Product and geography are what make cream skimming work. | Neutral
Read for Slate's take on Jackson's findings. | Slate had an opinion on Jackson's findings. | Entailment
In an increasingly interdependent world, many pressing problems that affect Americans can be addressed only through cooperation with other countries. | We should be independent and stay away from talking and working with other nations. | Contradiction
Table 3: Examples of MNLI
[2] Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).
GPT-1: Evaluation Results
Results on MNLI and GLUE Average
Transformer is a much stronger architecture than LSTM
• More power
• Much easier to train
Pretraining brings a huge advantage
• Mixing pretraining with finetuning does not really help
[2] Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).
Method MNLI (ACC) GLUE AVG
Pretrained LSTM Decoder 73.7 69.1
Non Pretrained Transformer 75.7 59.9
Pretrained Transformer 81.1 75.0
Pretrained Transformer + LM Multi-Task Finetune 81.8 74.7
Table 4: GPT-1 Results on GLUE [2]
Early Insights on Pretraining and Transformer
Early glimpse of zero-shot task solving
Improving zero-shot performance with more pretraining steps
• Sudden jumps in performance on some tasks
• Different benefits on different tasks
Many benefits as a starting point of finetuning
• Not only a faster initialization but a better one
• Necessary for tasks with limited labels
[2] Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).
Figure 1: GPT-1 GLUE Performance at Different Stages [2]
Pretraining by Denoising Task
Denoising training
• Reconstruct the original input from an input mixed with noise
• A variety of ways to construct the noisy input
• A classic unsupervised learning task used in many modalities
• Language, vision, molecules, etc.
Figure 2: Example of Vision Denoising Training [3]
[3] Brempong, Emmanuel Asiedu, et al. "Denoising pretraining for semantic segmentation."
CVPR 2022.
Masked Language Modeling
Masked Language Modeling, the denoising pretraining used in BERT
• Noisy Input: Text sequence with masked out token positions
• Reconstruction Target: Original tokens at masked out positions
• Let $X_{\text{MSK}} = \{x_1, \dots, [\text{MSK}]_t, \dots, x_n\}$ be a text sequence of $n$ tokens with the positions $t \in M$ replaced with [MSK] tokens;
• the Masked LM task is to maximize the likelihood of recovering the masked-out tokens:
$$L_{\text{MLM}}(X) = \sum_{t \in M} \log p(x_t \mid X_{\text{MSK}}; \Theta)$$
[Figure: the Masked Language Model ($\Theta$) takes $X_{\text{MSK}}$ = "The Steelers [MSK] a large, widespread [MSK] nicknamed Steeler Nation" and predicts the original tokens $x_t$ = "enjoy" and "fanbase" at the masked positions.]
[4] Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding."
NAACL-HLT. 2019.
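A minimal sketch (added for illustration) of constructing $X_{\text{MSK}}$ and computing the loss only at masked positions. The toy sizes, the plain 15% masking without BERT's 80/10/10 replacement rule, and the generic bidirectional encoder are assumptions of the sketch:

```python
import torch
import torch.nn.functional as F

# Illustrative masked-LM data construction and loss. Sizes, the [MSK] id, and the
# simple 15% masking (no 80/10/10 replacement rule) are simplifications.
vocab_size, d_model, seq_len, batch = 100, 32, 12, 2
mask_id, mask_frac = 0, 0.15

embed = torch.nn.Embedding(vocab_size, d_model)
layer = torch.nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
encoder = torch.nn.TransformerEncoder(layer, num_layers=2)   # bidirectional: no causal mask
mlm_head = torch.nn.Linear(d_model, vocab_size)

x = torch.randint(1, vocab_size, (batch, seq_len))           # original tokens
is_masked = torch.rand(batch, seq_len) < mask_frac           # sampled positions M
x_msk = torch.where(is_masked, torch.full_like(x, mask_id), x)

hidden = encoder(embed(x_msk))                               # every position sees full context
logits = mlm_head(hidden)

# Loss only at the masked positions t in M (assumes at least one position is masked);
# only ~15% of positions contribute a training signal per sequence.
loss = F.cross_entropy(logits[is_masked], x[is_masked])
```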
BERT Pretraining with Masked LM
BERT uses a bi-directional Transformer encoder as the language model
• Forward pass: $\boldsymbol{H} = \text{Transformer}(X_{\text{MSK}})$
• Mask LM head:
$$p_{\text{MLM}}(x \mid \boldsymbol{h}_t) = \frac{\exp(\boldsymbol{x}^{\top}\boldsymbol{h}_t)}{\sum_{x_i \in V} \exp(\boldsymbol{x}_i^{\top}\boldsymbol{h}_t)}$$
• Mask LM loss:
$$L_{\text{MLM}} = \mathbb{E}\Big(-\sum_{t \in M} \log p_{\text{MLM}}(x_t \mid \boldsymbol{h}_t)\Big)$$
Where:
• $\boldsymbol{x}$: the embedding of token $x$
• $\boldsymbol{H}$, $\boldsymbol{h}_t$: the Transformer's last-layer representations, and the one at the $t$-th position
[Figure: $X_{\text{MSK}}$ → Transformer → $\boldsymbol{H}$ → MLM Head → $p_{\text{MLM}}(x \mid \boldsymbol{h}_t)$]
[4] Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL-HLT. 2019.
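A sketch of just the MLM head above: score every vocabulary embedding $\boldsymbol{x}$ against the hidden state $\boldsymbol{h}_t$ and normalize with a softmax over $V$. Reusing the input embedding matrix as the output projection (weight tying) is an assumption of this sketch:

```python
import torch
import torch.nn.functional as F

# Sketch of the MLM head: x^T h_t for every token x in V, then softmax.
# Sizes are illustrative; E doubles as the output projection (weight tying).
vocab_size, d_model, num_masked = 100, 32, 5

E = torch.nn.Embedding(vocab_size, d_model)      # token embeddings x for all of V
h = torch.randn(num_masked, d_model)             # h_t at the masked positions
targets = torch.randint(0, vocab_size, (num_masked,))

logits = h @ E.weight.T                          # scores x^T h_t over the vocabulary
log_p_mlm = F.log_softmax(logits, dim=-1)        # log p_MLM(x | h_t)

# L_MLM: negative log-likelihood of the original tokens at the masked positions.
loss = -log_p_mlm[torch.arange(num_masked), targets].mean()
```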
BERT: Experimental Setup
Notable hyper-parameters
• Both became standard experimental settings in the pretraining literature
• Base setting is chosen to be close to GPT-1 for comparison
Other important setups
• Mask fraction: 15%
• Optimizer: Adam with warm up
Total Parameters | Transformer Layers | Hidden Dimensions | Sequence Length | Pretraining Corpus | Pretraining Steps
BERTbase: 110M | 12 | 768 | 512 | Wikipedia (2.5 billion words) + BookCorpus (0.8 billion words) | 128K tokens/batch × 1M steps
BERTlarge: 340M | 24 | 1024 | 512 | (same corpus) | (same schedule)
[4] Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding."
NAACL-HLT. 2019.
Table 5: BERT base and large configurations
BERT: Experimental Setup
Evaluation Tasks: GLUE, SQuAD, and many more
SQuAD: Question answering, reading comprehension style
• Given a natural language question and a passage, find the span (n-gram) answer in the passage
• Evaluate by matching the target answer phrase
• A good representative of several types of NLP tasks:
• Knowledge-intensive: Questions require “human knowledge” to answer
• Token-level tasks: Label prediction at token level
• One of the early QA experiences in commercial search engines (extractive QA)
[4] Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding."
NAACL-HLT. 2019.
Question: What kind of music does Beyonce do?
Passage: Beyoncé's music is generally R&B, but she also incorporates pop, soul and funk into
her songs. 4 demonstrated Beyoncé's exploration of 90s-style R&B, as well as further
use of soul and hip hop than compared to previous releases….
Target Answer: R&B
Table 6: SQuAD Example
BERT: Evaluation Results
Results on MNLI, GLUE Average, and the SQuAD 1.1 development set
Much stronger results than GPT-1
• More flexible architecture (allows bidirectional attention paths)
• More data (Wiki + BookCorpus)
Significant gains by scaling from base to large
[4] Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding."
NAACL-HLT. 2019.
MNLI (ACC) GLUE AVG SQuAD (F1)
ELMO 76.3 71.0 85.6
GPT-1 81.8 75.1 n.a.
BERTbase 84.0 79.6 88.5
BERTlarge 86.3 82.1 90.9
Table 7: BERT Evaluation Results [4]
BERT: Analysis
Benefits of Masked LM
Significant benefits from using Masked LM
• Hard to apply MLM to decoder-only models
Auto-regressive LM starts faster
• But is quickly surpassed by Masked LM
Figure 3: BERT finetuned accuracy after different pretraining
steps with Masked LM and Auto-regressive LM [4]
[4] Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding."
NAACL-HLT. 2019.
More Finessed Denoising Task: Span Masking
Span Masking: instead of masking randomly sampled token positions, mask out contiguous spans of tokens
[5] Joshi, Mandar, et al. "SpanBERT: Improving pre-training by representing and predicting spans."
TACL 2020.
[Figure: the Masked Language Model ($\Theta$) takes $X_{\text{SpanMSK}}$ = "The Steelers enjoy a large, widespread [MSK] [MSK] [MSK] Nation" and predicts the masked span $x_{t:t+3}$ = "fanbase nicknamed Steeler".]
• Span sampling:
• Sample a span length (# of tokens) from a geometric distribution
• Randomly sample a starting point of the span to mask
• Repeat until the total mask fraction (15%) is reached (see the sketch after Figure 4 below)
Figure 4: Geometric distribution used to sample span lengths in SpanBERT [5]
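A sketch of the span sampling procedure; the parameter values (p = 0.2, maximum span 10, 15% budget) follow the commonly cited SpanBERT settings but should be treated as assumptions here, and the helper name is illustrative:

```python
import numpy as np

def span_mask_positions(seq_len, mask_frac=0.15, p=0.2, max_span=10, rng=None):
    """Sample contiguous spans to mask: draw a span length from a geometric
    distribution (clipped at max_span), pick a random start, and repeat until
    roughly mask_frac of the positions are covered. Values are assumptions."""
    rng = rng or np.random.default_rng()
    masked = set()
    budget = int(round(seq_len * mask_frac))
    while len(masked) < budget:
        length = min(int(rng.geometric(p)), max_span)        # span length (# of tokens)
        start = int(rng.integers(0, seq_len - length + 1))   # random starting point
        masked.update(range(start, start + length))          # may slightly overshoot budget
    return sorted(masked)

# Example: mask ~15% of a 100-token sequence in contiguous spans.
positions = span_mask_positions(seq_len=100)
```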
Benefits:
• Slightly higher granularity (tokens to phrases), thus harder and arguably more semantic
• Aligns well with some downstream applications, e.g., SQuAD
More Finessed Denoising Task: Salient Span Masking
Salient Span Mask (SSM): Masking out spans corresponding to entities and attributes (salient)
First, use a fine-tuned BERT tagger to identify named entities, plus rules to tag dates (the salient spans)
Then sample the span masks from these salient spans
Benefits:
• A lightweight way of introducing knowledge
• Directly targeting knowledge-intensive tasks, e.g., dates
[Figure: the Masked Language Model ($\Theta$) takes $X_{\text{SSM}}$ = "The Steelers enjoy a large, widespread fanbase nicknamed [MSK] [MSK]." and predicts the salient span $x_{t:t+2}$ = "Steeler Nation".]
[6] Guu, Kelvin, et al. "Retrieval augmented language model pre-training."
ICML, 2020
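A sketch of the masking step, assuming the salient spans (entities, dates) have already been identified by an external tagger; the helper and its interface are illustrative, not the REALM implementation:

```python
import random

def salient_span_mask(tokens, salient_spans, mask_token="[MSK]", rng=random):
    """Mask one salient span: `salient_spans` holds (start, end) token spans
    (end exclusive) produced by an upstream NER/date tagger. Illustrative only."""
    start, end = rng.choice(salient_spans)            # pick one salient span
    masked = list(tokens)
    masked[start:end] = [mask_token] * (end - start)  # replace the span with [MSK]
    targets = tokens[start:end]                       # reconstruction targets
    return masked, targets

tokens = "The Steelers enjoy a large , widespread fanbase nicknamed Steeler Nation".split()
masked, targets = salient_span_mask(tokens, salient_spans=[(9, 11)])
# masked[9:11] == ["[MSK]", "[MSK]"]; targets == ["Steeler", "Nation"]
```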
Recap: Autoregressive LM and Masked LM
Autoregressive LM vs. Masked LM:
• Neural architecture: more suited for decoders vs. both encoders and decoders
• Training density: all token positions vs. only the ~15% masked positions
• Converging speed/stability: fast and stable vs. slower and less stable
• Task fit: generation vs. representation
• Notable models: GPT-* vs. BERT
Table 8: Recap of Autoregressive LM and Masked LM
Combination of Auto-Regressive and Masked LM
Various efforts to combine the benefits of Auto-Regressive LM and Masked LM
• One model for both generation and representation
• Better training effectiveness from multi-task learning?
Notable examples:
• UniLM: Dong, Li, et al. "Unified language model pre-training for natural language understanding and generation."
NeurIPS 2019.
• XL-NET: Yang et al. “XL-NET: Generalized autoregressive pretraining for language understanding." NeurIPS 2019.
Transformer Encoder-Decoders
Much of the difference of auto-regressive versus Masked LM also resides in the Transformer architecture:
• Encoder: bi-directional representation power
• Decoder: natural generation
Transformer Encoder-Decoders enjoy the benefits of both
• Flexible for various types of denoising tasks
• Support different downstream applications with either side, or both together
[Figure: Transformer Encoder-Decoder — encoder input: <s> A B C D E F G </s>; decoder input: <s> H I J K L M N O; decoder target: H I J K L M N O </s>]
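A minimal sketch of one encoder-decoder training step matching the figure above: the encoder reads the input bidirectionally, and the decoder predicts the target autoregressively with a causal mask plus cross-attention to the encoder. Sizes and the random token ids are assumptions of the sketch:

```python
import torch
import torch.nn.functional as F

# Illustrative encoder-decoder (seq2seq) denoising step.
vocab_size, d_model, batch, src_len, tgt_len = 100, 32, 2, 9, 9

embed = torch.nn.Embedding(vocab_size, d_model)
model = torch.nn.Transformer(d_model=d_model, nhead=4, num_encoder_layers=2,
                             num_decoder_layers=2, batch_first=True)
lm_head = torch.nn.Linear(d_model, vocab_size)

src = torch.randint(0, vocab_size, (batch, src_len))   # e.g. <s> A B C D E F G </s>
tgt = torch.randint(0, vocab_size, (batch, tgt_len))   # e.g. <s> H I J K L M N O </s>

tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]              # teacher-forcing shift
L = tgt_in.size(1)
causal = torch.full((L, L), float("-inf")).triu(diagonal=1)  # decoder-side causal mask

# Encoder side is fully bidirectional; the decoder attends causally to itself
# and cross-attends to the encoder outputs.
hidden = model(embed(src), embed(tgt_in), tgt_mask=causal)
logits = lm_head(hidden)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), tgt_out.reshape(-1))
```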
T5: Text-to-Text Transfer Transformers
Encoder-Decoder Transformer pretrained with language modeling tasks
• The flexibility allowed T5 to explore many different denoising tasks
[7] Raffel, Colin, et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." JMLR. 2020.
Table 9: Pretraining Tasks Explored in T5 [7].
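A sketch of the "replace corrupted spans" task from Table 9, patterned after the example in the T5 paper; the sentinel tokens follow T5's <extra_id_N> convention, while the helper itself and the hand-picked spans are illustrative (T5 samples them randomly):

```python
def span_corruption(tokens, spans):
    """Illustrative T5-style 'replace corrupted spans' construction.
    `spans` is a list of (start, end) token spans to corrupt (end exclusive);
    how the spans are chosen (e.g., randomly to ~15% corruption) is left out."""
    inputs, targets, last = [], [], 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs += tokens[last:start] + [sentinel]     # keep surrounding text, drop the span
        targets += [sentinel] + tokens[start:end]     # target recovers the dropped span
        last = end
    inputs += tokens[last:]
    targets += [f"<extra_id_{len(spans)}>"]           # closing sentinel
    return inputs, targets

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corruption(tokens, spans=[(2, 4), (8, 9)])
# inp: Thank you <extra_id_0> me to your party <extra_id_1> week
# tgt: <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```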
T5 Pretraining Task Studies
Use of T5: fine-tuned so that
• the encoder takes the task input
• the decoder generates the label word, e.g., "Entailment" for MNLI
• Different variations of the Masked-LM style denoising task performed similarly
Table 9: Pretraining Tasks Results with T5 base [7].
[7] Raffel, Colin, et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." JMLR. 2020.
Denoising Task GLUE AVG SQuAD
Auto-Regressive LM 80.7 78.0
De-shuffling 73.2 67.6
Masked-LM, Reconstruct All 83.0 80.7
Replace Corrupted Spans 83.3 80.9
Drop Corrupted Tokens 84.4 80.5
BART Pretraining Tasks
Various denoising tasks explored with BART’s encoder-decoder
• Both sentence level and token level
• Flexible architecture enabled reconstruction from various types of noises
[8] Lewis, Mike, et al. "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language
Generation, Translation, and Comprehension." ACL. 2020.
Figure 5: Denoising Tasks Explored in BART [8]
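A sketch of two of the noise functions in Figure 5 (sentence shuffling and token deletion); these are simplified illustrations rather than BART's actual preprocessing, and the 15% deletion rate is an assumption:

```python
import random

def sentence_shuffling(text, rng=random):
    """Shuffle sentence order; the model must restore the original document."""
    sentences = [s for s in text.split(". ") if s]
    rng.shuffle(sentences)
    return ". ".join(sentences)

def token_deletion(tokens, delete_frac=0.15, rng=random):
    """Delete random tokens; unlike masking, the model must also decide
    which positions are missing. delete_frac is an assumed rate."""
    return [tok for tok in tokens if rng.random() > delete_frac]

noisy_doc = sentence_shuffling("The Steelers enjoy a large fanbase. It is nicknamed Steeler Nation")
noisy_tokens = token_deletion("The Steelers enjoy a large widespread fanbase".split())
```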
BART Pretraining Task Studies
Use of BART:
• Representation style tasks: feed same inputs to both encoder and decoder, use decoder representations
• Generation: use decoder
• Different variations of Masked-LM style denoising task performed similarly
MNLI (Acc) SQuAD (F1)
Document Rotation 75.3 77.2
Sentence Shuffling 81.5 85.4
Token Masking 84.1 90.4
Token Deletion 84.1 90.4
Text Infilling 84.0 90.8
Text Infilling + Sentence Shuffling 83.8 90.8
Table 10: Pretraining Tasks Results with BART base [8].
Pretraining Tasks: Summary
Classic Auto-Regressive LM and BERT’s Masked LM are very effective
• A solid foundation to scale up
Early explorations of variant language modeling tasks did not obtain much general improvement
• Application-specific gains are observed more often
• All take the form of (rule-based random noise + reconstruction target)
Sequence-level tasks do not show much benefit on tasks like GLUE and SQuAD
• Hard to see how strong "semantics", "knowledge", or "intelligence" would emerge from some sequence-level tasks
TL;DR: for base scale LMs
• Generation→ Auto-Regressive LM
• Representation→ Masked LM
Questions?
References: Pretraining Objectives
• [Pretraining] Dai, Andrew M., and Quoc V. Le. "Semi-supervised sequence learning." Advances in neural
information processing systems 28 (2015).
• [ELMO] Sarzynska-Wawer, Justyna, et al. "Detecting formal thought disorder by deep contextualized word
representations." Psychiatry Research 304 (2021): 114135.
• [GPT] Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).
• [BERT] Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." In
Proceedings of NAACL-HLT, pp. 4171-4186. 2019.
• [XL-NET] Yang, Zhilin, et al. "XLNet: Generalized autoregressive pretraining for language understanding."
Advances in neural information processing systems 32 (2019).
• [SpanBERT] Joshi, Mandar, et al. "Spanbert: Improving pre-training by representing and predicting spans."
Transactions of the Association for Computational Linguistics 8 (2020): 64-77.
• [REALM] Guu, Kelvin, et al. "Retrieval augmented language model pre-training." International conference on
machine learning. PMLR, 2020.
• [BART] Lewis, Mike, et al. “BART: Denoising sequence-to-sequence pre-training for natural language generation,
translation, and comprehension." arXiv preprint arXiv:1910.13461 (2019).
• [T5] Raffel, Colin, et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." The
Journal of Machine Learning Research 21.1 (2020): 5485-5551.