Pretraining and Language Modeling
Pre-training: An unsupervised learning phase before traditional supervised learning
• Original goal: provide better initialization points for supervised training
Language modeling: Predict a part of a given language piece (target) using the rest (context)
• A classic NLP task for modeling how humans use natural language
Why language modeling as the pretraining task?
• Effectively infinite data, far more than current computing systems can consume
• Trillions of web pages have already been processed
• Much more is still being discovered
• Language, a main carrier of human knowledge
• We learn, communicate, and invent through language
• Other modalities often centered around language
• Not all tasks need language, but it is debatable whether such tasks constitute “human intelligence”
• Many real-world applications are centered around language
• Search, machine translation, question answering, writing assistance, etc.
Autoregressive Language Modeling
Classic language modeling: Given previous words, predict the next word
• Let $X = \{x_1, \dots, x_t, \dots, x_n\}$ be a text sequence of $n$ tokens. The standard language modeling objective is to maximize the likelihood:
$$L_{\text{LM}}(X) = \sum_{t} \log p(x_t \mid x_{t-k:t-1}; \Theta)$$
• Where:
• $x_t$: the $t$-th token, the prediction target
• $x_{t-k:t-1}$: the previous $k$ tokens (context), with $k$ the context window size
• $\Theta$: the language model parameters
Autoregressive: predicting the next word given previous words
• Follows the natural order of language, though prediction can also be done in reverse
Example: given the context $x_{t-k:t-1}$ = “The Steelers enjoy a large, widespread fanbase nicknamed Steeler”, the language model ($\Theta$) predicts the next token $x_t$ = “Nation”.
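As a concrete toy illustration of the objective, the sketch below sums the log-probabilities of each observed next token under a hand-made conditional distribution; the probabilities and the context size $k=2$ are made up purely for illustration.

```python
import math

# Toy illustration of L_LM(X) = sum_t log p(x_t | x_{t-k:t-1}; Θ).
# The conditional probabilities and k are made up purely for illustration.
p = {
    ("<s>",): {"the": 0.4, "a": 0.3},
    ("<s>", "the"): {"steelers": 0.05, "team": 0.1},
    ("the", "steelers"): {"enjoy": 0.02, "won": 0.1},
}

tokens = ["<s>", "the", "steelers", "enjoy"]
k = 2  # context window size

log_likelihood = 0.0
for t in range(1, len(tokens)):
    context = tuple(tokens[max(0, t - k):t])            # x_{t-k:t-1}
    log_likelihood += math.log(p[context][tokens[t]])   # log p(x_t | context)

print(log_likelihood)  # the quantity the language model is trained to maximize
```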
Autoregressive Language Modeling
The language model can be implemented in many ways
• Discrete, n-gram frequency based:
$$p(x_t \mid x_{t-k:t-1}) = \frac{\text{count}(x_{t-k:t-1}, x_t)}{\text{count}(x_{t-k:t-1})}$$
• Continuous neural network models:
$$p(x_t \mid x_{t-k:t-1}; \Theta) = f(x_t \mid x_{t-k:t-1}; \Theta)$$
• $f(\cdot\,; \Theta)$: a neural network, e.g., a feedforward network, CNN, RNN, or a Transformer decoder
• Transformer Decoder:
Input: <s> A B C D E F G H
$f(\cdot\,;\Theta)$: Transformer Decoder
Target: A B C D E F G H </s>
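A minimal sketch of this decoder-style training setup, assuming a toy vocabulary and random stand-in logits in place of a real Transformer decoder; it only shows how the shifted input/target pair from the figure and the per-position loss are formed.

```python
import torch
import torch.nn as nn

# Toy vocabulary matching the figure; a real setup would use a subword tokenizer.
vocab = {tok: i for i, tok in enumerate(
    ["<s>", "</s>", "A", "B", "C", "D", "E", "F", "G", "H"])}
sequence = ["A", "B", "C", "D", "E", "F", "G", "H"]

# Input is the sequence shifted right (prefixed with <s>);
# target is the sequence shifted left (suffixed with </s>), as in the figure.
input_ids = torch.tensor([[vocab["<s>"]] + [vocab[t] for t in sequence]])
target_ids = torch.tensor([[vocab[t] for t in sequence] + [vocab["</s>"]]])

# Stand-in for f(.;Θ): any causal model mapping input ids to per-position
# logits over the vocabulary would go here; random logits keep the sketch short.
logits = torch.randn(1, input_ids.size(1), len(vocab), requires_grad=True)

# L_LM: cross-entropy (negative log-likelihood) at every token position.
loss = nn.functional.cross_entropy(
    logits.view(-1, len(vocab)), target_ids.view(-1))
loss.backward()
print(float(loss))
```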
Autoregressive Language Modeling
Advantages of autoregressive language modeling:
• Intuitive, follows the natural flow of human language
• Aligns with many natural language generation style tasks
• Training signals at every token position in the sequence
Constraints:
• Mostly tied to decoder-style models, i.e., unidirectional networks → restricts model flexibility
Auto-Encoder Language Modeling
Learn to reconstruct language from a learned hidden representation
• Given the text sequence $X = \{x_1, \dots, x_t, \dots, x_n\}$, the auto-encoder maximizes the reconstruction likelihood:
$$L_{\text{AE}}(X) = \sum_{t} \log p(x_t \mid x_{t-k:t-1}; \Theta_{\text{dec}}, \boldsymbol{z}), \quad \boldsymbol{z} = f(X; \Theta_{\text{enc}})$$
• Where:
• $\Theta_{\text{dec}}$: language decoder parameters
• $\Theta_{\text{enc}}$: language encoder parameters
• $\boldsymbol{z}$: the hidden representation. Many formulations are viable; in this class it is a neural embedding.
Example: the language encoder ($\Theta_{\text{enc}}$) reads the full sequence $X$ = “The Steelers enjoy a large, widespread fanbase nicknamed Steeler Nation” and compresses it into the embedding $\boldsymbol{z}$; the language decoder ($\Theta_{\text{dec}}$), conditioned on $\boldsymbol{z}$, reconstructs the sequence, e.g., predicting $x_t$ = “Nation” from “The Steelers enjoy a large, widespread fanbase nicknamed Steeler”.
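A minimal sketch of the auto-encoder objective, assuming a deliberately simple mean-pooled encoder and a one-step-context decoder with toy sizes; real models would use RNNs or Transformers on both sides.

```python
import torch
import torch.nn as nn

# Toy sizes; one random "sequence" of 10 token ids stands in for X.
vocab_size, hidden = 100, 32
token_ids = torch.randint(0, vocab_size, (1, 10))

embed = nn.Embedding(vocab_size, hidden)

# Encoder z = f(X; Θ_enc): compress the whole sequence into one embedding.
encoder = nn.Linear(hidden, hidden)
z = encoder(embed(token_ids).mean(dim=1))            # (1, hidden) bottleneck

# Decoder p(x_t | context, z; Θ_dec): here the context is only the previous token.
decoder = nn.Linear(2 * hidden, vocab_size)
prev = embed(token_ids[:, :-1])                      # previous-token context
z_expanded = z.unsqueeze(1).expand(-1, prev.size(1), -1)
logits = decoder(torch.cat([prev, z_expanded], dim=-1))

# L_AE: reconstruct the original tokens conditioned on the bottleneck z.
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), token_ids[:, 1:].reshape(-1))
print(float(loss))
```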
Auto-Encoder Language Modeling
The encoder and decoder can be various types of neural networks
• RNNs, CNNs, Transformers
• The signature is the information bottleneck $\boldsymbol{z}$ between encoder and decoder
• Advantages of auto-encoder language modeling:
• Explicit learning of the sequence embedding $\boldsymbol{z}$
• Allows various operations to convey prior knowledge through $\boldsymbol{z}$ for generation, especially for vision-like modalities
• Aligns with language representation tasks that need sequence-level embeddings
Early experiments with decoder and auto-encoder
Evaluation setup:
• Task: IMDB sentiment classification
• Given the text of an IMDB review, classify whether it is positive or negative
[1] Dai, Andrew M., and Quoc V. Le. "Semi-supervised sequence learning." NeurIPS 2015.
Table 1: Examples of IMDB sentiment classification task [1]
• Pretraining: language modeling on 8 million IMDB movie reviews
• Neural network: LSTMs
• Auto-Encoder: discard decoder, fine-tune encoder
• Decoder: fine-tune decoder
One of the earliest explorations of language model pretraining, in 2015 [1]
Method Test Error Rate↓
LSTM (No Pretraining, Finetune Only) 13.5%
Auto-Regressive LSTM Decoder (Pretrain→Finetune) 7.64%
Auto-Encoder LSTM Encoder (Pretrain→Finetune) 7.24%
Auto-Encoder LSTM Encoder (Pretrain + Finetune, Multi-Task) 14.7%
Table 2: Results on IMDB sentiment classification task [1]
Early experiments with decoder and auto-encoder
Observations from Dai and Le [1]:
• Pretraining helps significantly, as a better initialization
• Not only in accuracy, but also in stability and generalization ability
• Decoder LSTM as a representation model is slightly worse than encoder LSTM
• Mixing pretraining and supervised learning hurts.
• It is pre-training.
[1] Dai, Andrew M., and Quoc V. Le. "Semi-supervised sequence learning." NeurIPS 2015.
GPT-1: Pretraining + Transformer Decoder
GPT-1 combines unsupervised pretraining with the Transformer network
• Auto-regressive language modeling
• Transformer decoder
Another significant difference: Scale
• Much bigger network
• Transformers are easier to train than LSTM
• More data
• Books Corpus, ~1 billion words.
[2] Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).
GPT-1: Experimental Setup
Evaluation Task: GLUE benchmark
• A set of language classification tasks
• The most informative task is Multi-Genre Natural Language Inference (MNLI)
• Given a pair of statements, predict whether one entails, contradicts, or is neutral to the other
Premise | Hypothesis | Label
Conceptually cream skimming has two basic dimensions - product and geography. | Product and geography are what make cream skimming work. | Neutral
Read for Slate's take on Jackson's findings. | Slate had an opinion on Jackson's findings. | Entailment
In an increasingly interdependent world, many pressing problems that affect Americans can be addressed only through cooperation with other countries. | We should be independent and stay away from talking and working with other nations. | Contradiction
Table 3: Examples of MNLI
[2] Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).
GPT-1: Evaluation Results
Results on MNLI and GLUE Average
Transformer is a much stronger architecture than LSTM
• More power
• Much easier to train
Pretraining brings a huge advantage
• Mixing pretraining with finetuning does not really help
[2] Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).
Method MNLI (ACC) GLUE AVG
Pretrained LSTM Decoder 73.7 69.1
Non Pretrained Transformer 75.7 59.9
Pretrained Transformer 81.1 75.0
Pretrained Transformer + LM Multi-Task Finetune 81.8 74.7
Table 4: GPT-1 Results on GLUE [2]
Early Insights on Pretraining and Transformer
Early glimpse of zero-shot task solving
Zero-shot performance improves with more pretraining steps
• Sudden bursts of improvement on some tasks
• Different tasks benefit to different degrees
Many benefits as a starting point for finetuning
• Not only a faster initialization but a better one
• Necessary for tasks with limited labels
[2] Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).
Figure 1: GPT-1 GLUE Performance at Different Stages [2]
Pretraining by Denoising Task
Denoising training
• Reconstruct the original input from an input corrupted with noise
• A variety of ways to construct the noisy input
• A classic unsupervised learning task used in many modalities
• Language, vision, molecular, etc.
Figure 2: Example of Vision Denoising Training [3]
[3] Brempong, Emmanuel Asiedu, et al. "Denoising pretraining for semantic segmentation."
CVPR 2022.
Masked Language Modeling
Masked Language Modeling, the denoising pretraining used in BERT
• Noisy Input: Text sequence with masked out token positions
• Reconstruction Target: Original tokens at masked out positions
• Let $X^{\text{MSK}} = \{x_1, \dots, [\text{MSK}]_t, \dots, x_n\}$ be a text sequence of $n$ tokens with the positions $t \in M$ replaced by [MSK] tokens
• The Masked LM task maximizes the likelihood of recovering the masked-out tokens:
$$L_{\text{MLM}}(X) = \sum_{t \in M} \log p(x_t \mid X^{\text{MSK}}; \Theta)$$
Example: given $X^{\text{MSK}}$ = “The Steelers [MSK] a large, widespread [MSK] nicknamed Steeler Nation”, the masked language model ($\Theta$) predicts the original tokens $x_t$ = “enjoy” and “fanbase” at the masked positions.
[4] Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding."
NAACL-HLT. 2019.
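A minimal sketch of constructing $X^{\text{MSK}}$, assuming whitespace tokens and a 15% mask fraction; BERT's actual recipe additionally replaces some selected tokens with random tokens or leaves them unchanged, which is omitted here.

```python
import random

# Construct X_MSK from a whitespace-tokenized sentence with a 15% mask fraction.
random.seed(0)
tokens = ("The Steelers enjoy a large , widespread fanbase "
          "nicknamed Steeler Nation").split()

mask_fraction = 0.15
num_to_mask = max(1, round(mask_fraction * len(tokens)))
masked_positions = sorted(random.sample(range(len(tokens)), num_to_mask))

x_msk = list(tokens)
targets = {}
for t in masked_positions:
    targets[t] = x_msk[t]     # reconstruction target: the original token x_t
    x_msk[t] = "[MSK]"        # noisy input: replace the position with [MSK]

print(" ".join(x_msk))
print(targets)                # the model is trained to recover these tokens
```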
BERT Pretraining with Masked LM
BERT uses a bi-directional Transformer encoder as the language model
• Forward pass: $X^{\text{MSK}} \rightarrow \text{Transformer} \rightarrow \boldsymbol{H} \rightarrow \text{MLM Head} \rightarrow p_{\text{MLM}}(x \mid \boldsymbol{h}_t)$
• Mask LM head:
$$p_{\text{MLM}}(x \mid \boldsymbol{h}_t) = \frac{\exp(\boldsymbol{x}^{T}\boldsymbol{h}_t)}{\sum_{x_i \in V} \exp(\boldsymbol{x}_i^{T}\boldsymbol{h}_t)}$$
• Mask LM loss:
$$L_{\text{MLM}} = \mathbb{E}\Big(-\sum_{t \in M} \log p_{\text{MLM}}(x_t \mid \boldsymbol{h}_t)\Big)$$
Where:
• $\boldsymbol{x}$: the embedding of token $x$
• $\boldsymbol{H}$, $\boldsymbol{h}_t$: the Transformer's last-layer representations, and the representation at the $t$-th position
[4] Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL-HLT. 2019.
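A minimal sketch of the MLM head and loss above, assuming toy sizes, random stand-in encoder outputs $\boldsymbol{H}$, and softmax weights tied to the token embeddings.

```python
import torch
import torch.nn as nn

# Toy sizes; random H stands in for the bi-directional encoder's last layer.
vocab_size, hidden, seq_len = 1000, 64, 12
token_embeddings = nn.Embedding(vocab_size, hidden)    # rows are the vectors x

original_ids = torch.randint(0, vocab_size, (seq_len,))
masked_positions = torch.tensor([2, 7])                # the set M

H = torch.randn(seq_len, hidden)                       # stand-in for H on X_MSK

# p_MLM(x | h_t): softmax over the vocabulary of x^T h_t (tied embeddings).
logits = H[masked_positions] @ token_embeddings.weight.T   # (|M|, vocab)
log_probs = torch.log_softmax(logits, dim=-1)

# L_MLM: average negative log-likelihood of the original tokens at masked positions.
loss = -log_probs[torch.arange(len(masked_positions)),
                  original_ids[masked_positions]].mean()
print(float(loss))
```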
BERT: Experimental Setup
Notable hyper-parameters
• Both became standard experimental settings in the pretraining literature
• Base setting is chosen to be close to GPT-1 for comparison
Other important setups
• Mask fraction: 15%
• Optimizer: Adam with warm up
Model | Total Parameters | Transformer Layers | Hidden Dimensions | Sequence Length | Pretraining Corpus | Pretraining Steps
BERTbase | 110M | 12 | 768 | 512 | Wikipedia (2.5 billion words) + BookCorpus (0.8 billion) | 128K tokens/batch × 1M steps
BERTlarge | 340M | 24 | 1024 | 512 | Wikipedia (2.5 billion words) + BookCorpus (0.8 billion) | 128K tokens/batch × 1M steps
[4] Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding."
NAACL-HLT. 2019.
Table 5: BERT base and large configurations
BERT: Experimental Setup
Evaluation Tasks: GLUE, SQuAD, and many more
SQuAD: Question answering, reading comprehension style
• Given a natural language question and a passage, find the span (n-gram) answer in the passage
• Evaluate by matching the target answer phrase
• A good representative of several types of NLP tasks:
• Knowledge-intensive: Questions require “human knowledge” to answer
• Token-level tasks: Label prediction at token level
• One of the early QA experiences in commercial search engines (extractive QA)
[4] Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding."
NAACL-HLT. 2019.
Question: What kind of music does Beyonce do?
Passage: Beyoncé's music is generally R&B, but she also incorporates pop, soul and funk into
her songs. 4 demonstrated Beyoncé's exploration of 90s-style R&B, as well as further
use of soul and hip hop than compared to previous releases….
Target Answer: R&B
Table 6: SQuAD Example
29. Chenyan Xiong 11-667 CMU
29
BERT: Evaluation Results
Results on MNLI, GLUE Average, and the SQuAD 1.1 development set
Much stronger results than GPT-1
• More flexible architecture (allows bidirectional attention paths)
• More data (Wiki + BookCorpus)
Significant gains by scaling from base to large
[4] Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding."
NAACL-HLT. 2019.
MNLI (ACC) GLUE AVG SQuAD (F1)
ELMO 76.3 71.0 85.6
GPT-1 81.8 75.1 n.a.
BERTbase 84.0 79.6 88.5
BERTlarge 86.3 82.1 90.9
Table 7: BERT Evaluation Results [4]
BERT: Analysis
Benefits of Masked LM
Significant benefits from using Masked LM
• MLM is hard to apply to decoder-only models
Auto-regressive LM starts off faster
• But is quickly surpassed by Masked LM
Figure 3: BERT finetuned accuracy after different pretraining
steps with Masked LM and Auto-regressive LM [4]
[4] Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding."
NAACL-HLT. 2019.
More Finessed Denoising Task: Span Masking
Span Masking: instead of masking randomly sampled individual token positions, mask out contiguous spans of positions
[5] Joshi, Mandar, et al. "SpanBERT: Improving pre-training by representing and predicting spans."
TACL 2020.
Example: given $X^{\text{SpanMSK}}$ = “The Steelers enjoy a large, widespread [MSK] [MSK] [MSK] Nation”, the masked language model ($\Theta$) predicts the span $x_{t:t+3}$ = “fanbase nicknamed Steeler”.
• Span sampling:
• Sample a span length (# of tokens) from a geometric distribution
• Randomly sample a starting point of the span to mask
• Repeat until the total mask fraction (15%) is reached (see the sketch after Figure 4)
Figure 4: Geometric distribution used to
sample span length in SpanBERT [5]
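A minimal sketch of this span sampling procedure, assuming a geometric distribution with $p = 0.2$ capped at 10 tokens (as described for SpanBERT) and a 15% mask budget; the tokenization is plain whitespace for illustration.

```python
import random

# Sample contiguous spans to mask until roughly 15% of tokens are covered.
random.seed(0)

def sample_span_length(p=0.2, max_len=10):
    """Geometric span length: P(len = n) = (1 - p)^(n - 1) * p, capped at max_len."""
    length = 1
    while random.random() > p and length < max_len:
        length += 1
    return length

tokens = ("The Steelers enjoy a large , widespread fanbase "
          "nicknamed Steeler Nation").split()
budget = round(0.15 * len(tokens))

masked = set()
while len(masked) < budget:
    span_len = sample_span_length()
    start = random.randrange(len(tokens))
    masked.update(range(start, min(start + span_len, len(tokens))))

x_span_msk = ["[MSK]" if i in masked else tok for i, tok in enumerate(tokens)]
print(" ".join(x_span_msk))
```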
Benefits:
• Slightly higher granularity (tokens → phrases), thus harder and arguably more semantic
• Aligns well with some downstream applications, e.g., SQuAD
More Finessed Denoising Task: Salient Span Masking
Salient Span Mask (SSM): Masking out spans corresponding to entities and attributes (salient)
First, use a fine-tuned BERT tagger to identify named entities, and rules to identify dates (the salient spans)
Then sample the spans to mask from these salient spans
Benefits:
• A lightweight way of introducing knowledge
• Directly targeting knowledge-intensive tasks, e.g., dates
Example: given $X^{\text{SSM}}$ = “The Steelers enjoy a large, widespread fanbase nicknamed [MSK] [MSK]”, the masked language model ($\Theta$) predicts the salient span $x_{t:t+2}$ = “Steeler Nation”.
[6] Guu, Kelvin, et al. "Retrieval augmented language model pre-training."
ICML, 2020
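A minimal sketch of salient span masking, assuming the salient spans have already been produced by the entity tagger and date rules above; the spans below are hand-written stand-ins for that tagger's output.

```python
import random

# The salient spans below are hand-written stand-ins for the tagger's output.
random.seed(0)
tokens = ("The Steelers enjoy a large , widespread fanbase "
          "nicknamed Steeler Nation").split()
salient_spans = [(1, 2), (9, 11)]   # "Steelers", "Steeler Nation" (half-open)

start, end = random.choice(salient_spans)   # pick one salient span to mask
x_ssm = ["[MSK]" if start <= i < end else tok for i, tok in enumerate(tokens)]
print(" ".join(x_ssm))
```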
Recap: Autoregressive LM and Masked LM
Aspect | Autoregressive LM | Masked LM
Neural Architecture | More suited for decoders | Encoders and decoders
Training Density | All token positions | 15% masked positions
Converging Speed/Stability | Fast and stable | Slower and less stable
Task Fit | Generation | Representation
Notable Models | GPT-* | BERT
Table 8: Recap of Autoregressive LM and Masked LM
Combination of Auto-Regressive and Masked LM
Various efforts to combine the benefits of Auto-Regressive LM and Masked LM
• One model for both generation and representation
• Better training effectiveness from multi-task learning?
Notable examples:
• UniLM: Dong, Li, et al. "Unified language model pre-training for natural language understanding and generation."
NeurIPS 2019.
• XL-NET: Yang et al. “XL-NET: Generalized autoregressive pretraining for language understanding." NeurIPS 2019.
Transformer Encoder-Decoders
Much of the difference between auto-regressive LM and Masked LM also resides in the Transformer architecture:
• Encoder: bi-directional representation power
• Decoder: natural generation
Transformer encoder-decoders enjoy the benefits of both
• Flexible for various types of denoising tasks
• Support different downstream applications with either side, or both together
Encoder input: <s> A B C D E F G </s> → Transformer Encoder
Decoder input: <s> H I J K L M N O → Transformer Decoder
Decoder target: H I J K L M N O </s>
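A minimal sketch of encoder-decoder training with teacher forcing, assuming toy random token ids and a small nn.Transformer as a stand-in; this is not the exact T5/BART configuration, which uses learned tokenizers and far larger networks.

```python
import torch
import torch.nn as nn

# Toy sizes; random token ids stand in for the tokenized input/output pair.
vocab_size, d_model = 50, 32
embed = nn.Embedding(vocab_size, d_model)
model = nn.Transformer(d_model=d_model, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, dim_feedforward=64,
                       batch_first=True)
lm_head = nn.Linear(d_model, vocab_size)

src_ids = torch.randint(0, vocab_size, (1, 9))    # encoder input: <s> A..G </s>
tgt_ids = torch.randint(0, vocab_size, (1, 10))   # full output:   <s> H..O </s>

# Teacher forcing: decoder input is the target shifted right, the loss is on the
# target shifted left, with a causal mask so each position sees only its past.
dec_in, dec_target = tgt_ids[:, :-1], tgt_ids[:, 1:]
causal_mask = model.generate_square_subsequent_mask(dec_in.size(1))

out = model(embed(src_ids), embed(dec_in), tgt_mask=causal_mask)
loss = nn.functional.cross_entropy(
    lm_head(out).reshape(-1, vocab_size), dec_target.reshape(-1))
print(float(loss))
```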
T5: Text-to-Text Transfer Transformers
Encoder-Decoder Transformer pretrained with language modeling tasks
• The flexibility allowed T5 to explore many different denoising tasks
[7] Raffel, Colin, et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." JMLR. 2020.
Table 9: Pretraining Tasks Explored in T5 [7].
T5 Pretraining Task Studies
Use of T5: fine-tuned so that
• The encoder takes the task input
• The decoder generates the label word, e.g., “Entailment” for MNLI
Denoising Task GLUE AVG SQuAD
Auto-Regressive LM 80.7 78.0
De-shuffling 73.2 67.6
Masked-LM, Reconstruct All 83.0 80.7
Replace Corrupted Spans 83.3 80.9
Drop Corrupted Tokens 84.4 80.5
Table 9: Pretraining Tasks Results with T5 base [7].
[7] Raffel, Colin, et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." JMLR. 2020.
• Different variations of the Masked-LM style denoising task perform similarly
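To make the “Replace Corrupted Spans” objective concrete, below is a minimal sketch of the text-to-text preprocessing, assuming whitespace tokens and T5-style <extra_id_N> sentinel markers (using the example sentence from the T5 paper); the real pipeline samples the spans randomly and operates on SentencePiece ids.

```python
# Build the (source, target) pair for "replace corrupted spans".
def corrupt_spans(tokens, spans):
    """spans: list of (start, end) half-open token intervals to corrupt."""
    source, target = [], []
    last = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        source += tokens[last:start] + [sentinel]   # replace the span with a sentinel
        target += [sentinel] + tokens[start:end]    # target spells out the span
        last = end
    source += tokens[last:]
    target += [f"<extra_id_{len(spans)}>"]          # final sentinel ends the target
    return " ".join(source), " ".join(target)

tokens = "Thank you for inviting me to your party last week".split()
src, tgt = corrupt_spans(tokens, [(2, 3), (6, 8)])
print(src)  # Thank you <extra_id_0> inviting me to <extra_id_1> last week
print(tgt)  # <extra_id_0> for <extra_id_1> your party <extra_id_2>
```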
BART Pretraining Tasks
Various denoising tasks explored with BART’s encoder-decoder
• Both sentence level and token level
• The flexible architecture enables reconstruction from various types of noise
[8] Lewis, Mike, et al. "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language
Generation, Translation, and Comprehension." ACL. 2020.
Figure 5: Denoising Tasks Explored in BART [8]
BART Pretraining Task Studies
Use of BART:
• Representation-style tasks: feed the same input to both encoder and decoder, use the decoder representations
• Generation: use the decoder
MNLI (Acc) SQuAD (F1)
Document Rotation 75.3 77.2
Sentence Shuffling 81.5 85.4
Token Masking 84.1 90.4
Token Deletion 84.1 90.4
Text Infilling 84.0 90.8
Text Infilling + Sentence Shuffling 83.8 90.8
Table 10: Pretraining Tasks Results with BART base [8].
• Different variations of the Masked-LM style denoising task perform similarly
Pretraining Tasks: Summary
Classic Auto-Regressive LM and BERT’s Masked LM are very effective
• A solid foundation to scale up
Early explorations of variant language modeling tasks did not obtain much general improvement
• Application-specific gains are observed more often
• All take the form of (rule-based random noise + reconstruction target)
Sequence-level tasks did not show much benefit on tasks like GLUE and SQuAD
• It is hard to get strong “semantics”, “knowledge”, or “intelligence” out of some sequence-level tasks
TL;DR for base-scale LMs:
• Generation → Auto-Regressive LM
• Representation → Masked LM
References: Pretraining Objectives
• [Pretraining] Dai, Andrew M., and Quoc V. Le. "Semi-supervised sequence learning." Advances in neural
information processing systems 28 (2015).
• [ELMO] Peters, Matthew E., et al. "Deep contextualized word representations." Proceedings of NAACL-HLT 2018.
• [GPT] Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).
• [BERT] Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." In
Proceedings of NAACL-HLT, pp. 4171-4186. 2019.
• [XL-NET] Yang, Zhilin, et al. "XLNet: Generalized autoregressive pretraining for language understanding."
Advances in neural information processing systems 32 (2019).
• [SpanBERT] Joshi, Mandar, et al. "SpanBERT: Improving pre-training by representing and predicting spans." Transactions of the Association for Computational Linguistics 8 (2020): 64-77.
• [REALM] Guu, Kelvin, et al. "Retrieval augmented language model pre-training." International conference on
machine learning. PMLR, 2020.
• [BART] Lewis, Mike, et al. “BART: Denoising sequence-to-sequence pre-training for natural language generation,
translation, and comprehension." arXiv preprint arXiv:1910.13461 (2019).
• [T5] Raffel, Colin, et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." The
Journal of Machine Learning Research 21.1 (2020): 5485-5551.