This document summarizes a lecture on the pretraining methods behind GPT-2 and BERT. It discusses how GPT-2 drops the encoder half of the original Transformer encoder-decoder architecture, using only the decoder, while BERT drops the decoder, using only the encoder, and pretrains on two tasks: masked language modeling (predicting masked-out words) and next sentence prediction (judging whether one sentence actually follows another). Fine-tuning is then used to adapt the pretrained BERT model to downstream tasks such as sentiment analysis and question answering. The document also reviews how GPT-2 scales up model size and relies on techniques like byte-pair encoding and masked self-attention.
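As a rough illustration of the masked self-attention mentioned above, the following is a minimal sketch of causal (decoder-style) attention: each position may only attend to itself and earlier tokens, which is what lets a decoder-only model like GPT-2 be trained to predict the next word. The function name, shapes, and projections here are illustrative assumptions, not code from GPT-2 itself.

```python
import numpy as np

def masked_self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_head) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # project inputs to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])       # scaled dot-product attention scores
    # Causal mask: position i must not attend to any position j > i.
    seq_len = x.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)       # block future positions before softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                            # weighted sum of value vectors

# Toy usage with random projections (d_model=8, d_head=4, 5 tokens).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out = masked_self_attention(x, *(rng.normal(size=(8, 4)) for _ in range(3)))
print(out.shape)  # (5, 4)
```

BERT's encoder-only blocks use the same attention computation but omit the causal mask, so every token can attend to the full sentence in both directions.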