The document discusses video-language pre-training with transformer models. It introduces transfer learning, in which models are first pre-trained on large datasets and then fine-tuned on smaller downstream tasks. Transformer networks are well suited to this paradigm because they scale readily in depth and have consistently outperformed earlier architectures on sequence-modeling tasks. Pre-training relies on proxy tasks such as masked language modeling, after which the models are fine-tuned on downstream video-language tasks such as caption generation, question answering, and retrieval. Several video-language transformer models are then presented; they process the two modalities with either a single-stream architecture, where video and text tokens are fed jointly to one shared encoder, or a multi-stream architecture, where each modality is encoded separately before being fused.
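To make the single-stream design and the masked-language-modeling proxy task concrete, the following is a minimal PyTorch sketch. The class name `SingleStreamVLTransformer`, the dimensions, and the toy data are illustrative assumptions for this sketch, not the API of any particular model described in the document.

```python
# Minimal sketch: single-stream video-language pre-training with a
# masked-language-modeling (MLM) proxy task. All names, sizes, and the
# toy data below are illustrative assumptions.
import torch
import torch.nn as nn

class SingleStreamVLTransformer(nn.Module):
    """Concatenates projected video-frame features and text token embeddings
    into one sequence and feeds them to a shared transformer encoder
    (the single-stream design)."""
    def __init__(self, vocab_size=1000, video_feat_dim=512, d_model=256,
                 n_heads=4, n_layers=2, max_len=128):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.video_proj = nn.Linear(video_feat_dim, d_model)  # map frame features to model dim
        self.pos_embed = nn.Embedding(max_len, d_model)
        self.type_embed = nn.Embedding(2, d_model)            # 0 = video token, 1 = text token
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mlm_head = nn.Linear(d_model, vocab_size)        # predicts masked text tokens

    def forward(self, video_feats, text_ids):
        B, Tv, _ = video_feats.shape
        Tt = text_ids.shape[1]
        v = self.video_proj(video_feats) + self.type_embed(torch.zeros(B, Tv, dtype=torch.long))
        t = self.text_embed(text_ids) + self.type_embed(torch.ones(B, Tt, dtype=torch.long))
        x = torch.cat([v, t], dim=1)                          # one joint sequence
        pos = torch.arange(Tv + Tt).unsqueeze(0).expand(B, -1)
        h = self.encoder(x + self.pos_embed(pos))
        return self.mlm_head(h[:, Tv:])                       # logits for text positions only

# Toy pre-training step: mask ~15% of the caption tokens and train the model
# to recover them from the joint video-text context.
MASK_ID, VOCAB = 3, 1000
model = SingleStreamVLTransformer(vocab_size=VOCAB)
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

video = torch.randn(2, 8, 512)                # 2 clips, 8 frame features each
text = torch.randint(4, VOCAB, (2, 16))       # 2 captions, 16 tokens each
mask = torch.rand(text.shape) < 0.15
inputs = text.masked_fill(mask, MASK_ID)      # replace masked positions with [MASK]
labels = text.masked_fill(~mask, -100)        # loss computed only on masked positions

logits = model(video, inputs)
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), labels.reshape(-1),
                                   ignore_index=-100)
optim.zero_grad()
loss.backward()
optim.step()
```

A multi-stream variant would replace the single shared encoder with one encoder per modality and a later fusion module (e.g. cross-attention); the MLM objective and fine-tuning procedure stay the same.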