The document summarizes a talk on multi-modal self-supervised learning from videos, which naturally supply several modalities (vision, audio, and language) that can supervise one another without manual labels. The talk presents two models: 1) the Multi-Modal Versatile (MMV) network, which accepts any of these modalities as input and embeds them in joint spaces that respect the specificity of each modality while still allowing them to be compared directly; and 2) BraVe, which learns representations by regressing, from a narrow view of a video, a broad representation of the whole video, so that the two views can use different augmentations and modalities. Both models achieve state-of-the-art results on downstream tasks, showing that videos provide rich self-supervision and that exploiting additional context improves representation learning.
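To make the MMV design concrete, below is a minimal PyTorch sketch of its "fine and coarse" embedding idea: vision and audio are compared in a fine-grained space, while text is compared in a coarser space reached by a further projection of vision. The module names and dimensions here are hypothetical, and the real model uses video, audio, and text backbones with NCE-style contrastive losses over large batches; this is only a sketch of the space layout.

```python
# Sketch of an MMV-style "fine and coarse" multi-modal embedding.
# Hypothetical module names; real MMV uses full video/audio/text backbones.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FineAndCoarseSpaces(nn.Module):
    def __init__(self, d_vision=512, d_audio=512, d_text=300, d_fine=256, d_coarse=128):
        super().__init__()
        # Vision and audio are projected into a shared fine-grained space.
        self.vision_to_fine = nn.Linear(d_vision, d_fine)
        self.audio_to_fine = nn.Linear(d_audio, d_fine)
        # Text is compared in a lower-capacity coarse space; vision reaches it
        # via a further projection from the fine space, so the fine space keeps
        # detail that alignment with coarse text would otherwise wash out.
        self.fine_to_coarse = nn.Linear(d_fine, d_coarse)
        self.text_to_coarse = nn.Linear(d_text, d_coarse)

    def forward(self, v_feat, a_feat, t_feat):
        v_fine = F.normalize(self.vision_to_fine(v_feat), dim=-1)
        a_fine = F.normalize(self.audio_to_fine(a_feat), dim=-1)
        v_coarse = F.normalize(self.fine_to_coarse(v_fine), dim=-1)
        t_coarse = F.normalize(self.text_to_coarse(t_feat), dim=-1)
        return v_fine, a_fine, v_coarse, t_coarse

def info_nce(q, k, temperature=0.07):
    """Symmetric InfoNCE: matching pairs on the batch diagonal are positives."""
    logits = q @ k.t() / temperature
    targets = torch.arange(q.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage with random stand-in backbone features for a batch of 8 videos.
model = FineAndCoarseSpaces()
v, a, t = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 300)
v_fine, a_fine, v_coarse, t_coarse = model(v, a, t)
loss = info_nce(v_fine, a_fine) + info_nce(v_coarse, t_coarse)
```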
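BraVe's objective can be sketched similarly. The snippet below shows a simplified, one-directional version, again assuming PyTorch: a predictor maps the narrow-view embedding onto a stop-gradient target computed from the broad view. The actual method symmetrizes this loss (the broad view also predicts the narrow one, which is what trains the broad branch) and uses real video and audio backbones rather than the toy MLPs used here.

```python
# Sketch of a BraVe-style objective: from a narrow view (short clip),
# regress the representation of a broad view (longer temporal context).
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(d_in, d_hidden, d_out):
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_out))

class NarrowToBroad(nn.Module):
    def __init__(self, d_feat=512, d_emb=128):
        super().__init__()
        self.narrow_encoder = mlp(d_feat, 512, d_emb)  # processes the short clip
        self.broad_encoder = mlp(d_feat, 512, d_emb)   # processes the broad view
        self.predictor = mlp(d_emb, 512, d_emb)        # maps narrow emb -> broad emb

    def loss(self, narrow_feat, broad_feat):
        pred = F.normalize(self.predictor(self.narrow_encoder(narrow_feat)), dim=-1)
        with torch.no_grad():  # stop-gradient: the target side is not optimized here;
            # in the full method a symmetric term trains the broad branch.
            target = F.normalize(self.broad_encoder(broad_feat), dim=-1)
        # Regression in embedding space: minimize cosine distance to the target.
        return (2 - 2 * (pred * target).sum(dim=-1)).mean()

# Usage with random stand-in features for a batch of 8 videos.
model = NarrowToBroad()
narrow = torch.randn(8, 512)  # features of a short, heavily augmented clip
broad = torch.randn(8, 512)   # features of a broad view of the same video
loss = model.loss(narrow, broad)
loss.backward()
```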