1) Masked autoencoder (MAE) pre-training provides an effective, self-supervised way to pre-train vision models like ViT, analogous to masked language modeling in NLP (e.g., BERT).
2) MAE works by masking image patches at a high ratio (e.g., 75%), encoding only the visible patches with a ViT encoder, and reconstructing the pixels of the masked patches with a lightweight decoder (see the sketch after this list).
3) MAE outperforms contrastive learning methods such as MoCo v3 on downstream tasks under end-to-end (or even partial) fine-tuning, although its pure linear-probe accuracy trails contrastive pre-training slightly.
4) MAE also extends to video by masking 3D spatiotemporal patches; because video is temporally redundant, it works well at even higher masking ratios of around 90% (a video variant of the sketch follows below).
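
A minimal sketch of the masking / encode / decode flow described in point 2, assuming simple `nn.Linear` stand-ins for the ViT encoder and the lightweight decoder; the names `random_masking` and `mask_token` and all sizes are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

def random_masking(patches, mask_ratio=0.75):
    """Randomly hide `mask_ratio` of the patches; return the visible patches,
    a binary mask (1 = masked) in original order, and the un-shuffle indices."""
    B, N, D = patches.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                        # one random score per patch
    ids_shuffle = noise.argsort(dim=1)              # random permutation of patch indices
    ids_restore = ids_shuffle.argsort(dim=1)        # inverse permutation
    ids_keep = ids_shuffle[:, :len_keep]            # indices of the visible patches
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).repeat(1, 1, D))
    mask = torch.ones(B, N)                         # 1 = masked, 0 = visible
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)       # back to original patch order
    return visible, mask, ids_restore

B, N, D, E = 2, 196, 768, 512                       # 14x14 grid of 16x16x3 patches; toy widths
patches = torch.randn(B, N, D)                      # patchified image (random toy data)

encoder = nn.Linear(D, E)                           # stand-in for the ViT encoder
decoder = nn.Linear(E, D)                           # stand-in for the lightweight decoder
mask_token = nn.Parameter(torch.zeros(1, 1, E))     # learned placeholder for hidden patches

visible, mask, ids_restore = random_masking(patches, mask_ratio=0.75)
latent = encoder(visible)                           # only the ~25% visible patches are encoded

# The decoder sees the encoded visible patches plus mask tokens at the hidden
# positions, restored to the original order, and predicts pixels for every patch.
num_masked = N - latent.shape[1]
tokens = torch.cat([latent, mask_token.repeat(B, num_masked, 1)], dim=1)
tokens = torch.gather(tokens, 1, ids_restore.unsqueeze(-1).repeat(1, 1, E))
pred = decoder(tokens)

# The reconstruction loss is computed on the masked patches only.
loss = (((pred - patches) ** 2).mean(dim=-1) * mask).sum() / mask.sum()
print(loss.item())
```

Encoding only the visible 25% is what keeps pre-training cheap: the heavy encoder never sees mask tokens, and the small decoder handles the full token grid.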
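
Continuing the same sketch for point 4: the identical masking routine applies to video by treating 3D space-time cubes as tokens; only the token count and the mask ratio change (sizes here are again illustrative).

```python
# Video variant: tokens are 3D space-time cubes instead of 2D patches.
# With 8 x 14 x 14 = 1568 tokens and a 90% mask ratio, only ~156 tokens
# are ever seen by the encoder.
T_tok, H_tok, W_tok = 8, 14, 14                     # e.g. 16 frames grouped into 2-frame cubes
video_patches = torch.randn(B, T_tok * H_tok * W_tok, D)
visible, mask, ids_restore = random_masking(video_patches, mask_ratio=0.90)
print(visible.shape)                                # torch.Size([2, 156, 768])
```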