1) Transformer-based approaches for visual representation learning such as Vision Transformers (ViTs) have shown performance competitive with or superior to CNNs on image classification benchmarks.
2) A pure Transformer architecture pre-trained on a very large dataset such as JFT-300M can outperform state-of-the-art CNNs without using any convolutional layers.
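The core idea behind a pure-Transformer vision model is to split the image into fixed-size patches, linearly project each patch into an embedding, and prepend a learnable [CLS] token before feeding the sequence to a standard Transformer encoder. A minimal NumPy sketch of the patch-embedding step is below; the 224x224 input size, 16x16 patch size, and random projection weights are illustrative assumptions, not values from the text.

```python
import numpy as np

def patchify(image, patch_size):
    # Split an H x W x C image into non-overlapping, flattened patches.
    H, W, C = image.shape
    p = patch_size
    patches = image.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)
    return patches

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))   # illustrative input size
patches = patchify(image, 16)                # (196, 768): 14x14 patches
proj = rng.standard_normal((768, 768)) * 0.02  # stand-in for learned weights
tokens = patches @ proj                      # linear patch embedding
cls = np.zeros((1, 768))                     # learnable [CLS] token (zeros here)
seq = np.concatenate([cls, tokens])          # (197, 768) token sequence
print(seq.shape)                             # -> (197, 768)
```

The resulting sequence of 197 tokens is what the Transformer encoder consumes; no convolution is involved at any stage.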
3) Self-supervised pre-training methods like DINO, which leverage self-distillation, have been shown to match the performance of supervised ViT pre-training using only unlabeled ImageNet data.
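DINO trains a student network to match the output distribution of a momentum (EMA) teacher over different augmented views, with no labels. The teacher's outputs are centered and sharpened with a low temperature to avoid collapse. The sketch below is a minimal NumPy illustration of this objective; the batch size, output dimension, temperatures, and momentum value are assumed for illustration and are not taken from the text.

```python
import numpy as np

def softmax(x, temp):
    z = (x - x.max(axis=-1, keepdims=True)) / temp
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_logits, teacher_logits, center,
              t_student=0.1, t_teacher=0.04):
    # Teacher distribution is centered and sharpened (low temperature);
    # the student is trained with cross-entropy against it. In the real
    # method, gradients do not flow through the teacher.
    p_t = softmax(teacher_logits - center, t_teacher)
    log_p_s = np.log(softmax(student_logits, t_student) + 1e-12)
    return -(p_t * log_p_s).sum(axis=-1).mean()

def ema_update(teacher_w, student_w, m=0.996):
    # Teacher weights track the student via an exponential moving average.
    return m * teacher_w + (1 - m) * student_w

rng = np.random.default_rng(0)
s = rng.standard_normal((8, 64))   # student outputs for a batch of views
t = rng.standard_normal((8, 64))   # teacher outputs for matching views
center = t.mean(axis=0)            # running center, computed per batch here
loss = dino_loss(s, t, center)
print(np.isfinite(loss))
```

In practice the centering term is an exponential moving average over batches and the teacher is a full copy of the student updated by `ema_update` after each step; this sketch collapses those details into single-batch form.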