236781 Deep Learning
Lecture 9: Transformers for Vision Tasks
Dr. Chaim Baskin
Transformers for image modeling
Transformers for image modeling: Outline
• Architectures
• Self-supervised learning
Previous approaches
1. On pixels, but locally or factorized
Usually replaces the 3x3 convolutions in a ResNet:
Image credit: Stand-Alone Self-Attention in Vision Models by Ramachandran et al.
Image credit: Local Relation Networks for Image Recognition by Hu et al.
Examples:
Non-local Neural Networks (Wang et al., 2017)
SASANet (Stand-Alone Self-Attention in Vision Models)
HaloNet (Scaling Local Self-Attention for Parameter Efficient...)
LR-Net (Local Relation Networks for Image Recognition)
SANet (Exploring Self-attention for Image Recognition)
...
Results:
Usually underwhelming, nothing to write home about
Do not justify the increased complexity
Do not justify the slowdown relative to convolutions
Many prior works attempted to introduce self-attention at the pixel level.
For a 224x224 image, that is a sequence length of roughly 50k tokens, far too long for full self-attention (see the back-of-the-envelope sketch below).
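A quick back-of-the-envelope comparison (plain Python; the 224x224 resolution and 14x14-pixel patches are taken from the slides, the rest is arithmetic):

```python
# Back-of-the-envelope cost of full self-attention at pixel vs. patch level.
# Assumes a 224x224 input and 14x14-pixel patches, as on the slides.

def attention_entries(seq_len: int) -> int:
    """Entries in one attention matrix; scores and memory scale as O(n^2)."""
    return seq_len * seq_len

pixel_tokens = 224 * 224            # 50,176 tokens if every pixel is a token
patch_tokens = (224 // 14) ** 2     # 256 tokens with 14x14-pixel patches

print(f"pixels : {pixel_tokens:,} tokens -> {attention_entries(pixel_tokens):,} scores per head, per layer")
print(f"patches: {patch_tokens:,} tokens -> {attention_entries(patch_tokens):,} scores per head, per layer")
# pixels : 50,176 tokens -> 2,517,630,976 scores per head, per layer
# patches: 256 tokens -> 65,536 scores per head, per layer
```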
Previous approaches
2. Globally, after/inside a full-blown CNN, or even a full detector/segmenter!
Cons:
The result is a highly complex, often multi-stage-trained architecture.
The transformer does not see raw pixels, i.e. it cannot "learn to fix" the (often frozen!) CNN's mistakes.
Examples:
DETR (Carion, Massa et al., 2020)
Visual Transformers (Wu et al., 2020)
UNITER (Chen, Li, Yu et al., 2019)
ViLBERT (Lu et al., 2019)
VisualBERT (Li et al., 2019)
...
Image credit: UNITER: UNiversal Image-TExt Representation Learning by Chen et al.
Image credit: Visual Transformers: Token-based Image Representation and Processing for Computer Vision by Wu et al.
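To make the "transformer on top of a full CNN" pattern concrete, here is a minimal sketch in the spirit of DETR as summarized in the editor's notes at the end: CNN backbone, flatten + positional encoding, transformer encoder-decoder with learned object queries, shared class/box heads. All layer sizes, the fixed 224x224 input, and the learned positional embedding are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision

class DETRSketch(nn.Module):
    """Minimal sketch of a DETR-style detector: CNN backbone -> flatten +
    positional encoding -> transformer encoder-decoder with learned object
    queries -> class / box heads. Hyperparameters are illustrative only."""

    def __init__(self, num_classes=91, num_queries=100, d_model=256):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # keep conv features only
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)             # 2048 -> d_model
        self.transformer = nn.Transformer(d_model, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6,
                                          batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)           # learned object queries
        self.pos_embed = nn.Parameter(torch.randn(1, 7 * 7, d_model) * 0.02)  # assumes 224x224 input
        self.class_head = nn.Linear(d_model, num_classes + 1)           # +1 for the "no object" class
        self.box_head = nn.Linear(d_model, 4)                           # (cx, cy, w, h), normalized

    def forward(self, images):                       # images: (B, 3, 224, 224)
        feat = self.proj(self.backbone(images))      # (B, d_model, 7, 7)
        tokens = feat.flatten(2).transpose(1, 2)     # (B, 49, d_model)
        tokens = tokens + self.pos_embed             # add positional encoding
        queries = self.query_embed.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.transformer(tokens, queries)       # decoder output: (B, num_queries, d_model)
        return self.class_head(hs), self.box_head(hs).sigmoid()
```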
Vision transformer (ViT) – Google
• Split the image into patches and feed the linearly projected patches into a standard transformer encoder
• With patches of 14x14 pixels, you need 16x16 = 256 patches to represent a 224x224 image
A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021
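A minimal sketch of the tokenization just described: patchify with a strided convolution (equivalent to linearly projecting flattened patches), prepend a learnable [CLS] token, add position embeddings, and run a standard transformer encoder. The 14x14 patch size follows the slide; depth, width, and head count are illustrative ViT-Base-like assumptions.

```python
import torch
import torch.nn as nn

class ViTSketch(nn.Module):
    """Minimal ViT-style classifier. Patch size 14 on 224x224 images gives
    16x16 = 256 patch tokens, matching the slide; depth/width are illustrative."""

    def __init__(self, image_size=224, patch_size=14, num_classes=1000,
                 dim=768, depth=12, heads=12):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2                   # 256
        # "Linearly project flattened patches" == a conv with stride = patch size.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                       # x: (B, 3, 224, 224)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2) # (B, 256, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)          # (B, 1, dim)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])                          # classify from the [CLS] token

# model = ViTSketch(); logits = model(torch.randn(2, 3, 224, 224))  # -> (2, 1000)
```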
Vision transformer (ViT)
• Trained in a supervised fashion, fine-tuned on ImageNet
BiT: Big Transfer (ResNet)
ViT: Vision Transformer (Base/Large/Huge; patch size of 14x14, 16x16, or 32x32)
Internal Google dataset (not public)
A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021
Vision transformer (ViT)
• Transformers are claimed to be more computationally efficient to train than CNNs or hybrid architectures
A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021
Hierarchical transformer: Swin
• Self-attention is computed within local windows that are shifted between consecutive blocks; patch merging produces a hierarchical (multi-scale) feature map
Z. Liu et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV 2021
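A minimal sketch of Swin's windowed attention: partition the feature map into non-overlapping windows, attend only within each window, and cyclically shift the map in every second block so information flows across window boundaries. The attention mask that hides wrapped-around pixels after the shift and the patch-merging downsampling between stages are omitted; all sizes are illustrative.

```python
import torch
import torch.nn as nn

def window_partition(x, ws):
    """(B, H, W, C) -> (B * num_windows, ws*ws, C): non-overlapping ws x ws windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(windows, ws, B, H, W):
    """Inverse of window_partition."""
    x = windows.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class SwinBlockSketch(nn.Module):
    """One (shifted-)window attention block; the mask that hides pixels wrapped
    around by the cyclic shift is omitted here for brevity."""

    def __init__(self, dim=96, heads=3, ws=7, shift=0):
        super().__init__()
        self.ws, self.shift = ws, shift
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                          # x: (B, H, W, C), H and W divisible by ws
        B, H, W, C = x.shape
        shortcut = x
        x = self.norm(x)
        if self.shift:                             # shifted windows in every second block
            x = torch.roll(x, shifts=(-self.shift, -self.shift), dims=(1, 2))
        win = window_partition(x, self.ws)         # attention only *within* each window
        win, _ = self.attn(win, win, win)
        x = window_reverse(win, self.ws, B, H, W)
        if self.shift:
            x = torch.roll(x, shifts=(self.shift, self.shift), dims=(1, 2))
        return shortcut + x                        # residual connection

# blocks = nn.Sequential(SwinBlockSketch(shift=0), SwinBlockSketch(shift=3))
# out = blocks(torch.randn(1, 56, 56, 96))        # -> (1, 56, 56, 96)
```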
Swin results
COCO detection and segmentation
Beyond transformers?
I. Tolstikhin et al. MLP-Mixer: An all-MLP Architecture for Vision. NeurIPS 2021
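MLP-Mixer drops attention entirely: each block applies one MLP across the token (patch) dimension and one across the channel dimension. A sketch of a single Mixer block, with illustrative sizes:

```python
import torch
import torch.nn as nn

class MixerBlockSketch(nn.Module):
    """One MLP-Mixer block: token-mixing MLP across patches, then
    channel-mixing MLP across features. All sizes are illustrative."""

    def __init__(self, num_tokens=196, dim=512, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(            # mixes information *across* patches
            nn.Linear(num_tokens, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(          # mixes information *across* channels
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):                          # x: (B, num_tokens, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x

# out = MixerBlockSketch()(torch.randn(2, 196, 512))   # -> (2, 196, 512)
```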
Hybrid of CNNs and transformers?
T. Xiao et al. Early convolutions help transformers see better. NeurIPS 2021
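Xiao et al. replace ViT's single large-stride "patchify" convolution with a small stack of stride-2 3x3 convolutions (see the editor's note for this slide). A sketch of such a convolutional stem; the channel widths are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

def conv_stem_sketch(out_dim=768):
    """Convolutional stem: stacked stride-2 3x3 convs that downsample 224 -> 14,
    followed by a 1x1 projection to the transformer width. Channel widths are
    illustrative assumptions."""
    widths = [64, 128, 256, 512]
    layers, in_ch = [], 3
    for w in widths:                                   # four stride-2 convs: /16 total
        layers += [nn.Conv2d(in_ch, w, 3, stride=2, padding=1),
                   nn.BatchNorm2d(w), nn.ReLU(inplace=True)]
        in_ch = w
    layers.append(nn.Conv2d(in_ch, out_dim, 1))        # project to the token dimension
    return nn.Sequential(*layers)

# Tokens for the transformer: the same 14x14 grid a stride-16 patchify stem would give.
stem = conv_stem_sketch()
tokens = stem(torch.randn(1, 3, 224, 224)).flatten(2).transpose(1, 2)   # -> (1, 196, 768)
```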
Or completely back to CNNs?
• ConvNeXt gradually "modernizes" a standard ResNet with design choices borrowed from transformers (patchify stem, depthwise 7x7 convolutions, inverted bottlenecks, LayerNorm, GELU, fewer activations and norms) and matches or exceeds Swin on ImageNet classification and COCO detection
Z. Liu et al. A ConvNet for the 2020s. CVPR 2022
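For reference, a minimal sketch of a ConvNeXt-style block (7x7 depthwise convolution, LayerNorm, inverted-bottleneck MLP with GELU); LayerScale and stochastic depth from the paper are omitted, and the width is illustrative:

```python
import torch
import torch.nn as nn

class ConvNeXtBlockSketch(nn.Module):
    """ConvNeXt-style block: 7x7 depthwise conv -> LayerNorm -> inverted
    bottleneck (linear layers acting as 1x1 convs) with GELU."""

    def __init__(self, dim=96):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)                # applied channels-last
        self.pwconv1 = nn.Linear(dim, 4 * dim)       # expand
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)       # project back

    def forward(self, x):                            # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                    # (B, H, W, C) for LayerNorm / Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        return shortcut + x.permute(0, 3, 1, 2)      # residual, back to (B, C, H, W)

# out = ConvNeXtBlockSketch()(torch.randn(1, 96, 56, 56))   # -> (1, 96, 56, 56)
```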
Outline
• Architectures
• Self-supervised learning
DINO: Self-distillation with no labels
• A student network is trained to match the output of a momentum (EMA) teacher on different augmented crops of the same image; the teacher's output is centered and sharpened to avoid collapse. No labels are used, and the learned ViT attention maps exhibit emergent object-segmentation properties
M. Caron et al. Emerging Properties in Self-Supervised Vision Transformers. ICCV 2021
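A minimal sketch of the self-distillation objective, assuming a student network and a momentum (EMA) teacher that see different augmented views of the same image; the multi-crop scheme and the running update of the center are simplified, and the temperatures are common DINO settings rather than values from these slides.

```python
import torch
import torch.nn.functional as F

def dino_loss_sketch(student_logits, teacher_logits, center,
                     tau_s=0.1, tau_t=0.04):
    """Cross-entropy between a sharpened, centered teacher distribution and the
    student distribution (no gradient flows through the teacher)."""
    t = F.softmax((teacher_logits - center) / tau_t, dim=-1).detach()  # center + sharpen
    log_s = F.log_softmax(student_logits / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Teacher weights are an exponential moving average of the student's."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

# Usage sketch: the student sees view A, the teacher sees view B (and vice versa);
# loss = dino_loss_sketch(student(view_a), teacher(view_b), center)
```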
Masked autoencoders
• Mask a large random fraction of the image patches (75% in the paper); the encoder processes only the visible patches, and a lightweight decoder reconstructs the pixels of the masked ones. The pre-trained encoder transfers well to downstream tasks
K. He et al. Masked autoencoders are scalable vision learners. CVPR 2022
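A minimal sketch of MAE-style random masking: keep a small random subset of patch tokens for the encoder and let a lightweight decoder reconstruct the masked patches' pixels. The 75% masking ratio follows the paper; the shapes are illustrative.

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Keep a random (1 - mask_ratio) subset of patch tokens, as in MAE.
    Returns the visible tokens and the shuffled indices needed to restore order."""
    B, N, D = tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                          # random score per token
    ids_shuffle = noise.argsort(dim=1)                # low score = keep
    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_shuffle

# patches: (B, 196, 768) -> the encoder only ever sees the (B, 49, 768) visible tokens;
# the decoder gets mask tokens re-inserted at the masked positions and regresses
# the original pixel values of the masked patches.
patches = torch.randn(2, 196, 768)
visible, ids = random_masking(patches)
print(visible.shape)                                  # torch.Size([2, 49, 768])
```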
Masked autoencoders: Results
K. He et al. Masked autoencoders are scalable vision learners. CVPR 2022
Application: Visual prompting
• Input-output example pairs and a new query are tiled into a single grid image with one cell left blank; an inpainting model trained with masked-image modeling fills in the blank cell, solving the task "by example" with no task-specific training
A. Bar et al. Visual prompting via image inpainting. NeurIPS 2022
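A minimal sketch of how such a visual prompt can be assembled, assuming the grid-inpainting formulation: tile an example input, its output, and the new query into one image, leave the last cell blank, and ask a masked-image/inpainting model to fill it in. The inpaint_model call below is a hypothetical placeholder, not the paper's API.

```python
import torch

def build_visual_prompt(example_in, example_out, query):
    """Assemble a 2x2 grid image: [example input | example output]
                                  [query         | blank (to inpaint)].
    All inputs are (3, H, W) tensors of the same size."""
    blank = torch.zeros_like(query)
    top = torch.cat([example_in, example_out], dim=2)    # concatenate along width
    bottom = torch.cat([query, blank], dim=2)
    return torch.cat([top, bottom], dim=1)               # concatenate along height

# prompt = build_visual_prompt(ex_in, ex_out, query)          # -> (3, 2H, 2W)
# result = inpaint_model(prompt, mask=bottom_right_quarter)   # hypothetical inpainting call
# The model's completion of the bottom-right cell is the predicted output for `query`.
```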

Editor's Notes

  • #6: DETR uses a conventional CNN backbone to learn a 2D representation of an input image. The model flattens it and supplements it with a positional encoding before passing it into a transformer encoder. A transformer decoder then takes as input a small fixed number of learned positional embeddings, which we call object queries, and additionally attends to the encoder output. We pass each output embedding of the decoder to a shared feed forward network (FFN) that predicts either a detection (class and bounding box) or a “no object” class.
  • #7: Abstract: In this work, we generalize a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood. By restricting the selfattention mechanism to attend to local neighborhoods we significantly increase the size of images the model can process in practice, despite maintaining significantly larger receptive fields per layer than typical convolutional neural networks. While conceptually simple, our generative models significantly outperform the current state of the art in image generation on ImageNet, improving the best published negative log-likelihood on ImageNet from 3.83 to 3.77.
  • #11: Abstract: While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train
  • #12: Patches of 14x14 pixels give 16x16 = 256 patches for a 224x224 image (sequence length = 256)
  • #13: Abstract: While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train
  • #19: Abstract: Vision transformer (ViT) models exhibit substandard optimizability. In particular, they are sensitive to the choice of optimizer (AdamW vs. SGD), optimizer hyperparameters, and training schedule length. In comparison, modern convolutional neural networks are easier to optimize. Why is this the case? In this work, we conjecture that the issue lies with the patchify stem of ViT models, which is implemented by a stride-p p×p convolution (p = 16 by default) applied to the input image. This large-kernel plus large-stride convolution runs counter to typical design choices of convolutional layers in neural networks. To test whether this atypical design choice causes an issue, we analyze the optimization behavior of ViT models with their original patchify stem versus a simple counterpart where we replace the ViT stem by a small number of stacked stride-two 3×3 convolutions. While the vast majority of computation in the two ViT designs is identical, we find that this small change in early visual processing results in markedly different training behavior in terms of the sensitivity to optimization settings as well as the final model accuracy. Using a convolutional stem in ViT dramatically increases optimization stability and also improves peak performance (by ∼1-2% top-1 accuracy on ImageNet-1k), while maintaining flops and runtime. The improvement can be observed across the wide spectrum of model complexities (from 1G to 36G flops) and dataset scales (from ImageNet-1k to ImageNet-21k). These findings lead us to recommend using a standard, lightweight convolutional stem for ViT models in this regime as a more robust architectural choice compared to the original ViT model design.