Transformers in Vision
From Zero to Hero!
Davide Coccomini (PhD Candidate) & Nicola Messina (PhD Student)
Italian National Research Council
What do you think when you hear the word «Transformer»?
The Transformer «today»
[Diagram: the full encoder-decoder Transformer. Nx encoder blocks (Multi-Head Attention + Add & Norm, Feed Forward + Add & Norm) read the input "The cat jumps the wall"; positional encodings are added to the input and output embeddings. The encoder output (the memory) provides V and K to the Mx decoder blocks, which, given the partial output "Il gatto", produce the next-token distribution through a final Linear + Softmax layer: Salta (90%) | Odia (9%) | Perché (1%).]
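For readers who want to poke at this architecture in code, here is a minimal sketch using PyTorch's built-in nn.Transformer; the vocabulary sizes, dimensions and token IDs are invented for illustration, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

# Toy encoder-decoder Transformer; all sizes are illustrative, not from the talk.
d_model, src_vocab, tgt_vocab = 512, 1000, 1000
src_embed = nn.Embedding(src_vocab, d_model)
tgt_embed = nn.Embedding(tgt_vocab, d_model)
model = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)
generator = nn.Linear(d_model, tgt_vocab)      # final Linear (+ Softmax at inference)

src = torch.randint(0, src_vocab, (1, 5))      # "The cat jumps the wall" as token IDs
tgt = torch.randint(0, tgt_vocab, (1, 2))      # "Il gatto" as token IDs
tgt_mask = model.generate_square_subsequent_mask(tgt.size(1))

dec_out = model(src_embed(src), tgt_embed(tgt), tgt_mask=tgt_mask)
next_token_probs = generator(dec_out[:, -1]).softmax(dim=-1)
print(next_token_probs.shape)                  # torch.Size([1, 1000]): Salta? Odia? Perché?
```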
Outline
Transformers: The beginnings
• Some history: from RNNs to Transformers
• Transformers’ attention and self-attention mechanisms
• The power of the Transformer Encoder
Transformers in Vision
• From text to images: Vision Transformers
• From images to videos
• The scale and data problem
• Convolutional Neural Networks and Vision Transformers
• Some interesting real-world applications
History
2017: Transformers introduced in NLP (text)
2020: Vision Transformers (images)
2021: Transformers for video understanding (videos)
Now: the Computer Vision revolution!
A step back: Recurrent Networks (RNNs)
[Diagram: a recurrent encoder-decoder for translation. The encoder (E) reads "The cat jumps the wall" one token at a time, producing hidden states h0…h4; the final sentence embedding is handed to the decoder (D), which starts from <s> and emits "Il gatto salta il muro" followed by <end> through hidden states h5…h9.]
Problems
1. We forget tokens that are too far in the past.
2. We must wait for the previous token before computing the next hidden state.
[Same recurrent encoder-decoder diagram as in the previous slide.]
Solving problem 1
"We forget tokens too far in the past"
Solution: add an attention mechanism.
[Diagram: at every decoding step, the encoder hidden states h0…h4 are combined into a context vector, a weighted sum whose weights are computed by attention; the decoder uses this context together with its own hidden state to produce the next token of "Il gatto salta …".]
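A tiny sketch of this context-vector idea (tensor sizes and the dot-product scoring are illustrative; real RNN attention variants differ in how the scores are computed):

```python
import torch

def attention_context(decoder_state, encoder_states):
    """decoder_state: (hidden,); encoder_states: (seq_len, hidden).
    Returns the context vector: a softmax-weighted sum of encoder states."""
    scores = encoder_states @ decoder_state        # one relevance score per source token
    weights = torch.softmax(scores, dim=0)         # attention weights, summing to 1
    return weights @ encoder_states                # (hidden,) context vector

h_enc = torch.randn(5, 256)   # h0..h4 for "The cat jumps the wall"
h_dec = torch.randn(256)      # current decoder hidden state
print(attention_context(h_dec, h_enc).shape)       # torch.Size([256])
```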
Solving problem 2
"We need to wait for the previous token to compute the next hidden state"
Solution: throw away the recurrent connections and keep only attention, as proposed in the 2017 paper "Attention Is All You Need".
[Diagram: the same encoder-decoder as before, but without recurrent links; the tokens can now be processed in parallel and interact only through attention.]
Full Transformer Architecture
[Diagram: the same encoder-decoder figure as "The Transformer «today»": Nx encoder blocks and Mx decoder blocks with Multi-Head Attention, Add & Norm, Feed Forward and positional encodings; the encoder memory provides V and K to the decoder, and a Linear + Softmax layer maps “The cat jumps the wall” plus “Il gatto” to “Salta” (90%) | “Odia” (9%) | “Perché” (1%).]
Transformer's Attention Mechanism
[Diagram: the target sequence ("Il gatto salta …") provides the Queries, while the source sequence ("The cat jumps the …") provides the Keys and Values; each token first goes through an FFN. Queries and Keys are compared with a dot product, the scores are normalised with a softmax, and the result re-expresses the target tokens "from the point of view" of the source sequence.]
Transformer's Attention Mechanism, from a different perspective
[Diagram: attention as a soft lookup table. The source sequence ("The cat jumps the … wall") builds a dictionary of Key/Value pairs; a Query coming from the target sequence soft-matches against all Keys, and the output (e.g. the "gatto" token) is a weighted average of the Value vectors in the source dictionary.]
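In code this is the (scaled) dot-product attention of the original paper; the sketch below uses made-up dimensions:

```python
import math
import torch

def attention(Q, K, V):
    """Q: (n_target, d); K, V: (n_source, d). Each output row is a weighted
    average of the Values, i.e. a target token re-expressed 'from the point
    of view' of the source sequence."""
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))   # dot product + scaling
    weights = torch.softmax(scores, dim=-1)                    # soft-matching against the keys
    return weights @ V

Q = torch.randn(3, 64)    # "Il gatto salta ..."
K = torch.randn(5, 64)    # "The cat jumps the wall"
V = torch.randn(5, 64)
print(attention(Q, K, V).shape)   # torch.Size([3, 64])
```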
Attention and Self-Attention
Self-Attention
• Source = Target
• Keys, Queries and Values are obtained from the same sentence
• Captures intra-sequence dependencies
• Example: in "I gave my dog Charlie some food", self-attention lets "gave" resolve Who? (I), To whom? (my dog Charlie) and What? (some food).
Attention (cross-attention)
• Source ≠ Target
• Queries come from one sequence and Keys/Values from the other (in the Transformer decoder, the target provides the Queries and the encoder memory provides the Keys and Values)
• Captures inter-sequence dependencies
• Example: aligning the target "Ho dato da mangiare al mio cane Charlie" ("I fed my dog Charlie") with the source "I gave my dog Charlie some food".
Multi-Head Self-Attention
Multiple instantiations of the attention mechanism run in parallel.
[Diagram: the input sequence ("The cat is running") is linearly projected into Queries, Keys and Values, each split into h slices (heads); every head computes its own attention (dot product, normalise, matmul with the Values), and the h outputs are concatenated and passed through a final dense layer.]
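A compact sketch of the figure above (head count and dimensions are arbitrary); in practice PyTorch's nn.MultiheadAttention provides the same functionality:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Illustrative multi-head self-attention: linear projections, h parallel
    attention heads, concatenation, and a final dense layer."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d_head = heads, dim // heads
        self.q_proj = nn.Linear(dim, dim)   # the LINEAR blocks in the slide
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)      # CONCAT + DENSE

    def forward(self, x):                   # x: (batch, seq_len, dim)
        B, N, D = x.shape
        def split(t):                       # slice into h heads
            return t.view(B, N, self.heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5   # DOT + NORMALIZE
        out = torch.softmax(scores, dim=-1) @ v                 # MATMUL with the values
        return self.out(out.transpose(1, 2).reshape(B, N, D))

x = torch.randn(1, 4, 512)                  # "The cat is running"
print(MultiHeadSelfAttention()(x).shape)    # torch.Size([1, 4, 512])
```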
Full Transformer Architecture (recap)
[Same encoder-decoder diagram, now annotated: the encoder uses Multi-Head Self-Attention, while the decoder also uses Multi-Head Attention over the encoder memory, i.e. the lookup table of Keys and Values built from the source sequence.]
Is there any problem with Transformers?
Self-Attention: the attention calculation is O(n²) in the sequence length, because every token of "I gave my dog Charlie some food" must attend to every other token.
The Power of the Transformer Encoder
• Many achievements using only the Encoder
• BERT (Devlin et al., 2018)
[Diagram: an Embedding Layer plus Positional Encoding feeds the sequence <CLS> I gave my dog Charlie some food <SEP> He ate it into a Transformer Encoder (N layers); the encoder output (the memory) is trained with Next Sentence Prediction {0, 1} on the <CLS> token and with Masked Language Modelling, predicting the masked word «ate».]
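As a quick illustration of what such a pretrained encoder can do, the snippet below runs masked-language-model inference through the Hugging Face transformers library (the checkpoint name and the sentence are just an example):

```python
from transformers import pipeline

# Masked Language Modelling with an encoder-only model (BERT).
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("I gave my dog Charlie some [MASK]."):
    print(f'{pred["token_str"]:>10}  {pred["score"]:.3f}')
```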
Transformers in Computer Vision
Can we use the self-attention mechanism on images?
Transformers in Computer Vision
• The transformer works with a set of tokens: what are the tokens in an image?
• Treating every pixel as a token does not scale: a 256px × 256px image has 65,536 pixels, so full self-attention would require roughly 4.3 billion pairwise calculations. Impossible!
Transformers in Computer Vision
• Tokens as the features from an object detector
[Diagram: an object detector proposes regions of the image; ROI Pooling over each region yields one feature vector per object, and these region features are the tokens.]
Vision Transformers (ViTs)
“An image is worth 16x16 words” | Dosovitskiy et al., 2020
[Diagram: a 256px × 256px image is split into 16px × 16px patches; each patch is flattened and passed through a linear projection, and the resulting vectors are the tokens.]
Vision Transformers (ViTs)
[Diagram: the patches (1…9) go through a Linear Projection of Flattened Patches, a class token (*) is prepended and position embeddings (0…9) are added; the sequence is processed by a Transformer Encoder, and an MLP Head on the class token outputs the CLASS.]
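A minimal ViT sketch following the figure (patch size 16 on a 256×256 image; every other hyper-parameter is invented for illustration):

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Illustrative Vision Transformer: patches -> linear projection ->
    [CLS] + position embeddings -> Transformer Encoder -> MLP head."""
    def __init__(self, img=256, patch=16, dim=384, depth=6, heads=6, num_classes=1000):
        super().__init__()
        n_patches = (img // patch) ** 2
        # A stride-`patch` convolution flattens and linearly projects the patches.
        self.to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)          # MLP head

    def forward(self, x):                                # x: (batch, 3, 256, 256)
        tokens = self.to_tokens(x).flatten(2).transpose(1, 2)   # (batch, 256, dim)
        cls = self.cls.expand(x.size(0), -1, -1)
        z = self.encoder(torch.cat([cls, tokens], dim=1) + self.pos)
        return self.head(z[:, 0])                        # classify from the CLS token

print(TinyViT()(torch.randn(1, 3, 256, 256)).shape)      # torch.Size([1, 1000])
```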
Image Classification on ImageNet
What about video?
TimeSformers
Combine space and time attention with Divided Space-Time Attention!
[Diagram: for a query patch in frame t, spatial attention is computed among the patches of the same frame, while temporal attention is computed with the patches at the same location in the neighbouring frames t − δ and t + δ.]
Is Space-Time Attention All You Need for Video Understanding? | Gedas Bertasius et al.
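A sketch of what "divided" means in practice (module and dimension names are ours, not the authors' implementation): temporal attention runs across frames for each patch position, then spatial attention runs within each frame.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """Illustrative TimeSformer-style divided space-time attention block."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, frames, patches, dim)
        B, T, N, D = x.shape
        # Temporal attention: each spatial location attends across frames.
        xt = self.norm1(x).permute(0, 2, 1, 3).reshape(B * N, T, D)
        t_out, _ = self.time_attn(xt, xt, xt)
        x = x + t_out.reshape(B, N, T, D).permute(0, 2, 1, 3)
        # Spatial attention: each frame attends across its own patches.
        xs = self.norm2(x).reshape(B * T, N, D)
        s_out, _ = self.space_attn(xs, xs, xs)
        return x + s_out.reshape(B, T, N, D)

x = torch.randn(1, 8, 196, 768)                 # (batch, frames, patches per frame, dim)
print(DividedSpaceTimeBlock()(x).shape)         # torch.Size([1, 8, 196, 768])
```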
TimeSformers
Up to several minutes of video can be analysed!
Wait… Can I use different types of attention?
Transformers use a lot of memory!
[Diagram: during training, the activations of every Attention and Feed Forward layer (about 2 GB each in the example) are kept for backpropagation, so the used memory grows layer after layer: 2 GB, 4 GB, … a lot!]
Efficient Transformers
[Diagram: with reversible (Rev) Attention and FeedForward layers, the inputs of each block can be recomputed from its outputs, so activations do not have to be stored and the used memory stays roughly constant instead of growing from 2 GB to 4 GB and beyond.]
A new efficient Transformer variant | Lukasz Kaiser
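The trick behind this saving is the reversible residual layer: a block's inputs can be recomputed exactly from its outputs during the backward pass, so intermediate activations need not be stored. A minimal sketch (F and G stand in for the attention and feed-forward sub-layers; this is illustrative, not the Reformer code):

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """y1 = x1 + F(x2); y2 = x2 + G(y1). The inputs are recoverable from the
    outputs, so intermediate activations need not be stored for backprop."""
    def __init__(self, dim):
        super().__init__()
        self.F = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())  # e.g. attention
        self.G = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())  # e.g. feed-forward

    def forward(self, x1, x2):
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2

    def inverse(self, y1, y2):          # recompute the inputs from the outputs
        x2 = y2 - self.G(y1)
        x1 = y1 - self.F(x2)
        return x1, x2

blk = ReversibleBlock(64)
x1, x2 = torch.randn(2, 64), torch.randn(2, 64)
y1, y2 = blk(x1, x2)
r1, r2 = blk.inverse(y1, y2)
print(torch.allclose(x1, r1, atol=1e-6), torch.allclose(x2, r2, atol=1e-6))  # True True
```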
Swin Transformers
[Diagram: the overall architecture. An H × W × 3 image goes through Patch Partition (H/4 × W/4 × 48) and a Linear Embedding (H/4 × W/4 × C), followed by four stages of Swin Transformer Blocks: Stage 1 (×2 blocks, H/4 × W/4 × C), Stage 2 (Patch Merging, ×2 blocks, H/8 × W/8 × 2C), Stage 3 (Patch Merging, ×6 blocks, H/16 × W/16 × 4C) and Stage 4 (Patch Merging, ×2 blocks, H/32 × W/32 × 8C).]
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | Ze Liu et al.
Swin Transformers
[Diagram: the Swin Transformer builds hierarchical feature maps at 4×, 8× and 16× downsampling, whereas the plain Vision Transformer keeps a single 16× resolution throughout.]
Shifted Window based Self-Attention
[Diagram: in Self-Attention Layer l, attention is computed inside regular non-overlapping windows; in Self-Attention Layer l+1 the window grid is shifted, so information can flow between neighbouring windows.]
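The window partition and the shift are plain tensor operations; here is an illustrative sketch (window size and shapes are arbitrary, and the real implementation also masks attention across the wrap-around boundary created by the shift):

```python
import torch

def window_partition(x, ws):
    """x: (batch, H, W, dim) -> (num_windows * batch, ws*ws, dim):
    tokens grouped into non-overlapping ws x ws windows."""
    B, H, W, D = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, D)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, D)

x = torch.randn(1, 8, 8, 96)             # an 8x8 grid of patch tokens
ws = 4
windows = window_partition(x, ws)        # layer l: regular windows
print(windows.shape)                     # torch.Size([4, 16, 96])

# Layer l+1: cyclically shift the grid by ws//2 before partitioning,
# so neighbouring windows exchange information.
shifted = torch.roll(x, shifts=(-ws // 2, -ws // 2), dims=(1, 2))
shifted_windows = window_partition(shifted, ws)
```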
Swin Transformers
Source: Swin Transformer Object Detection Demo – By DeepReader
https://www.youtube.com/watch?v=FQVS_0Bja6o
Can we do without Self-Attention?
What essentially is the attention mechanism? It is «just» a transformation!
Fourier Network
[Diagram: a standard encoder block (Embeddings → Attention Calculation → Add & Normalize → Feed Forward → Add & Normalize → Dense → Output Prediction) side by side with the Fourier Network, which replaces the Attention Calculation with a Fourier Transformation.]
Why Fourier?
It’s just a transformation!
[Illustration of the Fourier Transform, image from mriquestion.com]
Fourier Network
What does it transform?
[Diagram: starting from the input vectors, one Fourier Transform is applied over the hidden domain and another over the sequence domain, mixing information across tokens and across features.]
FNet: Mixing Tokens with Fourier Transforms | James Lee-Thorp et al.
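A minimal sketch of an FNet-style mixing block under these assumptions (layer sizes are arbitrary and torch.fft is used for both transforms; this is not the official implementation):

```python
import torch
import torch.nn as nn

class FNetMixingLayer(nn.Module):
    """Illustrative FNet-style block: Fourier token mixing instead of attention."""
    def __init__(self, dim, hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                      # x: (batch, seq_len, dim)
        # Fourier "attention": FFT over the hidden domain, then over the
        # sequence domain, keeping the real part.
        mixed = torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real
        x = self.norm1(x + mixed)              # Add & Normalize
        return self.norm2(x + self.ff(x))      # Feed Forward + Add & Normalize

x = torch.randn(1, 16, 256)
print(FNetMixingLayer(256)(x).shape)           # torch.Size([1, 16, 256])
```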
MLP-Mixer
[Diagram: the image is split into patches (1…9), each patch passes through a Per-patch Fully Connected layer and then N × (Mixer Layer); Global Average Pooling and a final Fully-Connected layer produce the CLASS.]
MLP-Mixer: An all-MLP Architecture for Vision | Ilya Tolstikhin et al.
Mixer Layer
[Diagram: each Mixer Layer applies a Layer Norm followed by MLPs transforming over the sequence domain (token mixing), then another Layer Norm followed by MLPs transforming over the hidden domain (channel mixing).]
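A sketch of one Mixer layer as described above (hidden sizes are arbitrary; the official implementation differs in details):

```python
import torch
import torch.nn as nn

class MixerLayer(nn.Module):
    """Illustrative MLP-Mixer layer: token mixing followed by channel mixing."""
    def __init__(self, num_patches, dim, token_hidden=256, channel_hidden=1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(       # mixes over the sequence (token) domain
            nn.Linear(num_patches, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_patches))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(     # mixes over the hidden (channel) domain
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):                     # x: (batch, num_patches, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))

x = torch.randn(1, 9, 128)                    # 9 patches as in the figure
print(MixerLayer(num_patches=9, dim=128)(x).shape)   # torch.Size([1, 9, 128])
```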
What happens during training?
[Diagram: learned knowledge as training progresses. The Convolutional Neural Network at some point is not improving anymore, while the Vision Transformer is still improving.]
Why are they different?
Convolutional Neural Network
• Relies on inductive biases: it is locality-sensitive and translation-invariant
• Lacks global understanding
Vision Transformers
• Able to find long-term dependencies
• Need very large datasets for training
A different point of view
ViTs are both local and global!
With a low amount of data, the ViT learns only global information.
[Diagram: attention heads focus on farther patches (e.g. weight 0.7 on a distant patch vs 0.01 on a nearby one).]
A different point of view
ViTs are both local and global!
With more data, the ViT also learns local information.
[Diagram: heads in the higher layers still focus on farther patches (e.g. 0.7 vs 0.01), while heads in the lower layers focus on both farther and closer patches (e.g. 0.6, 0.4, 0.3 vs 0.06, 0.01).]
A different point of view
They learn different representations!
[Diagram: layer-wise representation similarity. The Vision Transformer shows similar representations through the layers, while the CNN shows different representations through the layers.]
Do Vision Transformers See Like Convolutional Neural Networks? | Maithra Raghu et al.
A different point of view
Vision Transformers are very robust!
Robustness tests: Occlusion, Distribution Shift, Adversarial Perturbation, Permutation.
Intriguing Properties of Vision Transformers | Muzammal Naseer et al.
Can we obtain the best of the two architectures?
What happens in CNNs?
[Diagram: successive convolutions shrink the feature maps, e.g. 28 × 28 → 24 × 24 → 12 × 12 → 8 × 8 → 4 × 4. Hey! They are patches!]
Hybrids
A possible configuration: a Convolutional Neural Network extracts a feature map, whose spatial positions become the tokens of a Transformer Encoder; an MLP head outputs the CLASS.
Combining EfficientNet and Vision Transformers for Video Deepfake Detection | Coccomini et al.
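One way such a hybrid can be wired up is sketched below (an illustrative configuration, not the exact model from the paper; the EfficientNet-B0 backbone, token dimension and head are our choices, and positional information is omitted for brevity):

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0

class HybridClassifier(nn.Module):
    """Illustrative CNN + Transformer hybrid: CNN feature map -> tokens -> encoder -> MLP."""
    def __init__(self, num_classes=2, dim=256, depth=4, heads=8):
        super().__init__()
        self.backbone = efficientnet_b0(weights=None).features   # CNN feature extractor
        self.proj = nn.Conv2d(1280, dim, kernel_size=1)           # 1280 = EfficientNet-B0 output channels
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                       # x: (batch, 3, H, W)
        f = self.proj(self.backbone(x))         # (batch, dim, h, w): feature map positions = tokens
        tokens = f.flatten(2).transpose(1, 2)   # (batch, h*w, dim)
        cls = self.cls.expand(x.size(0), -1, -1)
        out = self.encoder(torch.cat([cls, tokens], dim=1))
        return self.head(out[:, 0])             # classify from the CLS token

print(HybridClassifier()(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 2])
```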
Recap
How can we use Transformers in Vision?
• Use a pure Vision Transformer
• Improve the internal attention mechanism
• Use an alternative transformation
• Combine CNNs with Vision Transformers
ImageNet Ranking (top-1 accuracy)
1° CoAtNet-7: 90.88%, Conv + Vision Transformer, pretrained on JFT, 2440M parameters
2° ViT-G/14: 90.45%, Vision Transformer, pretrained on JFT, 1843M parameters
3° ViT-MoE-15B: 90.35%, Vision Transformer, pretrained on JFT, 14700M parameters
APPLICATIONS!
• DINO, self-supervised Vision Transformers (video: supervised learning vs DINO). Source: Advancing the state of the art in computer vision with self-supervised Transformers and 10x more efficient training – Facebook AI
• Paint Transformer: Feed Forward Neural Painting with Stroke Prediction
• Image GPT – By OpenAI
Thank You for the Attention!
Any questions?
Editor's Notes
  • On the slide «They learn different representations!»: the authors plot CKA similarities between all pairs of layers across different model architectures. ViTs have a relatively uniform layer-similarity structure, with a clear grid-like pattern and large similarity between lower and higher layers; by contrast, the ResNet models show clear stages in the similarity structure, with smaller similarity scores between lower and higher layers.