Transformers in Vision
From Zero to Hero!
Davide Coccomini (PhD Candidate) & Nicola Messina (PhD Student)
Italian National Research Council
What do you think when you hear the word «Transformer»?
The Transformer «today»
[Diagram: the full encoder-decoder Transformer. Nx encoder blocks (Multi-Head Attention + Add & Norm, Feed Forward + Add & Norm) read the input "The cat jumps the wall"; positional encodings are added to the input and output embeddings. The encoder output (the memory) provides V and K to the Mx decoder blocks, which, given the partial output "Il gatto", produce the next-token distribution through a final Linear + Softmax layer: Salta (90%) | Odia (9%) | Perché (1%).]
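For readers who want to poke at this architecture in code, here is a minimal sketch using PyTorch's built-in nn.Transformer; the vocabulary sizes, dimensions and token IDs are invented for illustration, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

# Toy encoder-decoder Transformer; all sizes are illustrative, not from the talk.
d_model, src_vocab, tgt_vocab = 512, 1000, 1000
src_embed = nn.Embedding(src_vocab, d_model)
tgt_embed = nn.Embedding(tgt_vocab, d_model)
model = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)
generator = nn.Linear(d_model, tgt_vocab)      # final Linear (+ Softmax at inference)

src = torch.randint(0, src_vocab, (1, 5))      # "The cat jumps the wall" as token IDs
tgt = torch.randint(0, tgt_vocab, (1, 2))      # "Il gatto" as token IDs
tgt_mask = model.generate_square_subsequent_mask(tgt.size(1))

dec_out = model(src_embed(src), tgt_embed(tgt), tgt_mask=tgt_mask)
next_token_probs = generator(dec_out[:, -1]).softmax(dim=-1)
print(next_token_probs.shape)                  # torch.Size([1, 1000]): Salta? Odia? Perché?
```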
Outline
Transformers: The beginnings
• Some history: from RNNs to Transformers
• Transformers’ attention and self-attention mechanisms
• The power of the Transformer Encoder
Transformers in Vision
• From text to images: Vision Transformers
• From images to videos
• The scale and data problem
• Convolutional Neural Networks and Vision Transformers
• Some interesting real-world applications
History
2017: Transformers introduced in NLP (text)
2020: Vision Transformers (images)
2021: Transformers for video understanding (videos)
Now: the Computer Vision revolution!
A step back: Recurrent Networks (RNNs)
[Diagram: a recurrent encoder-decoder for translation. The encoder (E) reads "The cat jumps the wall" one token at a time, producing hidden states h0…h4; the final sentence embedding is handed to the decoder (D), which starts from <s> and emits "Il gatto salta il muro" followed by <end> through hidden states h5…h9.]
Problems
1. We forget tokens that are too far in the past.
2. We must wait for the previous token before computing the next hidden state.
[Same recurrent encoder-decoder diagram as in the previous slide.]
Solving problem 1
"We forget tokens too far in the past"
Solution: add an attention mechanism.
[Diagram: at every decoding step, the encoder hidden states h0…h4 are combined into a context vector, a weighted sum whose weights are computed by attention; the decoder uses this context together with its own hidden state to produce the next token of "Il gatto salta …".]
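A tiny sketch of this context-vector idea (tensor sizes and the dot-product scoring are illustrative; real RNN attention variants differ in how the scores are computed):

```python
import torch

def attention_context(decoder_state, encoder_states):
    """decoder_state: (hidden,); encoder_states: (seq_len, hidden).
    Returns the context vector: a softmax-weighted sum of encoder states."""
    scores = encoder_states @ decoder_state        # one relevance score per source token
    weights = torch.softmax(scores, dim=0)         # attention weights, summing to 1
    return weights @ encoder_states                # (hidden,) context vector

h_enc = torch.randn(5, 256)   # h0..h4 for "The cat jumps the wall"
h_dec = torch.randn(256)      # current decoder hidden state
print(attention_context(h_dec, h_enc).shape)       # torch.Size([256])
```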
Solving problem 2
"We need to wait for the previous token to compute the next hidden state"
Solution: throw away the recurrent connections and keep only attention, as proposed in the 2017 paper "Attention Is All You Need".
[Diagram: the same encoder-decoder as before, but without recurrent links; the tokens can now be processed in parallel and interact only through attention.]
Full Transformer Architecture
[Diagram: the same encoder-decoder figure as "The Transformer «today»": Nx encoder blocks and Mx decoder blocks with Multi-Head Attention, Add & Norm, Feed Forward and positional encodings; the encoder memory provides V and K to the decoder, and a Linear + Softmax layer maps “The cat jumps the wall” plus “Il gatto” to “Salta” (90%) | “Odia” (9%) | “Perché” (1%).]
Transformer's Attention Mechanism
[Diagram: the target sequence ("Il gatto salta …") provides the Queries, while the source sequence ("The cat jumps the …") provides the Keys and Values; each token first goes through an FFN. Queries and Keys are compared with a dot product, the scores are normalised with a softmax, and the result re-expresses the target tokens "from the point of view" of the source sequence.]
Transformer's Attention Mechanism, from a different perspective
[Diagram: attention as a soft lookup table. The source sequence ("The cat jumps the … wall") builds a dictionary of Key/Value pairs; a Query coming from the target sequence soft-matches against all Keys, and the output (e.g. the "gatto" token) is a weighted average of the Value vectors in the source dictionary.]
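In code this is the (scaled) dot-product attention of the original paper; the sketch below uses made-up dimensions:

```python
import math
import torch

def attention(Q, K, V):
    """Q: (n_target, d); K, V: (n_source, d). Each output row is a weighted
    average of the Values, i.e. a target token re-expressed 'from the point
    of view' of the source sequence."""
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))   # dot product + scaling
    weights = torch.softmax(scores, dim=-1)                    # soft-matching against the keys
    return weights @ V

Q = torch.randn(3, 64)    # "Il gatto salta ..."
K = torch.randn(5, 64)    # "The cat jumps the wall"
V = torch.randn(5, 64)
print(attention(Q, K, V).shape)   # torch.Size([3, 64])
```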
Attention and Self-Attention
Self-Attention
• Source = Target
• Keys, Queries and Values are obtained from the same sentence
• Captures intra-sequence dependencies
• Example: in "I gave my dog Charlie some food", self-attention lets "gave" resolve Who? (I), To whom? (my dog Charlie) and What? (some food).
Attention (cross-attention)
• Source ≠ Target
• Queries come from one sequence and Keys/Values from the other (in the Transformer decoder, the target provides the Queries and the encoder memory provides the Keys and Values)
• Captures inter-sequence dependencies
• Example: aligning the target "Ho dato da mangiare al mio cane Charlie" ("I fed my dog Charlie") with the source "I gave my dog Charlie some food".
Multi-Head Self-Attention
Multiple instantiations of the attention mechanism run in parallel.
[Diagram: the input sequence ("The cat is running") is linearly projected into Queries, Keys and Values, each split into h slices (heads); every head computes its own attention (dot product, normalise, matmul with the Values), and the h outputs are concatenated and passed through a final dense layer.]
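A compact sketch of the figure above (head count and dimensions are arbitrary); in practice PyTorch's nn.MultiheadAttention provides the same functionality:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Illustrative multi-head self-attention: linear projections, h parallel
    attention heads, concatenation, and a final dense layer."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d_head = heads, dim // heads
        self.q_proj = nn.Linear(dim, dim)   # the LINEAR blocks in the slide
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)      # CONCAT + DENSE

    def forward(self, x):                   # x: (batch, seq_len, dim)
        B, N, D = x.shape
        def split(t):                       # slice into h heads
            return t.view(B, N, self.heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5   # DOT + NORMALIZE
        out = torch.softmax(scores, dim=-1) @ v                 # MATMUL with the values
        return self.out(out.transpose(1, 2).reshape(B, N, D))

x = torch.randn(1, 4, 512)                  # "The cat is running"
print(MultiHeadSelfAttention()(x).shape)    # torch.Size([1, 4, 512])
```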
Full Transformer Architecture (recap)
[Same encoder-decoder diagram, now annotated: the encoder uses Multi-Head Self-Attention, while the decoder also uses Multi-Head Attention over the encoder memory, i.e. the lookup table of Keys and Values built from the source sequence.]
Is there any problem with Transformers?
Self-Attention: the attention calculation is O(n²) in the sequence length, because every token of "I gave my dog Charlie some food" must attend to every other token.
The Power of the Transformer Encoder
• Many achievements using only the Encoder
• BERT (Devlin et al., 2018)
[Diagram: an Embedding Layer plus Positional Encoding feeds the sequence <CLS> I gave my dog Charlie some food <SEP> He ate it into a Transformer Encoder (N layers); the encoder output (the memory) is trained with Next Sentence Prediction {0, 1} on the <CLS> token and with Masked Language Modelling, predicting the masked word «ate».]
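As a quick illustration of what such a pretrained encoder can do, the snippet below runs masked-language-model inference through the Hugging Face transformers library (the checkpoint name and the sentence are just an example):

```python
from transformers import pipeline

# Masked Language Modelling with an encoder-only model (BERT).
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("I gave my dog Charlie some [MASK]."):
    print(f'{pred["token_str"]:>10}  {pred["score"]:.3f}')
```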
Transformers in Computer Vision
Can we use the self-attention mechanism on images?
Transformers in Computer Vision
• The transformer works with a set of tokens: what are the tokens in an image?
• Treating every pixel as a token does not scale: a 256px × 256px image has 65,536 pixels, so full self-attention would require roughly 4.3 billion pairwise calculations. Impossible!
Transformers in Computer Vision
• Tokens as the features from an object detector
[Diagram: an object detector proposes regions of the image; ROI Pooling over each region yields one feature vector per object, and these region features are the tokens.]
Vision Transformers (ViTs)
“An image is worth 16x16 words” | Dosovitskiy et al., 2020
[Diagram: a 256px × 256px image is split into 16px × 16px patches; each patch is flattened and passed through a linear projection, and the resulting vectors are the tokens.]
Vision Transformers (ViTs)
[Diagram: the patches (1…9) go through a Linear Projection of Flattened Patches, a class token (*) is prepended and position embeddings (0…9) are added; the sequence is processed by a Transformer Encoder, and an MLP Head on the class token outputs the CLASS.]
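A minimal ViT sketch following the figure (patch size 16 on a 256×256 image; every other hyper-parameter is invented for illustration):

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Illustrative Vision Transformer: patches -> linear projection ->
    [CLS] + position embeddings -> Transformer Encoder -> MLP head."""
    def __init__(self, img=256, patch=16, dim=384, depth=6, heads=6, num_classes=1000):
        super().__init__()
        n_patches = (img // patch) ** 2
        # A stride-`patch` convolution flattens and linearly projects the patches.
        self.to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)          # MLP head

    def forward(self, x):                                # x: (batch, 3, 256, 256)
        tokens = self.to_tokens(x).flatten(2).transpose(1, 2)   # (batch, 256, dim)
        cls = self.cls.expand(x.size(0), -1, -1)
        z = self.encoder(torch.cat([cls, tokens], dim=1) + self.pos)
        return self.head(z[:, 0])                        # classify from the CLS token

print(TinyViT()(torch.randn(1, 3, 256, 256)).shape)      # torch.Size([1, 1000])
```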
Image Classification on ImageNet
What about video?
TimeSformers
Combine space and time attention with Divided Space-Time Attention!
[Diagram: for a query patch in frame t, spatial attention is computed among the patches of the same frame, while temporal attention is computed with the patches at the same location in the neighbouring frames t − δ and t + δ.]
Is Space-Time Attention All You Need for Video Understanding? | Gedas Bertasius et al.
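A sketch of what "divided" means in practice (module and dimension names are ours, not the authors' implementation): temporal attention runs across frames for each patch position, then spatial attention runs within each frame.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """Illustrative TimeSformer-style divided space-time attention block."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, frames, patches, dim)
        B, T, N, D = x.shape
        # Temporal attention: each spatial location attends across frames.
        xt = self.norm1(x).permute(0, 2, 1, 3).reshape(B * N, T, D)
        t_out, _ = self.time_attn(xt, xt, xt)
        x = x + t_out.reshape(B, N, T, D).permute(0, 2, 1, 3)
        # Spatial attention: each frame attends across its own patches.
        xs = self.norm2(x).reshape(B * T, N, D)
        s_out, _ = self.space_attn(xs, xs, xs)
        return x + s_out.reshape(B, T, N, D)

x = torch.randn(1, 8, 196, 768)                 # (batch, frames, patches per frame, dim)
print(DividedSpaceTimeBlock()(x).shape)         # torch.Size([1, 8, 196, 768])
```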
TimeSformers
Up to several minutes of video can be analysed!
Wait… Can I use different types of attention?
Transformers use a lot of memory!
[Diagram: during training, the activations of every Attention and Feed Forward layer (about 2 GB each in the example) are kept for backpropagation, so the used memory grows layer after layer: 2 GB, 4 GB, … a lot!]
Efficient Transformers
[Diagram: with reversible (Rev) Attention and FeedForward layers, the inputs of each block can be recomputed from its outputs, so activations do not have to be stored and the used memory stays roughly constant instead of growing from 2 GB to 4 GB and beyond.]
A new efficient Transformer variant | Lukasz Kaiser
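The trick behind this saving is the reversible residual layer: a block's inputs can be recomputed exactly from its outputs during the backward pass, so intermediate activations need not be stored. A minimal sketch (F and G stand in for the attention and feed-forward sub-layers; this is illustrative, not the Reformer code):

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """y1 = x1 + F(x2); y2 = x2 + G(y1). The inputs are recoverable from the
    outputs, so intermediate activations need not be stored for backprop."""
    def __init__(self, dim):
        super().__init__()
        self.F = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())  # e.g. attention
        self.G = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())  # e.g. feed-forward

    def forward(self, x1, x2):
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2

    def inverse(self, y1, y2):          # recompute the inputs from the outputs
        x2 = y2 - self.G(y1)
        x1 = y1 - self.F(x2)
        return x1, x2

blk = ReversibleBlock(64)
x1, x2 = torch.randn(2, 64), torch.randn(2, 64)
y1, y2 = blk(x1, x2)
r1, r2 = blk.inverse(y1, y2)
print(torch.allclose(x1, r1, atol=1e-6), torch.allclose(x2, r2, atol=1e-6))  # True True
```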
Swin Transformers
[Diagram: the overall architecture. An H × W × 3 image goes through Patch Partition (H/4 × W/4 × 48) and a Linear Embedding (H/4 × W/4 × C), followed by four stages of Swin Transformer Blocks: Stage 1 (×2 blocks, H/4 × W/4 × C), Stage 2 (Patch Merging, ×2 blocks, H/8 × W/8 × 2C), Stage 3 (Patch Merging, ×6 blocks, H/16 × W/16 × 4C) and Stage 4 (Patch Merging, ×2 blocks, H/32 × W/32 × 8C).]
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | Ze Liu et al.
Swin Transformers
[Diagram: the Swin Transformer builds hierarchical feature maps at 4×, 8× and 16× downsampling, whereas the plain Vision Transformer keeps a single 16× resolution throughout.]
Shifted Window based Self-Attention
[Diagram: in Self-Attention Layer l, attention is computed inside regular non-overlapping windows; in Self-Attention Layer l+1 the window grid is shifted, so information can flow between neighbouring windows.]
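The window partition and the shift are plain tensor operations; here is an illustrative sketch (window size and shapes are arbitrary, and the real implementation also masks attention across the wrap-around boundary created by the shift):

```python
import torch

def window_partition(x, ws):
    """x: (batch, H, W, dim) -> (num_windows * batch, ws*ws, dim):
    tokens grouped into non-overlapping ws x ws windows."""
    B, H, W, D = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, D)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, D)

x = torch.randn(1, 8, 8, 96)             # an 8x8 grid of patch tokens
ws = 4
windows = window_partition(x, ws)        # layer l: regular windows
print(windows.shape)                     # torch.Size([4, 16, 96])

# Layer l+1: cyclically shift the grid by ws//2 before partitioning,
# so neighbouring windows exchange information.
shifted = torch.roll(x, shifts=(-ws // 2, -ws // 2), dims=(1, 2))
shifted_windows = window_partition(shifted, ws)
```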
Swin Transformers
Source: Swin Transformer Object Detection Demo – By DeepReader
https://www.youtube.com/watch?v=FQVS_0Bja6o
Can we do without Self-Attention?
What essentially is the attention mechanism? It is «just» a transformation!
Fourier Network
[Diagram: a standard encoder block (Embeddings → Attention Calculation → Add & Normalize → Feed Forward → Add & Normalize → Dense → Output Prediction) side by side with the Fourier Network, which replaces the Attention Calculation with a Fourier Transformation.]
Why Fourier?
It’s just a transformation!
[Illustration of the Fourier Transform, image from mriquestion.com]
Fourier Network
What does it transform?
[Diagram: starting from the input vectors, one Fourier Transform is applied over the hidden domain and another over the sequence domain, mixing information across tokens and across features.]
FNet: Mixing Tokens with Fourier Transforms | James Lee-Thorp et al.
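A minimal sketch of an FNet-style mixing block under these assumptions (layer sizes are arbitrary and torch.fft is used for both transforms; this is not the official implementation):

```python
import torch
import torch.nn as nn

class FNetMixingLayer(nn.Module):
    """Illustrative FNet-style block: Fourier token mixing instead of attention."""
    def __init__(self, dim, hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                      # x: (batch, seq_len, dim)
        # Fourier "attention": FFT over the hidden domain, then over the
        # sequence domain, keeping the real part.
        mixed = torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real
        x = self.norm1(x + mixed)              # Add & Normalize
        return self.norm2(x + self.ff(x))      # Feed Forward + Add & Normalize

x = torch.randn(1, 16, 256)
print(FNetMixingLayer(256)(x).shape)           # torch.Size([1, 16, 256])
```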
MLP-Mixer
[Diagram: the image is split into patches (1…9), each patch passes through a Per-patch Fully Connected layer and then N × (Mixer Layer); Global Average Pooling and a final Fully-Connected layer produce the CLASS.]
MLP-Mixer: An all-MLP Architecture for Vision | Ilya Tolstikhin et al.
Mixer Layer
[Diagram: each Mixer Layer applies a Layer Norm followed by MLPs transforming over the sequence domain (token mixing), then another Layer Norm followed by MLPs transforming over the hidden domain (channel mixing).]
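A sketch of one Mixer layer as described above (hidden sizes are arbitrary; the official implementation differs in details):

```python
import torch
import torch.nn as nn

class MixerLayer(nn.Module):
    """Illustrative MLP-Mixer layer: token mixing followed by channel mixing."""
    def __init__(self, num_patches, dim, token_hidden=256, channel_hidden=1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(       # mixes over the sequence (token) domain
            nn.Linear(num_patches, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_patches))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(     # mixes over the hidden (channel) domain
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):                     # x: (batch, num_patches, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))

x = torch.randn(1, 9, 128)                    # 9 patches as in the figure
print(MixerLayer(num_patches=9, dim=128)(x).shape)   # torch.Size([1, 9, 128])
```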
What happens during training?
[Diagram: learned knowledge as training progresses. The Convolutional Neural Network at some point is not improving anymore, while the Vision Transformer is still improving.]
Why are they different?
Convolutional Neural Network
• Relies on inductive biases: it is locality-sensitive and translation-invariant
• Lacks global understanding
Vision Transformers
• Able to find long-term dependencies
• Need very large datasets for training
A different point of view
ViTs are both local and global!
With a low amount of data, the ViT learns only global information.
[Diagram: attention heads focus on farther patches (e.g. weight 0.7 on a distant patch vs 0.01 on a nearby one).]
A different point of view
ViTs are both local and global!
With more data, the ViT also learns local information.
[Diagram: heads in the higher layers still focus on farther patches (e.g. 0.7 vs 0.01), while heads in the lower layers focus on both farther and closer patches (e.g. 0.6, 0.4, 0.3 vs 0.06, 0.01).]
A different point of view
They learn different representations!
[Diagram: layer-wise representation similarity. The Vision Transformer shows similar representations through the layers, while the CNN shows different representations through the layers.]
Do Vision Transformers See Like Convolutional Neural Networks? | Maithra Raghu et al.
A different point of view
Vision Transformers are very robust!
Robustness tests: Occlusion, Distribution Shift, Adversarial Perturbation, Permutation.
Intriguing Properties of Vision Transformers | Muzammal Naseer et al.
Can we obtain the best of the two architectures?
What happens in CNNs?
[Diagram: successive convolutions shrink the feature maps, e.g. 28 × 28 → 24 × 24 → 12 × 12 → 8 × 8 → 4 × 4. Hey! They are patches!]
Hybrids
A possible configuration: a Convolutional Neural Network extracts a feature map, whose spatial positions become the tokens of a Transformer Encoder; an MLP head outputs the CLASS.
Combining EfficientNet and Vision Transformers for Video Deepfake Detection | Coccomini et al.
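One way such a hybrid can be wired up is sketched below (an illustrative configuration, not the exact model from the paper; the EfficientNet-B0 backbone, token dimension and head are our choices, and positional information is omitted for brevity):

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0

class HybridClassifier(nn.Module):
    """Illustrative CNN + Transformer hybrid: CNN feature map -> tokens -> encoder -> MLP."""
    def __init__(self, num_classes=2, dim=256, depth=4, heads=8):
        super().__init__()
        self.backbone = efficientnet_b0(weights=None).features   # CNN feature extractor
        self.proj = nn.Conv2d(1280, dim, kernel_size=1)           # 1280 = EfficientNet-B0 output channels
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                       # x: (batch, 3, H, W)
        f = self.proj(self.backbone(x))         # (batch, dim, h, w): feature map positions = tokens
        tokens = f.flatten(2).transpose(1, 2)   # (batch, h*w, dim)
        cls = self.cls.expand(x.size(0), -1, -1)
        out = self.encoder(torch.cat([cls, tokens], dim=1))
        return self.head(out[:, 0])             # classify from the CLS token

print(HybridClassifier()(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 2])
```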
Recap
How can we use Transformers in Vision?
• Use a pure Vision Transformer
• Improve the internal attention mechanism
• Use an alternative transformation
• Combine CNNs with Vision Transformers
ImageNet Ranking (top-1 accuracy)
1° CoAtNet-7: 90.88%, Conv + Vision Transformer, pretrained on JFT, 2440M parameters
2° ViT-G/14: 90.45%, Vision Transformer, pretrained on JFT, 1843M parameters
3° ViT-MoE-15B: 90.35%, Vision Transformer, pretrained on JFT, 14700M parameters
APPLICATIONS!
• DINO, self-supervised Vision Transformers (video: supervised learning vs DINO). Source: Advancing the state of the art in computer vision with self-supervised Transformers and 10x more efficient training – Facebook AI
• Paint Transformer: Feed Forward Neural Painting with Stroke Prediction
• Image GPT – By OpenAI
Thank You for the Attention!
Any questions?
Editor's Notes
  • On the slide «They learn different representations!»: the authors plot CKA similarities between all pairs of layers across different model architectures. ViTs have a relatively uniform layer-similarity structure, with a clear grid-like pattern and large similarity between lower and higher layers; by contrast, the ResNet models show clear stages in the similarity structure, with smaller similarity scores between lower and higher layers.