Presenter: Ryohei Suzuki
Jun. 11, 2021
Transformer-based approaches for
visual representation learning
Glossary
autoregression: generating a sequence while referring to previously generated outputs
CNN: convolutional neural network
embedding: conversion of data into a vector representation
equivariance: applying an operation A and a transformation f in either order gives the same result
inductive bias: assumptions about the data implicitly introduced by model design
invariance: applying a transformation f before an operation A does not change the result
MLP: multilayer perceptron
NLP: natural language processing
pretraining: training in advance of the target task
self-attention: self-attention mechanism
2
Today’s papers
Vaswani et al. (Google Brain, UToronto),
NeurIPS 2017
Dosovitskiy et al. (Google Brain),
ICLR 2021
Caron et al. (FAIR, INRIA, Sorbonne),
arXiv preprint 2021 3
● CNNs (e.g., VGG, ResNet) have been the de facto standard for
visual tasks in deep learning
○ Convolution provides favorable properties for image processing
● Recently, alternative approaches have been emerging
○ Transformer-based methods, e.g., Attention-CNN, ViT
○ MLP-based methods, e.g., MLP-Mixer, gMLP
● In particular, NLP-inspired Transformer-based approaches have
shown promising performance and interesting properties
Context
4
Review: Convolutional Neural Network (CNN)
● Convolution = multi-channel filtering by learnable kernels
● Typical modern CNNs contain convolution, activation, pooling,
skip-connection, normalization (e.g., BN, IN, GN), etc.
kernel =
linear map of finite-size window
5
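To make "kernel = linear map of a finite-size window" concrete, here is a minimal NumPy sketch (mine, not from the slides); the function name conv2d and all shapes are illustrative only.

```python
import numpy as np

# Minimal sketch: a single 2D convolution as "multi-channel filtering by a
# learnable kernel" -- the same linear map applied to every finite-size window.
def conv2d(x, kernel):
    """x: (C_in, H, W), kernel: (C_out, C_in, kH, kW); valid padding, stride 1."""
    c_in, h, w = x.shape
    c_out, _, kh, kw = kernel.shape
    out = np.zeros((c_out, h - kh + 1, w - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            window = x[:, i:i + kh, j:j + kw]                    # finite-size window
            out[:, i, j] = np.tensordot(kernel, window, axes=3)  # linear map
    return out

x = np.random.randn(3, 8, 8)        # e.g., a small RGB patch
k = np.random.randn(16, 3, 3, 3)    # 16 learnable 3x3 kernels
print(conv2d(x, k).shape)           # (16, 6, 6)
```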
Visual inductive biases of CNNs
Inductive bias: regularization on the solution implicitly introduced by model
design, which helps exploit the characteristics of the data
● Locality
○ Natural images have a spatial hierarchy
○ A sequence of convolutions mimics this hierarchical structure
receptive field gradually grows
through multiple convolutions
https://towardsdatascience.com/journey-from-machine-learning-to-deep-learning-8a807e8f3c1c
6
Visual inductive biases of CNNs
● Translation invariance / equivariance
○ invariance: A(f(x)) = A(x); equivariance: A(f(x)) = f(A(x)) (A: processing, f: transformation)
○ Convolution is naturally a translation-equivariant operation
○ Equivariance is in fact broken in CNNs [Zhang 2019, Kayhan 2020] (e.g., by strided downsampling and padding)
translation invariance:
we want the model to output the
same answer for shifted images
translation equivariance:
CNN-after-shift is equivalent to
shift-after-CNN
rotational equivariance is also
sometimes imposed
[Graham et al. 2020]
(figure: the original and the shifted image are both classified as “cat 98%”)
7
Intrinsic problems of CNNs
● Difficulty in handling long-range / irregularly shaped dependencies
○ CNNs can recognize the interaction between two distant points only
after a large number of convolutions builds up a large receptive field
● Low-resolution, blurred representations
○ Partially solved by skip-connections (e.g., U-Net, HRNet)
How to recognize the
interaction between the
bird and the flower?
HRNet [Sun et al. 2019]
8
Ideas from NLP
Self-attention
● Convolution: gather information from the nearby positions
● Self-attention: gather information from the related (attended) positions
○ originally developed in language models
Large-scale pretraining
● Most specific problems provide only a limited amount of data
● ImageNet-pretraining has already been broadly used in CV
● Pretraining with massively large datasets has shown amazing results
in NLP, e.g., GPT-3. (ImageNet 1.2M images vs. GPT-3 500B tokens)
image from: http://bliulab.net/selfAT_fold/
9
Paper 1: Attention is All You Need
● Proposed the Transformer (an attention-based translation model)
● One of the most important ML papers (cited more than 20,000 times!)
● Attention is All You Need is All You Need (many subsequent papers
have proposed “improved” models, but the progress was reported to be quite small [1])
[1] Narang et al. arXiv preprint 2021
10
Vaswani et al. (Google Brain, UToronto)
NeurIPS 2017
Encoder-decoder model
Many “translation” models can be formulated as an encoder-decoder pair
● Encoder extracts meaningful features from the input
● Decoder composes the output from the extracted features
“I am a student”
“Je suis un étudiant”
Dec
Enc
latent variable:
the input transformed into a form that contains the information
necessary for the translation task → can be utilized for multiple downstream tasks
Training a meaningful encoder (feature predictor)
= representation learning
11
Transformer
encoder
decoder
input:
“I am a student”
past output:
“Je suis un”
next output:
“étudiant”
12
Flow of processing
I am a student
Attention Block
Attention Block
Encoder Block
<start> Je suis un
Decoder Block
Attention Block
Attention Block
Decoder Block
étudiant
fed to attention
modules
Encoder Block
13
Parallelized training
I am a student
Attention Block
Attention Block
Encoder Block
<start> Je suis un
Decoder Block
Attention Block
Attention Block
Decoder Block
étudiant
Encoder Block
Je suis un
?
? ? ?
ground truth
14
Causal mask
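As a rough illustration of how the causal mask enables parallelized training (my sketch, not the paper's code): the mask lets position i attend only to positions ≤ i, so all target positions can be predicted in a single pass while never seeing the ground-truth future.

```python
import numpy as np

# Illustrative causal mask: True where attention is allowed
# (lower triangle incl. the diagonal), so the decoder at position i
# can only use positions <= i.
def causal_mask(n):
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -1e9)   # forbid attending to future positions
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 4                                   # e.g., "<start> Je suis un"
scores = np.random.randn(n, n)          # raw attention scores
weights = masked_softmax(scores, causal_mask(n))
print(np.round(weights, 2))             # upper triangle is ~0
```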
Self-attention
Instead of spatial convolution, we want to update the vector at
position i by aggregating information from related positions.
convolution
self-attention
Questions:
● How to know the “related” positions?
● How to aggregate the information (vectors)
from the found related positions? 15
Query-Key-Value attention
At each position i, we convert the input vector x_i into three vectors:
query q_i = W_Q x_i, key k_i = W_K x_i, and value v_i = W_V x_i,
then define the relatedness between positions i and j by the dot product: a_ij = softmax_j(q_i · k_j / √d_k).
The output is the weighted average of the values, weighted by the relatedness: out_i = Σ_j a_ij v_j.
In matrix form: Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V
In the decoder, cross-attention fetches the encoded features: Q comes from the decoder, K and V from the encoder.
16
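A minimal NumPy sketch of the scaled dot-product attention defined above (illustrative only; names such as W_q are my placeholders, not the paper's code):

```python
import numpy as np

# Single-head scaled dot-product attention.
# x: sequence of N d-dimensional tokens; W_q, W_k, W_v: learnable projections.
def attention(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v          # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # relatedness of positions i, j
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)    # softmax over j
    return weights @ V                           # weighted average of values

N, d, d_k = 5, 32, 16
x = np.random.randn(N, d)
W_q, W_k, W_v = (np.random.randn(d, d_k) for _ in range(3))
print(attention(x, W_q, W_k, W_v).shape)         # (5, 16)
```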
Multi-head attention
Attention is basically a weighted average → limited capability for
representing multiple types of relationships
e.g., “I like this lecture”
Multi-head attention first generates h sets of (Q, K, V),
then applies standard attention to each of them in parallel.
→ each branch has different attention targets
The results are concatenated and aggregated by a linear layer.
17
“I” and “this lecture” have different relationships to the word “like”
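A sketch of multi-head attention following the description above: h independent (Q, K, V) branches attend in parallel, and their outputs are concatenated and mixed by a final linear layer (here called W_o; all names are illustrative assumptions).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Multi-head attention as h parallel attention branches + linear aggregation.
def multi_head_attention(x, W_q, W_k, W_v, W_o):
    """x: (N, d); W_q/W_k/W_v: (h, d, d_head); W_o: (h * d_head, d)."""
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):        # each head has its own (Q, K, V)
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        heads.append(A @ V)                      # each head attends to different targets
    return np.concatenate(heads, axis=-1) @ W_o  # join, then linear aggregation

N, d, h = 5, 32, 4
d_head = d // h
x = np.random.randn(N, d)
W_q, W_k, W_v = (np.random.randn(h, d, d_head) for _ in range(3))
W_o = np.random.randn(h * d_head, d)
print(multi_head_attention(x, W_q, W_k, W_v, W_o).shape)   # (5, 32)
```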
Input processing
Input embedding
● Converts the raw input tokens into vectors by a learned projection
Positional encoding
● Injects the (absolute) position of each token into the embedded
vector using sinusoidal functions
18
(figure: positional encoding values plotted over encoding dimension × absolute position)
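A small sketch of the sinusoidal positional encoding from the paper (the helper name positional_encoding is mine):

```python
import numpy as np

# Sinusoidal positional encoding:
#   PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
#   PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
def positional_encoding(n_positions, d_model):
    pos = np.arange(n_positions)[:, None]             # absolute position
    i = np.arange(d_model // 2)[None, :]               # encoding-dimension pair index
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                          # added to the token embeddings

print(positional_encoding(50, 64).shape)   # (50, 64)
```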
Experimental results
Achieved the highest scores on English-to-German/French translation tasks at
about 1/100 of the training cost of the best methods as of 2017...
19
cf. Image GPT
Transformer’s autoregressive
generation can naturally be applied
to the next-pixel prediction task
Transformer variant (GPT-2)
trained on ImageNet shows
impressive image completion
results!
[Chen et al., ICML2020]
20
https://openai.com/blog/image-gpt/
Paper 2: Vision Transformer (ViT)
Question: can we completely discard convolutions and still tackle real
image recognition problems like classification?
→ A pure Transformer architecture pre-trained on a very large
dataset can perform better than modern CNNs.
21
Dosovitskiy et al. (Google Brain),
ICLR 2021
The ViT model
ViT uses small patches of the input image as the “tokens” (words)
Supervised training on the features computed by the encoder
22
classification task
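A minimal sketch of the ViT tokenization step described above (patch splitting + linear embedding + a class token + learnable positional embeddings); this is an illustrative PyTorch reimplementation, not the official code, and the hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

# Turn an image into ViT-style tokens: split into 16x16 patches, linearly
# embed each patch, prepend a learnable class token, add positional embeddings.
class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # A conv with kernel = stride = patch size is the linear patch projection
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

    def forward(self, x):                                   # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)    # (B, N, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)   # (2, 197, 768) -- 196 patches + 1 class token
```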
Performance compared to SoTA CNNs
Experiment: fine-tuned top-1 accuracy after supervised pretraining
vs. BiT-L (supervised ResNet) and Noisy Student (semi-supervised EfficientNet)
23
JFT-300M: Google’s internal
dataset consisting of 300
million images
ImageNet-21k: superset of
ImageNet (1k) consisting of
21,000 classes, 14M images
Note: BiT and NS are trained with JFT
Scalability
24
Small dataset: CNN performs much better than ViT
Large dataset: CNN saturates / ViT steadily improves
Learned patch embedding
The first part of ViT is embedding 16x16 patches into vector tokens
What kind of information is extracted in this stage?
→ CNN filter-like patterns (cf. Gabor filters) are found
25
Learned locality
ViT uses learnable positional embeddings instead of sinusoidal encoding
→ embeddings at nearby positions become similar to each other
Attention heads at shallower layers attend to various distances
26
attending to
local relations
attending to
global relations
Recent discoveries on ViT and related methods
27
A massive ViT pretrained on a massive
dataset shows a great performance gain
→ 90.45% ImageNet top-1
A principled combination of convolution
and attention is more important
→ 88.56% without a massive dataset
Scaling property [Zhai 2021]
With a larger computational budget and dataset size, performance
seems to keep improving without saturation
28
Paper 3: Self-supervised ViT (DINO)
29
Self-supervised learning of ViTs
Self-supervised learning
training a model with supervision generated from an unlabeled dataset
● e.g., next-word prediction, contrastive learning, self-distillation
Why important? → richer training signals than predicting a single class label
The ViT paper studied masked patch prediction
→ worse pretraining performance
  (79.9% self-supervised << 84% supervised)
30
task: predicting the mean color of masked patches
DINO: knowledge distillation with no labels
A BYOL-like distillation framework applicable to both CNNs and ViTs
1. From an input image x, make two crops x1 and x2.
2. Compute the representations g(x1) and g(x2)
with the student and teacher networks, respectively.
3. Update the student so that the representations match,
viewing them as probability distributions p1 and p2.
4. Update the teacher as the moving average of
the student network’s parameters.
Tricks: feeding small crops to the student,
centering of teacher features, epoch-wise teacher update, etc.
31
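A simplified sketch of one DINO-style update following steps 1-4 above (illustrative only: it omits multi-crop, the centering/sharpening details, and the exact schedules, and the function and argument names are mine; student and teacher are assumed to share the same architecture).

```python
import torch
import torch.nn.functional as F

def dino_step(student, teacher, optimizer, x1, x2, tau_s=0.1, tau_t=0.04, m=0.996):
    # Teacher sees one crop (no gradients), student sees the other.
    with torch.no_grad():
        p2 = F.softmax(teacher(x2) / tau_t, dim=-1)
    log_p1 = F.log_softmax(student(x1) / tau_s, dim=-1)
    # Match the two distributions: cross-entropy between p2 and p1.
    loss = -(p2 * log_p1).sum(dim=-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Teacher = exponential moving average of the student's parameters.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(m).add_(ps, alpha=1 - m)
    return loss.item()
```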
Results on transfer learning after pretraining
● An improvement over supervised pretraining was reported.
● Performance comparable to ViT pretrained with a massive supervised dataset can
be obtained with ImageNet data alone
32
Emerging property: attention maps as segmentation
Attention maps of the output token at the final layer are found to attend
to semantic objects without any segmentation supervision
33
Final attention
map of this token
colors indicate different attention heads
a supervised ViT does not show this
property
Limitations of Transformer-based methods
Cost of calculating all-to-all attention
● The computational complexity of self-attention is O(N²), prohibiting
processing of large sequence sizes (= high-resolution ViT)
● The computational budget required for pre-training is also very high
Unknown potential for dense prediction
● Unsupervised segmentation by DINO is very interesting, but it is not at
the level of real applications yet.
● CNNs have great power for dense prediction, e.g., image generation,
segmentation, depth estimation. Can ViT do these tasks?
34
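As a rough, illustrative calculation (my numbers, not the slides'): with 16x16 patches, a 224x224 image gives N = 196 tokens, i.e., about 38k pairwise attention scores per head per layer; halving the patch size to 8x8 gives N = 784 and about 615k pairs, a 16x increase, which is why high-resolution ViTs quickly become expensive.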
Another interesting topic on ViT
Importance of the optimization algorithm: with SAM (sharpness-aware
minimization), ViT can perform better than CNNs without large-scale pretraining
35
● NLP-inspired Transformer (pure attention) models show impressive
results on image recognition problems as well
● Their performance scales well as the model/dataset size increases
● Very interesting properties such as unsupervised segmentation have been found
My impressions
● Attention seems to have the potential to detect complex visual entities,
such as infiltration in whole-slide images (WSI), that require observation at multiple scales
● How is translation equivariance realized in ViTs?
● The bottleneck is increasingly the computational (monetary) budget
(most important Transformer papers come from Google, Facebook, OpenAI, MS, …)
Summary
36