Presenter: Ryohei Suzuki
Jun. 11, 2021
Transformer-based approaches for
visual representation learning
Glossary
autoregression: generating a sequence while referring to previously generated outputs
CNN: convolutional neural network
embedding: conversion of data into a vector representation
equivariance: applying an operation A and a transformation f in either order gives the same result
inductive bias: assumptions about the data implicitly introduced by model design
invariance: applying a transformation f before an operation A does not change the result
MLP: multilayer perceptron
NLP: natural language processing
pretraining: training in advance of the target task
self-attention: self-attention mechanism
2
Today’s papers
Vaswani et al. (Google Brain, UToronto),
NeurIPS 2017
Dosovitskiy et al. (Google Brain),
ICLR 2021
Caron et al. (FAIR, INRIA, Sorbonne),
arXiv preprint 2021 3
● CNNs (e.g., VGG, ResNet) have been the de facto standard for
visual tasks in deep learning
○ Convolution provides favorable properties for image processing
● Recently, alternative approaches have been emerging
○ Transformer-based methods, e.g., Attention-CNN, ViT
○ MLP-based methods, e.g., MLP-Mixer, gMLP
● In particular, NLP-inspired Transformer-based approaches have
shown promising performance and interesting properties
Context
4
Review: Convolutional Neural Network (CNN)
● Convolution = multi-channel filtering by learnable kernels
● Typical modern CNNs contain convolution, activation, pooling,
skip-connection, normalization (e.g., BN, IN, GN), etc.
kernel =
linear map of finite-size window
5
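To make "kernel = linear map of a finite-size window" concrete, here is a minimal NumPy sketch (mine, not from the slides); the function name conv2d and all shapes are illustrative only.

```python
import numpy as np

# Minimal sketch: a single 2D convolution as "multi-channel filtering by a
# learnable kernel" -- the same linear map applied to every finite-size window.
def conv2d(x, kernel):
    """x: (C_in, H, W), kernel: (C_out, C_in, kH, kW); valid padding, stride 1."""
    c_in, h, w = x.shape
    c_out, _, kh, kw = kernel.shape
    out = np.zeros((c_out, h - kh + 1, w - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            window = x[:, i:i + kh, j:j + kw]                    # finite-size window
            out[:, i, j] = np.tensordot(kernel, window, axes=3)  # linear map
    return out

x = np.random.randn(3, 8, 8)        # e.g., a small RGB patch
k = np.random.randn(16, 3, 3, 3)    # 16 learnable 3x3 kernels
print(conv2d(x, k).shape)           # (16, 6, 6)
```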
Visual inductive biases of CNNs
Inductive bias: regularization on the solution implicitly introduced by model
design, which helps exploit the characteristics of the data
● Locality
○ Natural images have a spatial hierarchy
○ A sequence of convolutions mimics this hierarchical structure
receptive field gradually grows
through multiple convolutions
https://towardsdatascience.com/journey-from-machine-learning-to-deep-learning-8a807e8f3c1c
6
Visual inductive biases of CNNs
● Translation invariance / equivariance
○ invariance: A(f(x)) = A(x); equivariance: A(f(x)) = f(A(x)) (A: processing, f: transformation)
○ Convolution is naturally a translation-equivariant operation
○ Equivariance is in fact broken in CNNs [Zhang 2019, Kayhan 2020] (e.g., by strided downsampling and padding)
translation invariance:
we want the model to output the
same answer for shifted images
translation equivariance:
CNN-after-shift is equivalent to
shift-after-CNN
rotational equivariance is also
sometimes imposed
[Graham et al. 2020]
(figure: the original and the shifted image are both classified as “cat 98%”)
7
Intrinsic problems of CNNs
● Difficulty in handling long-range / irregularly shaped dependencies
○ CNNs can recognize the interaction between two distant points only
after a large number of convolutions builds up a large receptive field
● Low-resolution, blurred representations
○ Partially solved by skip-connections (e.g., U-Net, HRNet)
How to recognize the
interaction between the
bird and the flower?
HRNet [Sun et al. 2019]
8
Ideas from NLP
Self-attention
● Convolution: gather information from the nearby positions
● Self-attention: gather information from the related (attended) positions
○ originally developed in language models
Large-scale pretraining
● Most specific problems provide only a limited amount of data
● ImageNet-pretraining has already been broadly used in CV
● Pretraining with massively large datasets has shown amazing results
in NLP, e.g., GPT-3. (ImageNet 1.2M images vs. GPT-3 500B tokens)
image from: http://bliulab.net/selfAT_fold/
9
Paper 1: Attention is All You Need
● Proposed the Transformer (an attention-based translation model)
● One of the most important ML papers (cited more than 20,000 times!)
● Attention is All You Need is All You Need (many subsequent papers
have proposed “improved” models, but the progress was reported to be quite small [1])
[1] Narang et al. arXiv preprint 2021
10
Vaswani et al. (Google Brain, UToronto)
NeurIPS 2017
Encoder-decoder model
Many “translation” models can be formulated as an encoder-decoder pair
● Encoder extracts meaningful features from the input
● Decoder composes the output from the extracted features
“I am a student”
“Je suis un étudiant”
Dec
Enc
latent variable:
the input transformed into a form that contains the information
necessary for the translation task → can be utilized for multiple downstream tasks
Training a meaningful encoder (feature predictor)
= representation learning
11
Transformer
encoder
decoder
input:
“I am a student”
past output:
“Je suis un”
next output:
“étudiant”
12
Flow of processing
I am a student
Attention Block
Attention Block
Encoder Block
<start> Je suis un
Decoder Block
Attention Block
Attention Block
Decoder Block
étudiant
fed to attention
modules
Encoder Block
13
Parallelized training
I am a student
Attention Block
Attention Block
Encoder Block
<start> Je suis un
Decoder Block
Attention Block
Attention Block
Decoder Block
étudiant
Encoder Block
Je suis un
?
? ? ?
ground truth
14
Causal mask
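As a rough illustration of how the causal mask enables parallelized training (my sketch, not the paper's code): the mask lets position i attend only to positions ≤ i, so all target positions can be predicted in a single pass while never seeing the ground-truth future.

```python
import numpy as np

# Illustrative causal mask: True where attention is allowed
# (lower triangle incl. the diagonal), so the decoder at position i
# can only use positions <= i.
def causal_mask(n):
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -1e9)   # forbid attending to future positions
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 4                                   # e.g., "<start> Je suis un"
scores = np.random.randn(n, n)          # raw attention scores
weights = masked_softmax(scores, causal_mask(n))
print(np.round(weights, 2))             # upper triangle is ~0
```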
Self-attention
Instead of spatial convolution, we want to update the vector at
position i by aggregating information from related positions.
convolution
self-attention
Questions:
● How to know the “related” positions?
● How to aggregate the information (vectors)
from the found related positions? 15
Query-Key-Value attention
At each position i, we convert the input vector x_i into three vectors:
query q_i = W_Q x_i, key k_i = W_K x_i, and value v_i = W_V x_i,
then define the relatedness between positions i and j by the dot product: a_ij = softmax_j(q_i · k_j / √d_k).
The output is the weighted average of the values, weighted by the relatedness: out_i = Σ_j a_ij v_j.
In matrix form: Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V
In the decoder, cross-attention fetches the encoded features: Q comes from the decoder, K and V from the encoder.
16
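A minimal NumPy sketch of the scaled dot-product attention defined above (illustrative only; names such as W_q are my placeholders, not the paper's code):

```python
import numpy as np

# Single-head scaled dot-product attention.
# x: sequence of N d-dimensional tokens; W_q, W_k, W_v: learnable projections.
def attention(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v          # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # relatedness of positions i, j
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)    # softmax over j
    return weights @ V                           # weighted average of values

N, d, d_k = 5, 32, 16
x = np.random.randn(N, d)
W_q, W_k, W_v = (np.random.randn(d, d_k) for _ in range(3))
print(attention(x, W_q, W_k, W_v).shape)         # (5, 16)
```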
Multi-head attention
Attention is basically a weighted average → limited capability for
representing multiple types of relationships
e.g., “I like this lecture”
Multi-head attention first generates h sets of (Q, K, V),
then applies standard attention to each of them in parallel.
→ each branch has different attention targets
The results are concatenated and aggregated by a linear layer.
17
“I” and “this lecture” have different relationships to the word “like”
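A sketch of multi-head attention following the description above: h independent (Q, K, V) branches attend in parallel, and their outputs are concatenated and mixed by a final linear layer (here called W_o; all names are illustrative assumptions).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Multi-head attention as h parallel attention branches + linear aggregation.
def multi_head_attention(x, W_q, W_k, W_v, W_o):
    """x: (N, d); W_q/W_k/W_v: (h, d, d_head); W_o: (h * d_head, d)."""
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):        # each head has its own (Q, K, V)
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        heads.append(A @ V)                      # each head attends to different targets
    return np.concatenate(heads, axis=-1) @ W_o  # join, then linear aggregation

N, d, h = 5, 32, 4
d_head = d // h
x = np.random.randn(N, d)
W_q, W_k, W_v = (np.random.randn(h, d, d_head) for _ in range(3))
W_o = np.random.randn(h * d_head, d)
print(multi_head_attention(x, W_q, W_k, W_v, W_o).shape)   # (5, 32)
```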
Input processing
Input embedding
● Converts the raw input tokens into vectors by a learned projection
Positional encoding
● Injects the (absolute) position of each token into the embedded
vector using sinusoidal functions
18
(figure: positional encoding values plotted over encoding dimension × absolute position)
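A small sketch of the sinusoidal positional encoding from the paper (the helper name positional_encoding is mine):

```python
import numpy as np

# Sinusoidal positional encoding:
#   PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
#   PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
def positional_encoding(n_positions, d_model):
    pos = np.arange(n_positions)[:, None]             # absolute position
    i = np.arange(d_model // 2)[None, :]               # encoding-dimension pair index
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                          # added to the token embeddings

print(positional_encoding(50, 64).shape)   # (50, 64)
```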
Experimental results
Achieved the highest scores on English-to-German/French translation tasks at
about 1/100 of the training cost of the best methods as of 2017...
19
cf. Image GPT
Transformer’s autoregressive
generation can naturally be applied
to the next-pixel prediction task
Transformer variant (GPT-2)
trained on ImageNet shows
impressive image completion
results!
[Chen et al., ICML2020]
20
https://openai.com/blog/image-gpt/
Paper 2: Vision Transformer (ViT)
Question: can we completely discard convolutions and still tackle real
image recognition problems like classification?
→ A pure Transformer architecture pre-trained on a very large
dataset can perform better than modern CNNs.
21
Dosovitskiy et al. (Google Brain),
ICLR 2021
The ViT model
ViT uses small patches of the input image as the “tokens” (words)
Supervised training on the features computed by the encoder
22
classification task
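A minimal sketch of the ViT tokenization step described above (patch splitting + linear embedding + a class token + learnable positional embeddings); this is an illustrative PyTorch reimplementation, not the official code, and the hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

# Turn an image into ViT-style tokens: split into 16x16 patches, linearly
# embed each patch, prepend a learnable class token, add positional embeddings.
class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # A conv with kernel = stride = patch size is the linear patch projection
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

    def forward(self, x):                                   # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)    # (B, N, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)   # (2, 197, 768) -- 196 patches + 1 class token
```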
Performance compared to SoTA CNNs
Experiment: fine-tuned top-1 accuracy after supervised pretraining
vs. BiT-L (supervised ResNet) and Noisy Student (semi-supervised EfficientNet)
23
JFT-300M: Google’s internal
dataset consisting of 300
million images
ImageNet-21k: superset of
ImageNet (1k) consisting of
21,000 classes, 14M images
Note: BiT and NS are trained with JFT
Scalability
24
Small dataset: CNN performs much better than ViT
Large dataset: CNN saturates / ViT steadily improves
Learned patch embedding
The first part of ViT is embedding 16x16 patches into vector tokens
What kind of information is extracted in this stage?
→ CNN filter-like patterns (cf. Gabor filters) are found
25
Learned locality
ViT uses learnable positional embeddings instead of sinusoidal encoding
→ embeddings at nearby positions become similar to each other
Attention heads at shallower layers attend to various distances
26
attending to
local relations
attending to
global relations
Recent discoveries on ViT and related methods
27
A massive ViT pretrained on a massive
dataset shows a great performance gain
→ 90.45% ImageNet top-1
A principled combination of convolution
and attention is more important
→ 88.56% without a massive dataset
Scaling property [Zhai 2021]
With a larger computational budget and dataset size, performance
seems to keep improving without saturation
28
Paper 3: Self-supervised ViT (DINO)
29
Self-supervised learning of ViTs
Self-supervised learning
training a model with supervision generated from an unlabeled dataset
● e.g., next-word prediction, contrastive learning, self-distillation
Why important? → richer training signals than predicting a single class label
The ViT paper studied masked patch prediction
→ worse pretraining performance
  (79.9% self-supervised << 84% supervised)
30
task: predicting the mean color of masked patches
DINO: knowledge distillation with no labels
A BYOL-like distillation framework applicable to both CNNs and ViTs
1. From an input image x, make two crops x1 and x2.
2. Compute the representations g(x1) and g(x2)
with the student and teacher networks, respectively.
3. Update the student so that the representations match,
viewing them as probability distributions p1 and p2.
4. Update the teacher as the moving average of
the student network’s parameters.
Tricks: feeding small crops to the student,
centering of teacher features, epoch-wise teacher update, etc.
31
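A simplified sketch of one DINO-style update following steps 1-4 above (illustrative only: it omits multi-crop, the centering/sharpening details, and the exact schedules, and the function and argument names are mine; student and teacher are assumed to share the same architecture).

```python
import torch
import torch.nn.functional as F

def dino_step(student, teacher, optimizer, x1, x2, tau_s=0.1, tau_t=0.04, m=0.996):
    # Teacher sees one crop (no gradients), student sees the other.
    with torch.no_grad():
        p2 = F.softmax(teacher(x2) / tau_t, dim=-1)
    log_p1 = F.log_softmax(student(x1) / tau_s, dim=-1)
    # Match the two distributions: cross-entropy between p2 and p1.
    loss = -(p2 * log_p1).sum(dim=-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Teacher = exponential moving average of the student's parameters.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(m).add_(ps, alpha=1 - m)
    return loss.item()
```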
Results on transfer learning after pretraining
● An improvement over supervised pretraining was reported.
● Performance comparable to ViT pretrained with a massive supervised dataset can
be obtained with ImageNet data alone
32
Emerging property: attention maps as segmentation
Attention maps of the output token at the final layer are found to attend
to semantic objects without any segmentation supervision
33
Final attention
map of this token
colors indicate different attention heads
a supervised ViT does not show this
property
Limitations of Transformer-based methods
Cost of calculating all-to-all attention
● The computational complexity of self-attention is O(N²), prohibiting
processing of large sequence sizes (= high-resolution ViT)
● The computational budget required for pre-training is also very high
Unknown potential for dense prediction
● Unsupervised segmentation by DINO is very interesting, but it is not at
the level of real applications yet.
● CNNs have great power for dense prediction, e.g., image generation,
segmentation, depth estimation. Can ViT do these tasks?
34
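As a rough, illustrative calculation (my numbers, not the slides'): with 16x16 patches, a 224x224 image gives N = 196 tokens, i.e., about 38k pairwise attention scores per head per layer; halving the patch size to 8x8 gives N = 784 and about 615k pairs, a 16x increase, which is why high-resolution ViTs quickly become expensive.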
Another interesting topic on ViT
Importance of the optimization algorithm: with SAM (sharpness-aware
minimization), ViT can perform better than CNNs without large-scale pretraining
35
● NLP-inspired Transformer (pure attention) models show impressive
results on image recognition problems as well
● Their performance scales well as the model/dataset size increases
● Very interesting properties such as unsupervised segmentation have been found
My impressions
● Attention seems to have the potential to detect complex visual entities,
such as infiltration in whole-slide images (WSI), that require observation at multiple scales
● How is translation equivariance realized in ViTs?
● The bottleneck is increasingly the computational (monetary) budget
(most important Transformer papers come from Google, Facebook, OpenAI, MS, …)
Summary
36