You Only Look at One Sequence (YOLOS):
Rethinking Transformer in Vision through
Object Detection
김병현
Image Processing Team
김선옥, 안종식, 이찬혁, 홍은기
Here comes YOLOS!!
 YOLOS
Transformer-based 2D object detection model
Uses only a Transformer Encoder & MLP Heads
2
YOLOS
YOLOS Performance
Comparison with SOTA object detectors
YOLOS Detection Example
Here comes YOLOS!!
 YOLOS
Transformer-based 2D object detection model
Uses only a Transformer Encoder & MLP Heads
3
YOLOS
YOLOS Performance
Comparison with SOTA object detectors
YOLOS Detection Example
Transformer Encoder
Transformer is Born to Transfer
4
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural
information processing systems (pp. 5998-6008).
Transformer is for sequential data such as natural language!!
Transformer
Vision Transformer
 AN IMAGE IS WORTH 16X16 WORDS
5
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X.,
Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16 words:
Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Can an image be sequential data…?
6
 In Object Detection ….
Can an image be sequential data…?
7
Dog : 0.89 Dog : 0.69 Person : 0.51
 In Object Detection ….
Can an image be sequential data…?
8
 In Object Detection ….
Can an image be sequential data…?
9
 In Object Detection ….
Can an image be sequential data…?
10
Severe spatial information loss during position embedding
 In Object Detection ….
How to Apply Transformer to Object Detection
 ViT-FRCNN
11
Strategy 1: Concatenate the patches back into a 2D feature map
Beal, J., Kim, E., Tzeng, E., Park, D. H., Zhai, A., & Kislyuk, D. (2020). Toward
transformer-based object detection. arXiv preprint arXiv:2012.09958.
How to Apply Transformer to Object Detection
 ViT-FRCNN
12
Beal, J., Kim, E., Tzeng, E., Park, D. H., Zhai, A., & Kislyuk, D. (2020). Toward
transformer-based object detection. arXiv preprint arXiv:2012.09958.
Strategy 1: Concatenate the patches back into a 2D feature map
How to Apply Transformer to Object Detection
 DETR
13
Strategy 2: CNN Feature Extractor + Positional Encoding + Bipartite Matching Loss
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S.
(2020, August). End-to-end object detection with transformers. In European
Conference on Computer Vision (pp. 213-229). Springer, Cham.
How to Apply Transformer to Object Detection
 Swin Transformer
14
Strategy 3: Patch embedding with different patch sizes
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., ... & Guo, B. (2021). Swin
transformer: Hierarchical vision transformer using shifted windows. arXiv preprint
arXiv:2103.14030.
How to Apply Transformer to Object Detection
15
Can a Transformer perform 2D object detection as a pure sequence-to-sequence method?
Q & A
16
YOLOS = ViT + Bipartite Matching Loss
17
ViT + Bipartite Matching Loss (from DETR) → YOLOS
Architecture of YOLOS
18
Architecture of YOLOS
19
ViT + Bipartite Matching Loss (from DETR)
Architecture of YOLOS
20
1. Patch Token & Patch Embedding
Architecture of YOLOS
21
2. Transformer Encoder
Architecture of YOLOS
22
3. Bipartite Loss & Detection Token
Q & A
23
Component 1 – Patch Token & Patch Embedding
24
Conv2d: 16×16 kernel, stride 16, embedding dimension 768 — the original 1280×960 image becomes an 80×60×768 feature map.
Component 1 – Patch Token & Patch Embedding
25
Conv2d: 16×16 kernel, stride 16, embedding dimension 768 — the feature map is then flattened into a sequence of 4800 patch tokens of dimension 768.
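A minimal PyTorch sketch of this patch embedding, assuming the 1280×960 input shown in the figure (layer and variable names are illustrative, not the official YOLOS code):

import torch
import torch.nn as nn

# A 16x16 convolution with stride 16 projects each non-overlapping
# 16x16 patch to a 768-dimensional embedding.
patch_embed = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)

image = torch.randn(1, 3, 960, 1280)                   # (B, C, H, W): a 1280x960 RGB image
feature_map = patch_embed(image)                       # (1, 768, 60, 80)
patch_tokens = feature_map.flatten(2).transpose(1, 2)  # (1, 4800, 768): 4800 tokens of dim 768
print(patch_tokens.shape)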
Component 2 – Vision Transformer (Backbone)
26
Input sequence: patch tokens (the flattened feature map) + detection tokens, with a position embedding added.
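A hedged sketch of how the detection tokens and position embedding could be combined with the patch tokens before the encoder. The 100 detection tokens and the 12-layer, 12-head encoder follow the DETR / ViT-Base convention; the rest is an illustrative assumption:

import torch
import torch.nn as nn

embed_dim, num_patches, num_det_tokens = 768, 4800, 100

# Learnable detection tokens and a learnable position embedding
# covering the whole sequence (patch tokens + detection tokens).
det_tokens = nn.Parameter(torch.zeros(1, num_det_tokens, embed_dim))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + num_det_tokens, embed_dim))

patch_tokens = torch.randn(1, num_patches, embed_dim)        # from the patch embedding above
tokens = torch.cat([patch_tokens, det_tokens], dim=1) + pos_embed

encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)

encoded = encoder(tokens)                                    # (1, 4900, 768)
det_out = encoded[:, -num_det_tokens:, :]                    # outputs of the 100 detection tokens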
Component 2 – Vision Transformer (Backbone)
27
Each detection token is fed to two multi-layer perceptron heads: one predicts class scores (No. of classes) and the other predicts the box (x, y, w, h), normalized to [0, 1] by a sigmoid.
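A sketch of the two MLP heads applied to each detection-token output. The extra "no object" class follows DETR; the class count (91 for COCO) and the head depth/width are assumptions:

import torch
import torch.nn as nn

embed_dim, num_classes = 768, 91             # 91 = COCO categories (assumed)

# Both heads are small MLPs, per the slide; exact depth/width is an assumption.
class_head = nn.Sequential(
    nn.Linear(embed_dim, embed_dim), nn.ReLU(),
    nn.Linear(embed_dim, num_classes + 1),   # +1 "no object" class, as in DETR
)
box_head = nn.Sequential(
    nn.Linear(embed_dim, embed_dim), nn.ReLU(),
    nn.Linear(embed_dim, 4),
)

det_out = torch.randn(1, 100, embed_dim)     # detection-token outputs from the encoder
class_logits = class_head(det_out)           # (1, 100, num_classes + 1)
boxes = box_head(det_out).sigmoid()          # (1, 100, 4): (x, y, w, h) normalized to [0, 1]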
Component 3 – Bipartite Matching Loss
28
Component 3 – Bipartite Matching Loss
29
Prediction (100 detection tokens): 1., 2., 3., …, 100. — each with class scores and (x, y, w, h)
Ground Truth (n objects): 1., 2., 3., …, n. — each with a class and (x, y, w, h)
Component 3 – Bipartite Matching Loss
30
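A minimal sketch of the bipartite (Hungarian) matching behind this loss, using scipy's linear_sum_assignment. The cost here combines only class probability and L1 box distance, whereas DETR/YOLOS additionally use a generalized-IoU term:

import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """One-to-one matching of 100 predictions to n ground-truth objects."""
    prob = pred_logits.softmax(-1)                       # (100, num_classes + 1)
    cost_class = -prob[:, gt_labels]                     # (100, n): high class prob -> low cost
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)   # (100, n): L1 distance between boxes
    cost = (cost_class + cost_bbox).detach().cpu().numpy()
    pred_idx, gt_idx = linear_sum_assignment(cost)       # optimal one-to-one pairing
    return pred_idx, gt_idx

The loss is then computed over the matched pairs (classification plus box terms), while the unmatched predictions are supervised toward the "no object" class.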
Q & A
31
Experiments - Model Variants
32
Experiments - The Effects of Pre-training
33
Experiments - The Effects of Pre-training
34
Rethinking ImageNet Pre-training (He et al., 2018)
Self Supervised Learning
Experiments - Comparisons with CNNs
35
Experiments - Comparison with DETR
36
Experiments - Comparisons with Other Models
37
YOLOS
Meanings of the Results
 Each detection token specializes in a certain region and object size
38
Det-Tok 1 – Det-Tok 10: center coordinates of bounding box predictions, by object size (Small, Medium, Large)
Meanings of the Results
 Each detection token specializes in a certain region and object size
39
Meanings of the Results
 Category Insensitive
40
Histogram: No. of objects per object category — Ground Truth vs. Prediction
Discussion
 Points discussed within the Image Processing Team
Is there really a compelling reason to insist on Transformers?
• They learn long-distance dependencies well.1)
• Unlike CNNs, Transformers have no inductive bias, so they are harder to train,
but once trained properly they can outperform CNNs.2)
• Wouldn't CNNs and Transformers be complementary when used together?
Note: the inductive bias of CNNs
→ "In computer vision tasks, spatial information helps learning."
This model is easy to implement given an understanding of NLP models.
The contribution of the Bipartite Matching Loss is confirmed once again.
• An object detector can be trained even with a relatively simple model structure
41
1) Intriguing Properties of Vision Transformers, https://arxiv.org/pdf/2105.10497.pdf
2) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020).
An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Q & A
42