You Only Look at One Sequence (YOLOS):
Rethinking Transformer in Vision through
Object Detection
김병현
Image Processing Team
김선옥, 안종식, 이찬혁, 홍은기
Here comes YOLOS!!
 YOLOS
Transformer-based 2D object detection model
Uses only a Transformer Encoder & MLP Heads
2
YOLOS
YOLOS Performance
Comparison with SOTA object detectors
YOLOS Detection Example
Here comes YOLOS!!
 YOLOS
Transformer-based 2D object detection model
Uses only a Transformer Encoder & MLP Heads
3
YOLOS
YOLOS Performance
Comparison with SOTA object detectors
YOLOS Detection Example
Transformer Encoder
Transformer is Born to Transfer
4
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural
information processing systems (pp. 5998-6008).
Transformer is for sequential data such as natural language!!
Transformer
Vision Transformer
 AN IMAGE IS WORTH 16X16 WORDS
5
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X.,
Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16 words:
Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Can an image be sequential data…?
6
 In Object Detection ….
Can an image be sequential data…?
7
Dog : 0.89 Dog : 0.69 Person : 0.51
 In Object Detection ….
Can an image be sequential data…?
8
 In Object Detection ….
Can an image be sequential data…?
9
 In Object Detection ….
Can an image be sequential data…?
10
Severe spatial information loss during position embedding
 In Object Detection ….
How to Apply Transformer to Object Detection
 ViT-FRCNN
11
Strategy 1: Concatenate the patches back into a 2D feature map
Beal, J., Kim, E., Tzeng, E., Park, D. H., Zhai, A., & Kislyuk, D. (2020). Toward
transformer-based object detection. arXiv preprint arXiv:2012.09958.
How to Apply Transformer to Object Detection
 ViT-FRCNN
12
Beal, J., Kim, E., Tzeng, E., Park, D. H., Zhai, A., & Kislyuk, D. (2020). Toward
transformer-based object detection. arXiv preprint arXiv:2012.09958.
Strategy 1: Concatenate the patches back into a 2D feature map
How to Apply Transformer to Object Detection
 DETR
13
Strategy 2: CNN Feature Extractor + Positional Encoding + Bipartite Matching Loss
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S.
(2020, August). End-to-end object detection with transformers. In European
Conference on Computer Vision (pp. 213-229). Springer, Cham.
How to Apply Transformer to Object Detection
 Swin Transformer
14
Strategy 3: Patch embedding with different patch sizes
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., ... & Guo, B. (2021). Swin
transformer: Hierarchical vision transformer using shifted windows. arXiv preprint
arXiv:2103.14030.
How to Apply Transformer to Object Detection
15
Can a Transformer perform 2D object detection as a pure sequence-to-sequence method?
Q & A
16
YOLOS = ViT + Bipartite Matching Loss
17
ViT + Bipartite Matching Loss (from DETR) → YOLOS
Architecture of YOLOS
18
Architecture of YOLOS
19
ViT + Bipartite Matching Loss (from DETR)
Architecture of YOLOS
20
1. Patch Token & Patch Embedding
Architecture of YOLOS
21
2. Transformer Encoder
Architecture of YOLOS
22
3. Bipartite Loss & Detection Token
Q & A
23
Component 1 – Patch Token & Patch Embedding
24
Conv2d: 16×16 kernel, stride 16, embedding dimension 768 — the original 1280×960 image becomes an 80×60×768 feature map.
Component 1 – Patch Token & Patch Embedding
25
Conv2d: 16×16 kernel, stride 16, embedding dimension 768 — the feature map is then flattened into a sequence of 4800 patch tokens of dimension 768.
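A minimal PyTorch sketch of this patch embedding, assuming the 1280×960 input shown in the figure (layer and variable names are illustrative, not the official YOLOS code):

import torch
import torch.nn as nn

# A 16x16 convolution with stride 16 projects each non-overlapping
# 16x16 patch to a 768-dimensional embedding.
patch_embed = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)

image = torch.randn(1, 3, 960, 1280)                   # (B, C, H, W): a 1280x960 RGB image
feature_map = patch_embed(image)                       # (1, 768, 60, 80)
patch_tokens = feature_map.flatten(2).transpose(1, 2)  # (1, 4800, 768): 4800 tokens of dim 768
print(patch_tokens.shape)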
Component 2 – Vision Transformer (Backbone)
26
Input sequence: patch tokens (the flattened feature map) + detection tokens, with a position embedding added.
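A hedged sketch of how the detection tokens and position embedding could be combined with the patch tokens before the encoder. The 100 detection tokens and the 12-layer, 12-head encoder follow the DETR / ViT-Base convention; the rest is an illustrative assumption:

import torch
import torch.nn as nn

embed_dim, num_patches, num_det_tokens = 768, 4800, 100

# Learnable detection tokens and a learnable position embedding
# covering the whole sequence (patch tokens + detection tokens).
det_tokens = nn.Parameter(torch.zeros(1, num_det_tokens, embed_dim))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + num_det_tokens, embed_dim))

patch_tokens = torch.randn(1, num_patches, embed_dim)        # from the patch embedding above
tokens = torch.cat([patch_tokens, det_tokens], dim=1) + pos_embed

encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)

encoded = encoder(tokens)                                    # (1, 4900, 768)
det_out = encoded[:, -num_det_tokens:, :]                    # outputs of the 100 detection tokens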
Component 2 – Vision Transformer (Backbone)
27
Each detection token is fed to two multi-layer perceptron heads: one predicts class scores (No. of classes) and the other predicts the box (x, y, w, h), normalized to [0, 1] by a sigmoid.
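A sketch of the two MLP heads applied to each detection-token output. The extra "no object" class follows DETR; the class count (91 for COCO) and the head depth/width are assumptions:

import torch
import torch.nn as nn

embed_dim, num_classes = 768, 91             # 91 = COCO categories (assumed)

# Both heads are small MLPs, per the slide; exact depth/width is an assumption.
class_head = nn.Sequential(
    nn.Linear(embed_dim, embed_dim), nn.ReLU(),
    nn.Linear(embed_dim, num_classes + 1),   # +1 "no object" class, as in DETR
)
box_head = nn.Sequential(
    nn.Linear(embed_dim, embed_dim), nn.ReLU(),
    nn.Linear(embed_dim, 4),
)

det_out = torch.randn(1, 100, embed_dim)     # detection-token outputs from the encoder
class_logits = class_head(det_out)           # (1, 100, num_classes + 1)
boxes = box_head(det_out).sigmoid()          # (1, 100, 4): (x, y, w, h) normalized to [0, 1]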
Component 3 – Bipartite Matching Loss
28
Component 3 – Bipartite Matching Loss
29
Prediction (100 detection tokens): 1., 2., 3., …, 100. — each with class scores and (x, y, w, h)
Ground Truth (n objects): 1., 2., 3., …, n. — each with a class and (x, y, w, h)
Component 3 – Bipartite Matching Loss
30
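A minimal sketch of the bipartite (Hungarian) matching behind this loss, using scipy's linear_sum_assignment. The cost here combines only class probability and L1 box distance, whereas DETR/YOLOS additionally use a generalized-IoU term:

import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """One-to-one matching of 100 predictions to n ground-truth objects."""
    prob = pred_logits.softmax(-1)                       # (100, num_classes + 1)
    cost_class = -prob[:, gt_labels]                     # (100, n): high class prob -> low cost
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)   # (100, n): L1 distance between boxes
    cost = (cost_class + cost_bbox).detach().cpu().numpy()
    pred_idx, gt_idx = linear_sum_assignment(cost)       # optimal one-to-one pairing
    return pred_idx, gt_idx

The loss is then computed over the matched pairs (classification plus box terms), while the unmatched predictions are supervised toward the "no object" class.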
Q & A
31
Experiments - Model Variants
32
Experiments - The Effects of Pre-training
33
Experiments - The Effects of Pre-training
34
Rethinking ImageNet Pre-training (He et al., 2018)
Self Supervised Learning
Experiments - Comparisons with CNNs
35
Experiments - Comparison with DETR
36
Experiments - Comparisons with Other Models
37
YOLOS
Meanings of the Results
 Each detection token specializes in a certain region and object size
38
Det-Tok 1 – Det-Tok 10: center coordinates of bounding box predictions, by object size (Small, Medium, Large)
Meanings of the Results
 Each detection token specializes in a certain region and object size
39
Meanings of the Results
 Category Insensitive
40
Histogram: No. of objects per object category — Ground Truth vs. Prediction
Discussion
 Points discussed within the Image Processing Team
Is there really a compelling reason to insist on Transformers?
• They learn long-distance dependencies well.1)
• Unlike CNNs, Transformers have no inductive bias, so they are harder to train,
but once trained properly they can outperform CNNs.2)
• Wouldn't CNNs and Transformers be complementary when used together?
Note: the inductive bias of CNNs
→ "In computer vision tasks, spatial information helps learning."
This model is easy to implement given an understanding of NLP models.
The contribution of the Bipartite Matching Loss is confirmed once again.
• An object detector can be trained even with a relatively simple model structure
41
1) Intriguing Properties of Vision Transformers, https://arxiv.org/pdf/2105.10497.pdf
2) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020).
An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Q & A
42