End-to-End Object Detection with Transformers
Hwang seung hyun
Yonsei University Severance Hospital CCIDS
Facebook AI | ECCV 2020
2020.10.11
Contents
01 Introduction
02 Related Work
03 Methods and Experiments
04 Conclusion
Introduction – Background
• Object detection is a set prediction task: predict a set of bounding boxes and class labels.
• Modern detectors address this indirectly, through surrogate regression and classification
problems defined on anchors or proposals, followed by post-processing such as non-maximum
suppression (NMS).
• The performance of such methods is significantly influenced by these post-processing steps.
Introduction – Proposal
• Propose a direct set prediction approach that bypasses the surrogate tasks.
• Adopt an encoder-decoder architecture based on transformers.
• The self-attention mechanisms of transformers, which model all pairwise interactions
between elements in a sequence, help remove duplicate predictions.
• DEtection TRansformer (DETR) predicts all objects at once and is trained end-to-end with a
set loss function that performs bipartite matching between predicted and ground truth (GT) objects.
• DETR combines a bipartite matching loss with transformers that use parallel decoding.
[Overview of proposed framework]
Introduction – Contribution
• DETR simplifies the detection pipeline by dropping multiple hand-designed
components that encode prior knowledge, like spatial anchors or NMS.
• DETR doesn’t require any customized layers and can be reproduced easily in any
framework that contains standard CNN and transformer classes.
• DETR extends easily to more complex tasks such as panoptic segmentation.
• DETR demonstrates accuracy and run-time performance on par with the Faster R-CNN
baseline on the COCO object detection dataset.
Related Work
Set Prediction
• There is no canonical deep learning model for directly predicting sets.
• Near-duplicates must be avoided → most detectors use post-processing such as NMS.
• Direct set prediction methods need global inference schemes that model interactions
between all predicted elements to avoid redundancy.
• For constant-size set prediction, fully connected networks (FCN) or RNNs are used.
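To make the post-processing step mentioned above concrete, here is a minimal sketch of greedy NMS in plain Python. It is not taken from the slides or from any particular detector; the function names `iou` and `nms`, the corner box format, and the 0.5 threshold are assumptions chosen for illustration.

```python
from typing import List, Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes on one object plus one separate box.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

The near-duplicate second box is suppressed because it overlaps the higher-scoring first box. This hand-tuned threshold is the kind of heuristic that a direct set prediction approach aims to avoid.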
Transformers and Parallel Decoding
• The transformer is an attention-based building block introduced for machine translation.
• Attention mechanisms aggregate information from the entire input sequence → self-attention layers.
• Parallel sequence generation was developed in the domains of audio, machine translation,
and speech recognition.
Object Detection
• Two-stage detectors predict boxes w.r.t. proposals.
• Single-stage methods make predictions w.r.t. anchors or a grid of possible object centers.
• DETR removes this hand-crafted process by directly predicting the set of detections with
absolute box prediction w.r.t. the input image.
[YOLO]
[R-CNN]
Methods and Experiments
Proposed Framework
• Backbone: A conventional CNN backbone extracts a compact feature representation of the input image.
• Transformer Encoder: A 1x1 convolution reduces the channel dimension of the feature map, which is then
flattened into a sequence. Positional encodings are added to the input of each attention layer. Each
encoder layer consists of a multi-head self-attention module and an FFN.
• Transformer Decoder: Transforms N input embeddings (learnt positional encodings, called object queries)
using multi-head self-attention and encoder-decoder attention. The N objects are decoded in parallel at
each decoder layer, which enables global reasoning about all objects using the pairwise relations between them.
• Feed-Forward Networks (FFN): A 3-layer perceptron with ReLU activations predicts the normalized center
coordinates, height, and width of each box, and a linear projection layer predicts the class label using
a softmax function.
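As a rough illustration of the prediction heads described above, the sketch below turns the raw outputs for one decoder slot into a detection: a softmax over the class logits and a conversion of the normalized (center x, center y, width, height) box into pixel corners. The helper names, the example numbers, and the convention that the last class index means "no object" are assumptions made for this sketch, not details given on the slides.

```python
import math
from typing import List, Optional, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cxcywh_to_xyxy(box: Tuple[float, float, float, float],
                   img_w: int, img_h: int) -> Tuple[float, float, float, float]:
    """Convert a normalized (cx, cy, w, h) box to absolute pixel corners."""
    cx, cy, w, h = box
    return ((cx - w / 2) * img_w, (cy - h / 2) * img_h,
            (cx + w / 2) * img_w, (cy + h / 2) * img_h)


def decode_query(logits: List[float], box: Tuple[float, float, float, float],
                 img_w: int, img_h: int
                 ) -> Optional[Tuple[int, float, Tuple[float, float, float, float]]]:
    """Return (class index, confidence, pixel box), or None for an empty slot."""
    probs = softmax(logits)
    no_object = len(probs) - 1  # assumption: last index is the "no object" class
    label = max(range(len(probs)), key=probs.__getitem__)
    if label == no_object:
        return None
    return label, probs[label], cxcywh_to_xyxy(box, img_w, img_h)


# One slot that fires for class 0 and one slot that predicts "no object".
detection = decode_query([2.0, 0.1, -1.0], (0.5, 0.5, 0.2, 0.4), 640, 480)
empty = decode_query([0.0, 0.0, 5.0], (0.5, 0.5, 0.2, 0.4), 640, 480)
print(detection, empty)
```

Because the boxes are predicted relative to the whole image, no anchor or proposal is needed to decode them. A practical pipeline would typically also apply a confidence threshold, which this sketch leaves out.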
Set Prediction Loss
• DETR infers a fixed-size set of N predictions in a single pass through the decoder.
• Predicted objects must be scored against the ground truth in terms of class, position, and size.
• The loss first finds an optimal bipartite matching between predicted and ground truth objects, and
then optimizes object-specific (bounding box) losses.
• The Hungarian loss is computed over all matched pairs after each decoder layer. It is a linear
combination of a negative log-likelihood for class prediction and a box loss.
• Auxiliary decoding losses in the decoder help the model output the correct number of objects of
each class.
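The matching step above can be sketched on a toy example. This is a simplified illustration under stated assumptions: the pairwise cost here is the negative predicted class probability plus a weighted L1 box distance (any IoU-based term is omitted), the weight `box_weight=5.0` is illustrative, and the optimal assignment is found by exhaustive search over permutations rather than by the Hungarian algorithm. For realistic N, a solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

# Assumed box format for this sketch: normalized (cx, cy, w, h).
Box = Tuple[float, float, float, float]


def l1(a: Box, b: Box) -> float:
    """L1 distance between two boxes."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match(pred_probs: Sequence[Sequence[float]], pred_boxes: Sequence[Box],
          gt_labels: Sequence[int], gt_boxes: Sequence[Box],
          box_weight: float = 5.0) -> Tuple[Tuple[int, ...], float]:
    """Return (assignment, total cost), where assignment[i] is the index of
    the prediction matched to ground-truth object i."""
    n_pred, n_gt = len(pred_boxes), len(gt_boxes)
    # cost[i][j]: cost of explaining ground truth i with prediction j.
    cost: List[List[float]] = [
        [-pred_probs[j][gt_labels[i]] + box_weight * l1(pred_boxes[j], gt_boxes[i])
         for j in range(n_pred)]
        for i in range(n_gt)
    ]
    best: Tuple[int, ...] = ()
    best_cost = float("inf")
    # Exhaustive search over one-to-one assignments (fine only for tiny N).
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[i][perm[i]] for i in range(n_gt))
        if total < best_cost:
            best, best_cost = perm, total
    return best, best_cost


# N = 3 predictions over classes {0, 1} plus a "no object" column; 2 ground-truth objects.
pred_probs = [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.05, 0.05, 0.9]]
pred_boxes = [(0.7, 0.7, 0.2, 0.2), (0.3, 0.3, 0.2, 0.2), (0.5, 0.5, 0.9, 0.9)]
gt_labels = [0, 1]
gt_boxes = [(0.3, 0.3, 0.2, 0.2), (0.7, 0.7, 0.2, 0.2)]

assignment, total_cost = match(pred_probs, pred_boxes, gt_labels, gt_boxes)
unmatched = [j for j in range(len(pred_boxes)) if j not in assignment]
print(assignment, unmatched)
```

Predictions left unmatched (here the third slot) are treated as "no object" targets, so the classification term of the loss is what pushes surplus slots to stay empty.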
Experiments – Dataset and Settings
• COCO 2017 detection and panoptic segmentation datasets
- 118k training images and 5k validation images
• Compared with Faster R-CNN using AP.
- DETR: ResNet-50 backbone
- DETR-R101: ResNet-101 backbone
- DETR-DC5: dilated convolution in the last backbone stage (dilated C5)
- DETR-DC5-R101: dilated DETR-R101
• The baseline model was trained for 300 epochs on 16 V100 GPUs, which takes
3 days, with 4 images per GPU.
Experiments
• DETR uses a transformer with 6 encoder and 6 decoder layers of width 256 and 8 attention heads.
• DETR is competitive with Faster R-CNN at a comparable number of parameters.
• Performance improves on large objects but still lags on small objects.
→ The gain on large objects is attributed to the global processing of information by self-attention.
• Encoder seems to separate instances already, simplifying object extraction and
localization for the decoder.
• Since encoder has separated instances via global attention, the decoder only needs to
attend to the extremities to extract the class and object boundaries.
Experiments – DETR for Panoptic Segmentation
• A mask head is added on top of the decoder outputs; it predicts a binary mask for each
of the predicted boxes.
• The mask head takes the transformer decoder output for each object and computes multi-head
attention scores of this embedding over the encoder output, generating attention heatmaps per object.
• For the final prediction, an FPN-like architecture is used.
• The mask head can be trained either jointly or in a two-step process (train DETR for boxes
only, then freeze all the weights and train only the mask head for 25 epochs).
Conclusion
• DETR is a new design for object detection based on transformers and a
bipartite matching loss for direct set prediction.
• It achieves results comparable to an optimized Faster R-CNN baseline.
• DETR achieves significantly better performance on large objects.
• DETR shows strength in segmenting stuff classes, owing to the global
reasoning allowed by the encoder attention.
• Challenges remain in training, optimization, and performance on small objects.