Gang Yu
Megvii Research
Object Detection in Recent 3 Years
Beyond RetinaNet and Mask R-CNN
Schedule of Tutorial
• Lecture 1: Beyond RetinaNet and Mask R-CNN (Gang Yu)
• Lecture 2: AutoML for Object Detection (Xiangyu Zhang)
• Lecture 3: Fine-grained Visual Analysis (Xiu-shen Wei)
Outline
• Introduction to Object Detection
• Modern Object detectors
  • One Stage detector vs Two-stage detector
• Challenges
  • Backbone
  • Head
  • Pretraining
  • Scale
  • Batch Size
  • Crowd
  • NAS
  • Fine-Grained
• Conclusion
What is object detection?
Detection - Evaluation Criteria
Average Precision (AP) and mAP
Figures are from Wikipedia
Detection - Evaluation Criteria
mmAP
Figures are from http://cocodataset.org
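As a rough illustration of the criteria above, the following minimal sketch computes IoU, a single-class AP, and a COCO-style mmAP (AP averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05). The function names and the simplified matching rule are assumptions made for illustration; the official COCO evaluation uses 101-point interpolation, per-category averaging, and crowd handling, none of which is reproduced here.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, gt_boxes, iou_thr=0.5):
    """AP for one class. detections: list of (score, box); gt_boxes: list of boxes.

    Simplification (assumption): each detection is matched to the best
    still-unmatched ground-truth box; each ground truth is used at most once.
    """
    if not gt_boxes:
        return 0.0
    matched = [False] * len(gt_boxes)
    tp = []
    # Visit detections from the highest score to the lowest.
    for _, box in sorted(detections, key=lambda d: -d[0]):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gt_boxes):
            if matched[j]:
                continue
            v = iou(box, gt)
            if v > best_iou:
                best_iou, best_j = v, j
        if best_j >= 0 and best_iou >= iou_thr:
            matched[best_j] = True
            tp.append(1)
        else:
            tp.append(0)
    # Precision and recall after each detection.
    precisions, recalls, cum_tp = [], [], 0
    for k, t in enumerate(tp, start=1):
        cum_tp += t
        precisions.append(cum_tp / k)
        recalls.append(cum_tp / len(gt_boxes))
    # Make precision monotonically non-increasing (all-point interpolation).
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Area under the interpolated precision-recall curve.
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_r) * p
        prev_r = r
    return ap


def coco_style_map(detections, gt_boxes):
    """Average AP over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [0.5 + 0.05 * i for i in range(10)]
    return sum(average_precision(detections, gt_boxes, t) for t in thresholds) / len(thresholds)
```

mAP in the usual sense would further average `average_precision` over object categories.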
How to perform a detection?
• Sliding window: enumerate all the windows (up to millions of windows)
  • VJ detector: cascade chain
• Fully Convolutional network
  • shared computation
Robust Real-time Object Detection; Viola, Jones; IJCV 2001
http://www.vision.caltech.edu/html-files/EE148-2005-Spring/pprs/viola04ijcv.pdf
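To give a feel for why exhaustive sliding-window search is expensive, the sketch below counts how many windows are enumerated for a given image size, set of window sizes, and stride. The function name and the example numbers are illustrative assumptions, not figures from the slides.

```python
def count_windows(img_w, img_h, win_sizes, stride):
    """Number of sliding windows over an img_w x img_h image.

    win_sizes: iterable of (width, height) window shapes.
    stride: step in pixels between neighbouring windows (same in x and y).
    """
    total = 0
    for w, h in win_sizes:
        if w > img_w or h > img_h:
            continue  # this window shape does not fit in the image
        nx = (img_w - w) // stride + 1  # positions along x
        ny = (img_h - h) // stride + 1  # positions along y
        total += nx * ny
    return total


# A single 24x24 window at stride 1 on a 640x480 image already gives
# hundreds of thousands of candidates; adding more scales multiplies this.
n = count_windows(640, 480, [(24, 24)], stride=1)
```

Each of these windows would need its own classifier evaluation, which is what cascades and shared convolutional computation try to avoid.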
General Detection Before Deep Learning
• Feature + classifier
  • Feature
    • Haar Feature
    • HOG (Histogram of Oriented Gradients)
    • LBP (Local Binary Pattern)
    • ACF (Aggregated Channel Features)
    • …
  • Classifier
    • SVM
    • Boosting
    • Random Forest
Traditional Hand-crafted Feature: HoG
General Detection Before Deep Learning
Traditional Methods
• Pros
  • Efficient to compute (e.g., HAAR, ACF) on CPU
  • Easy to debug and to analyze the bad cases
  • Reasonable performance on limited training data
• Cons
  • Limited performance on large datasets
  • Hard to accelerate on GPU
Deep Learning for Object Detection
Categorized by whether the detector follows the "proposal and refine" pipeline
• One Stage
  • Example: Densebox, YOLO (YOLO v2), SSD, RetinaNet
  • Keyword: Anchor, Divide and conquer, loss sampling
• Two Stage
  • Example: RCNN (Fast RCNN, Faster RCNN), RFCN, FPN, Mask RCNN
  • Keyword: speed, performance
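The "Anchor" keyword above can be made concrete with a minimal sketch of dense anchor generation: at every feature-map cell, boxes of several scales and aspect ratios are placed around the cell centre. The function name, the parameter names, and the convention ratio = height / width are assumptions for illustration; individual detectors differ in these details.

```python
import math


def make_anchors(feat_w, feat_h, stride, scales, ratios):
    """Return anchors as (x1, y1, x2, y2) in image coordinates.

    feat_w, feat_h: feature-map size in cells.
    stride: image pixels per feature-map cell.
    scales: anchor side lengths (square root of the anchor area).
    ratios: aspect ratios, here defined as height / width (assumption).
    """
    anchors = []
    for gy in range(feat_h):
        for gx in range(feat_w):
            # Centre of the cell in image coordinates.
            cx = (gx + 0.5) * stride
            cy = (gy + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # Keep the area at s*s while changing the aspect ratio.
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(feat_w=2, feat_h=3, stride=16, scales=[32, 64], ratios=[0.5, 1.0, 2.0])
```

The detector then classifies each anchor and regresses an offset from it, which is why the choice of scales and ratios bounds the object sizes that can be localized well.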
A bit of History
One stage detector
Image -> Feature Extractor -> classification + localization (bbox)
• Densebox (2015), UnitBox (2016), EAST (2017)
• YOLO (2015), YOLOv2 (2016)
• SSD (2015), RON (2017), RetinaNet (2017), DSSD (2017)
• Figure labels: Anchor Free; Anchor imported (at YOLOv2)
Two stage detector
Image -> Feature Extractor -> Proposal (classification + localization (bbox)) -> Refine (classification + localization (bbox))
• RCNN (2014), Fast RCNN (2015), Faster RCNN (2015), RFCN (2016), MultiBox (2014), RFCN++ (2017), FPN (2017), Mask RCNN (2017), OverFeat (2013)
Modern Object detectors
Pipeline: Backbone -> Head -> Postprocess (NMS)
• Modern object detectors
  • RetinaNet
    • f1-f7 for backbone, f3-f7 with 4 convs for head
  • FPN with ROIAlign
    • f1-f6 for backbone, two fcs for head
• Recall vs localization
  • One stage detector: recall is high, but localization ability is compromised
  • Two stage detector: strong localization ability
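The NMS post-processing step in the pipeline above can be sketched as the usual greedy procedure: keep the highest-scoring box, discard any remaining box that overlaps a kept box by more than an IoU threshold, and repeat. This is a minimal pure-Python sketch; the function names and the 0.5 threshold are illustrative assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-maximum suppression. Returns indices of the kept boxes."""
    # Process boxes from the highest score to the lowest.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any already kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the second box overlaps the first and is dropped
```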
One Stage detector: RetinaNet
• FPN Structure
• Focal loss
Focal Loss for Dense Object Detection, Lin et al., ICCV 2017 (Best Student Paper)
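For reference, the focal loss from the cited paper is FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The sketch below is a scalar, pure-Python version for a single binary prediction; the function names are assumptions, and alpha = 0.25, gamma = 2 are the defaults reported in the paper.

```python
import math


def cross_entropy(p, y):
    """Binary cross entropy for predicted foreground probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Focal loss for one binary prediction.

    p: predicted probability of the foreground class.
    y: ground-truth label, 1 for foreground and 0 for background.
    """
    p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
    p_t = p if y == 1 else 1.0 - p           # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)^gamma down-weights examples that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_negative = focal_loss(0.01, 0)   # confident and correct: almost no loss
hard_positive = focal_loss(0.10, 1)   # confident and wrong: large loss
```

With gamma = 0 the modulating factor disappears and the loss reduces to alpha-weighted cross entropy, which is a quick sanity check for an implementation.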
Two-Stage detector: FPN/Mask R-CNN
• FPN Structure
• ROIAlign
Mask R-CNN, He et al., ICCV 2017 (Best Paper)
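The core idea of ROIAlign is to avoid rounding the region coordinates: each output bin is sampled at continuous positions with bilinear interpolation and the samples are averaged. The following is a minimal single-channel sketch under stated assumptions: the ROI is already in feature-map coordinates, pixel centres sit at integer coordinates, and the function names and the 2x2 sampling grid are illustrative rather than taken from any library.

```python
import math


def bilinear(feat, x, y):
    """Bilinearly interpolate feat[row][col] at continuous position (x=col, y=row)."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)  # clamp to the feature map
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feat, roi, out_size=2, samples=2):
    """Pool roi = (x1, y1, x2, y2) into an out_size x out_size grid.

    The ROI is NOT rounded to integer cells; each bin averages
    samples x samples bilinearly interpolated values.
    """
    x1, y1, x2, y2 = roi
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    out = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            acc = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sample points inside the bin.
                    px = x1 + (bx + (sx + 0.5) / samples) * bin_w
                    py = y1 + (by + (sy + 0.5) / samples) * bin_h
                    acc += bilinear(feat, px, py)
            row.append(acc / (samples * samples))
        out.append(row)
    return out


# Feature map whose value equals the column index, so results are easy to check.
feat = [[float(c) for c in range(4)] for _ in range(4)]
pooled = roi_align(feat, roi=(0.5, 0.5, 2.5, 2.5))
```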
What is next for object detection?
• The pipeline seems to be mature
• There is still a large gap between the existing state of the art and product requirements
• The devil is in the details
Challenges Overview
• Backbone
• Head
• Pretraining
• Scale
• Batch Size
• Crowd
• NAS
• Fine-grained
Pipeline: Backbone -> Head -> Postprocess (NMS)
Challenges - Backbone
• The backbone network is designed for the classification task, not for the localization task
  • Receptive field vs spatial resolution
  • Only f1-f5 are pretrained; f6 and f7 (if applicable) are randomly initialized
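The tension between receptive field and spatial resolution can be worked out with the standard recurrence for stacked convolutions: each layer grows the receptive field by (kernel - 1) * dilation * current stride, and multiplies the cumulative stride by its own stride. The sketch below is illustrative only; the function name and the example layer stacks are assumptions. Dilated convolution is shown as one common way to enlarge the receptive field without further downsampling.

```python
def receptive_field(layers):
    """Receptive field and cumulative stride of a stack of conv layers.

    layers: list of (kernel_size, stride, dilation) tuples, input to output.
    Returns (receptive_field_in_pixels, cumulative_stride).
    """
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump  # growth measured in input pixels
        jump *= stride                        # spacing of output cells in input pixels
    return rf, jump


# Two strided 3x3 convs: larger receptive field, but resolution drops by 4x.
strided = receptive_field([(3, 2, 1), (3, 2, 1)])
# Replacing the second stride with dilation 2: resolution drops by only 2x,
# while the receptive field is even larger.
dilated = receptive_field([(3, 2, 1), (3, 1, 2)])
```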
Backbone - DetNet
• DetNet: A Backbone network for Object Detection, Li et al., 2018, https://arxiv.org/pdf/1804.06215.pdf
Challenges - Head
• Speed is significantly improved for the two-stage detector
  • RCNN -> Fast RCNN -> Faster RCNN -> RFCN
• How can a two-stage detector reach the speed of one-stage detectors like YOLO and SSD?
  • Small Backbone
  • Light Head
Head - Light head RCNN
• Light-Head R-CNN: In Defense of Two-Stage Object Detector, 2017, https://arxiv.org/pdf/1711.07264.pdf
Code: https://github.com/zengarden/light_head_rcnn
Head - Light head RCNN
• Backbone
  • L: Resnet101
  • S: Xception145
• Thin Feature map
  • L: C_{mid} = 256
  • S: C_{mid} = 64
  • C_{out} = 10 * 7 * 7
• R-CNN subnet
  • A fc layer is connected to the PS ROI pool/Align
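To see why this feature map counts as "thin", it helps to work out the channel arithmetic. A position-sensitive score map needs groups * p * p channels for p x p pooling. With C_{out} = 10 * 7 * 7 the map has 490 channels, whereas a class-specific R-FCN map on COCO (80 classes plus background) needs 81 * 7 * 7 = 3969. The helper name below is an assumption used only for this calculation.

```python
def ps_channels(groups, pool_size=7):
    """Channels of a position-sensitive feature map for pool_size x pool_size pooling."""
    return groups * pool_size * pool_size


light_head = ps_channels(10)   # thin feature map from the slide: 10 * 7 * 7
rfcn_coco = ps_channels(81)    # class-specific map: (80 classes + background) * 7 * 7
reduction = rfcn_coco / light_head
```

Because the thin map is no longer class-specific, the prediction has to come from a separate layer, which matches the fc layer attached after PS ROI pool/Align in the slide.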
Head - Light head RCNN
• Mobile Version
  • ThunderNet: Towards Real-time Generic Object Detection, Qin et al., arXiv 2019
  • https://arxiv.org/abs/1903.11752
Pretraining - Objects365
• ImageNet pretraining is usually employed for backbone training
• Training from Scratch
  • ScratchDet claims that GN/BN is important
  • Rethinking ImageNet Pre-training shows that training time is important
Pretraining - Objects365
• Objects365 Dataset
Pretraining - Objects365
• Pretraining with Objects365 vs ImageNet vs from scratch
Pretraining - Objects365
• Pretraining on the backbone only, or on both backbone and head
Pretraining - Objects365
• Results on VOC Detection & VOC Segmentation
Pretraining - Objects365
• Summary
  • Pretraining is important to reduce the training time
  • Pretraining with a large dataset is beneficial for the performance
Challenges - Scale
• Scale variation is extremely large for object detection
• Previous works
  • Divide and Conquer: SSD, DSSD, RON, FPN, …
    • Limited scale variation
  • Scale Normalization for Image Pyramids, Singh et al., CVPR 2018
    • Slow inference speed
• How to address extremely large scale variation without compromising inference speed?
Scale - SFace
• SFace: An Efficient Network for Face Detection in Large Scale Variations, 2018, http://cn.arxiv.org/pdf/1804.06559.pdf
• Anchor-based:
  • Good localization for the scales that are covered by anchors
  • Difficult to cover the full range of face scales
• Anchor-free:
  • Able to cover various face scales
  • Weaker localization ability
Scale - SFace
• Summary:
  • Integrate anchor-based and anchor-free for the scale issue
  • A new benchmark for face detection with large scale variations: 4K Face
Challenges - Batchsize
• Small mini-batch size for general object detection
  • 2 for R-CNN, Faster RCNN
  • 16 for RetinaNet, Mask RCNN
• Problems with a small mini-batch size
  • Long training time
  • Insufficient BN statistics
  • Imbalanced pos/neg ratio
Batchsize - MegDet
• MegDet: A Large Mini-Batch Object Detector, CVPR 2018, https://arxiv.org/pdf/1711.07240.pdf
Batchsize - MegDet
• Techniques
  • Learning rate warmup
  • Cross-GPU Batch Normalization
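Both techniques can be illustrated with a small sketch. The first function ramps the learning rate linearly from a base value up to a target obtained by scaling the base rate in proportion to the batch size; the second shows the arithmetic behind cross-GPU batch normalization, where per-device counts, sums, and sums of squares are combined into one global mean and variance. All names, the linear ramp, and the example numbers are assumptions for illustration and are not MegDet's exact recipe; a real implementation would use collective communication between devices.

```python
def warmup_lr(step, base_lr, base_batch, batch, warmup_steps):
    """Learning rate with linear warmup towards a batch-size-scaled target."""
    target = base_lr * batch / base_batch  # linear scaling with batch size
    if step < warmup_steps:
        # Ramp linearly from base_lr to target during warmup.
        return base_lr + (target - base_lr) * step / warmup_steps
    return target


def cross_device_mean_var(per_device_values):
    """Global mean and (biased) variance from per-device activations.

    per_device_values: list of lists, one list of floats per device.
    Each device only contributes (count, sum, sum of squares), which is
    what would be exchanged between GPUs.
    """
    stats = [(len(v), sum(v), sum(x * x for x in v)) for v in per_device_values]
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    total_sq = sum(s[2] for s in stats)
    mean = total / n
    var = total_sq / n - mean * mean
    return mean, var


lr_start = warmup_lr(0, base_lr=0.02, base_batch=16, batch=256, warmup_steps=500)
lr_end = warmup_lr(500, base_lr=0.02, base_batch=16, batch=256, warmup_steps=500)
mean, var = cross_device_mean_var([[1.0, 2.0], [3.0, 4.0]])
```

With only two samples per device, the per-device means here are 1.5 and 3.5, while the combined statistics give the mean 2.5 of the full batch.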
Challenges - Crowd
• NMS is a post-processing step to eliminate multiple responses on one object instance
  • Reasonable for mild crowdedness, as in COCO and VOC
  • Fails when the objects are in a crowd
Challenges - Crowd
• A few works have been devoted to this topic
  • Soft-NMS, Bodla et al., ICCV 2017, http://www.cs.umd.edu/~bharat/snms.pdf
  • Relation Networks, Hu et al., CVPR 2018, https://arxiv.org/pdf/1711.11575.pdf
• The literature lacks a good benchmark for evaluation
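As an illustration of the first work listed above, Soft-NMS replaces hard suppression with score decay: a box overlapping a higher-scoring box has its score multiplied by exp(-IoU^2 / sigma) (the Gaussian variant) instead of being removed. The pure-Python sketch below is a simplified version under stated assumptions; the function names, sigma = 0.5, and the final score threshold are illustrative choices.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thr=0.001):
    """Gaussian Soft-NMS. Returns a list of (index, rescored_score)."""
    scores = list(scores)               # work on a copy
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        # Pick the box with the highest current score.
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if scores[best] < score_thr:
            break                       # everything left is below the threshold
        keep.append((best, scores[best]))
        # Decay the scores of the remaining boxes by their overlap with it.
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            scores[i] *= math.exp(-(overlap * overlap) / sigma)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
result = soft_nms(boxes, scores)
```

In this example the second box, which heavily overlaps the first, is kept with a reduced score rather than dropped, which is the behaviour that helps when two real objects overlap in a crowd.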
Crowd - CrowdHuman
• CrowdHuman: A Benchmark for Detecting Human in a Crowd, 2018, https://arxiv.org/pdf/1805.00123.pdf, http://www.crowdhuman.org/
• A benchmark with Head, Visible Human, Full body bounding-box
• Generalization ability for other head/pedestrian datasets
• Crowdedness
Crowd - CrowdHuman
• Generalization
  • Head
  • Pedestrian
  • COCO
Conclusion
• The task of object detection is still far from solved
• Details are important to further improve the performance
  • Backbone
  • Head
  • Pretraining
  • Scale
  • Batchsize
  • Crowd
• Improvements in object detection will be a significant boost for the computer vision industry
Advertisement
• Megvii Detection Zhihu column
Email: yugang@megvii.com
Object Detection Beyond Mask R-CNN and RetinaNet I

