Recent Object Detection Research & Person Detection

Recent Object Detection Development & Person Detection Survey
kv

Outline
- Review Object Detection
- Research Trends: Anchor-free detector
- Person Detection

Object Detection
● Is a deep-learning-dominated domain
● Modularized, reusable design
○ Components
○ Pipeline
○ Feature scaling design
Object Detection in 20 Years: A Survey
RCNN

General Object Detector Arch.
Backbone → Neck → Head
Backbone → Neck → Dense Head → Head
One-Stage
● YOLO
● SSD
● RetinaNet
Two-Stage
● Faster-RCNN
● ThunderNet

Component in Object Detection Pipeline
Backbone (feature extractor)
- ResNet50, ResNeXt, MobileNet
- Hourglass, DLA
Neck (in-net preprocessor)
- FPN, BPN, HRPN
Dense Head
- RPN
Head (task)
- AnchorHead
- retina, ssd
- fcos, ctdet
- BoxHead
Loss function
- CE, BCE
- Focal loss
- L1, Smooth L1
Computation Module
- Deformable Conv (v1, v2)
- GN (Group Normalization)
- SyncBN
- NMS, SoftNMS
- GA (Guided Anchoring)
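Two of the loss functions listed above, focal loss and Smooth L1, can be sketched per element as follows. This is a minimal illustration, not the code of any framework: the function names and the default values of `alpha`, `gamma`, and `beta` are assumptions chosen for the example.

```python
import math

def binary_focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Focal loss for one predicted probability p with label y in {0, 1}.

    alpha and gamma defaults are illustrative assumptions.
    """
    p_t = p if y == 1 else 1.0 - p          # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)^gamma down-weights examples that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))

def smooth_l1(x, beta=1.0):
    """Smooth L1 on a regression residual x: quadratic near 0, linear elsewhere."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta
```

With `gamma=0` and `alpha=0.5` the focal term reduces to half of the ordinary cross-entropy, which is a quick sanity check on the weighting.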

Two-Stage: Faster RCNN
Per-image computation: ResNet, RPN
Per-RoI computation: RoIPool, MLP, Softmax, BoxReg

Scale in Object Detection
Backbone
● Without scale
○ ConvNet
● With scale
○ DLA
○ Hourglass
○ Modified-ResNet

Backbone Parameters
Backbone name | Top-1 error (%) | # of parameters | FLOPs/2
ResNet-50     | 22.28           | 25,557,032      | 3,877.95M
DLA-34        | 25.36           | 15,742,104      | 3,071.37M
ResNet-101    | 21.90           | 44,549,160      | 7,597.95M
Hourglass     | -               | -               | -
reference: https://github.com/osmr/imgclsmob/blob/master/pytorch/README.md

Object Detection & Person Detection
Person detection ≈ class-agnostic object detection with a crowdedness problem

Object Detection & Person Detection
● Crowdedness & Occlusion
● Scale & fine-grained
● Unusual pose
● Non-person, distractor
● Night scene
● Background distribution (domain shift)

Datasets
COCOPerson, CrowdHuman, Caltech pedestrian, WiderPerson, WiderPerson19, CUHK Person
dataset                | # of img       | # of person      | density
COCO Person            | 64,115         | 257,252          | 4.01
CrowdHuman             | 15,000         | 339,565          | 22.64
WiderPerson            | 9,000          | 399,786          | 39.87
CUHK Person            | 18,184         | 99,809           | 5.48
WiderPerson19 (sur/ad) | 8,240 / 88,260 | 58,190 / 248,993 | 7.05 / 2.82
Caltech pedestrian     | 72,782         | 13,674           | 0.32
CityPerson             | 2,975          | 19,654           | 6.61
train, test, benchmark

Dataset: CrowdHuman
Annotations
● Full box
● Visible box
● Head box
Features
● Targets the crowdedness issue

Dataset: WiderPerson
TMM2019 http://www.cbsr.ia.ac.cn/users/jwan/papers/TMM2019-WiderPerson.pdf
Features
● Questionable annotation quality
● Limited scene distribution (by observation)
Annotations
● Full box
● class, tag

Dataset: WiderPerson
TMM2019 http://www.cbsr.ia.ac.cn/users/jwan/papers/TMM2019-WiderPerson.pdf
Features
● More balanced location distribution

Dataset: WiderPerson2019
https://wider-challenge.org/2019.html
Features
● vehicle & surveillance
● Low-quality but high-resolution images

Observations
Datasets: COCOPerson, CrowdHuman, Caltech pedestrian, WiderPerson, CUHK Person, Market1501, WiderPerson19
Scene types: General Image, Vehicle, Surveillance

Observations
● A model trained on COCOPerson may not perform well in real-world scenarios (not confirmed)
● COCOPerson contains some unreasonable annotations
● The WiderPerson dataset is too noisy to use directly
● Full box is hard; visible box may cause a higher false-positive rate
● CrowdHuman is hard, but it targets the crowdedness problem

Crowdedness Problem: Repulsion Loss
Attraction
RepGT (Repulsion Term)
RepBox (Repulsion Term)
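As a rough illustration of how the terms above fit together: the Repulsion Loss paper combines them as L = L_Attr + α·L_RepGT + β·L_RepBox. The sketch below covers only the RepGT term, under stated assumptions: boxes are `(x0, y0, x1, y1)` tuples, `assigned[i]` is the index of the ground-truth box matched to proposal `i`, and every function name is hypothetical. RepBox and the attraction term are left out.

```python
import math

def area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def intersection(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)

def iou(a, b):
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def iog(box, gt):
    """Intersection over the ground-truth area."""
    g = area(gt)
    return intersection(box, gt) / g if g > 0 else 0.0

def smooth_ln(x, sigma=0.5):
    """Smoothed -ln(1 - x); sigma must lie in [0, 1)."""
    if x <= sigma:
        return -math.log(1.0 - x)
    return (x - sigma) / (1.0 - sigma) - math.log(1.0 - sigma)

def rep_gt(pred_boxes, proposals, assigned, gts, sigma=0.5):
    """Penalise predicted boxes that overlap a ground truth other than their own."""
    total = 0.0
    for pred, prop, a in zip(pred_boxes, proposals, assigned):
        others = [g for j, g in enumerate(gts) if j != a]
        if not others:
            continue
        # Repulsion target: the non-assigned ground truth closest to the proposal.
        rep = max(others, key=lambda g: iou(g, prop))
        total += smooth_ln(iog(pred, rep), sigma)
    return total / max(len(pred_boxes), 1)
```

A prediction that drifts halfway onto a neighbouring person is penalised, while one that stays on its own target contributes zero.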

Crowdedness Problem: Repulsion Loss

Crowdedness Problem: Adaptive-NMS
Adaptive-NMS
● Dynamic suppression according to target density
● Subnetwork to learn density scores
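The dynamic suppression idea can be sketched as a greedy NMS whose IoU threshold rises with the density predicted at the kept box. This is a simplified hard-suppression version under assumptions: `densities` stands in for the output of the density subnetwork, boxes are `(x0, y0, x1, y1)` tuples, and the names and the `base_thresh` default are hypothetical.

```python
def iou(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, w) * max(0.0, h)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def adaptive_nms(boxes, scores, densities, base_thresh=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        m = order.pop(0)
        keep.append(m)
        # In crowded regions (high density) the threshold is relaxed,
        # so heavily overlapping neighbours are not suppressed.
        thresh = max(base_thresh, densities[m])
        order = [i for i in order if iou(boxes[m], boxes[i]) < thresh]
    return keep
```

For two boxes with an IoU of about 0.67, a low density keeps only the higher-scoring one, while a density of 0.8 keeps both.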

Crowdedness Problem: Adaptive-NMS

Drawbacks of anchor box
● Large # of anchors (SSD 40k, RetinaNet 100k)
○ Faster-RCNN still performs well with few proposals
● Introduces extra hyperparameters
● May fail in multi-scale scenarios
● Imbalance between positive & negative anchors

Recent Trend in Object Detection

Era of anchor-free detector
One-Stage: Fast, Simple
Two-Stage: High Precision (Recall)
Anchor-Free: Hybrid of both methods
2018
- 8/3 CornerNet (pair)
2019
- 1/23 ExtremeNet (4 pts)
- 4/2 FCOS
- 4/8 FoveaBox
- 4/18 CornerNet-Lite
- 4/19 CenterNet (triplet)
- 4/23 Center and Scale Prediction (CSP)
- 4/25 Objects as Points (CenterNet)
- 10/21 CSID (CSP+ID)

Algo Relations
Anchor-Free
- Single point: FCOS, CenterNet, CSP, CSID
- Multiple points: CornerNet, Triplet, ExtremeNet

CornerNet
Object as paired keypoints

CornerNet
Object as a pair of keypoints (top-left & bottom-right)
Find Corner
Associative
Embedding
Grouping

CornerNet
Corner Pooling
Top-Left
Bottom-Right
Backbone matters:
Hourglass provides 8 AP more than FPN

CornerNet
Corner Pooling
Top-Left
Bottom-Right
● One dimensional embedding
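Corner pooling for the top-left corner can be sketched as two directional running maxima that are then summed. The version below is a plain-Python illustration on nested lists, not the authors' implementation; the argument names `f_top` and `f_left` and the choice of which map is scanned in which direction are assumptions made to match the geometric intuition (the top edge lies to the right of the corner, the left edge lies below it).

```python
def top_left_corner_pool(f_top, f_left):
    """Pool two H x W feature maps (nested lists) into one corner response map."""
    h, w = len(f_top), len(f_top[0])

    # Scan each row from right to left: every cell holds the max of
    # everything to its right, i.e. evidence for the object's top edge.
    t = [row[:] for row in f_top]
    for i in range(h):
        for j in range(w - 2, -1, -1):
            t[i][j] = max(t[i][j], t[i][j + 1])

    # Scan each column from bottom to top: every cell holds the max of
    # everything below it, i.e. evidence for the object's left edge.
    l = [row[:] for row in f_left]
    for i in range(h - 2, -1, -1):
        for j in range(w):
            l[i][j] = max(l[i][j], l[i + 1][j])

    return [[t[i][j] + l[i][j] for j in range(w)] for i in range(h)]
```

The bottom-right variant mirrors the scan directions (leftward and upward).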

CornerNet: Loss function
● Pixel-wise regression on heatmap with focal loss
● Smooth L1 on offset map
Heatmap, Offset, Grouping
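The heatmap term can be illustrated with the penalty-reduced focal loss variant used for keypoint heatmaps, where negatives close to a ground-truth peak are penalised less. The sketch assumes `target` is a Gaussian-smoothed heatmap with value exactly 1.0 at keypoints; the function name and the defaults `alpha=2`, `beta=4` are assumptions for the example.

```python
import math

def heatmap_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-12):
    """Pixel-wise focal loss over H x W nested lists, normalised by # of peaks."""
    loss, num_pos = 0.0, 0
    for p_row, y_row in zip(pred, target):
        for p, y in zip(p_row, y_row):
            p = min(max(p, eps), 1.0 - eps)  # keep log() finite
            if y == 1.0:
                loss -= (1.0 - p) ** alpha * math.log(p)
                num_pos += 1
            else:
                # (1 - y)^beta shrinks the penalty near a ground-truth peak.
                loss -= (1.0 - y) ** beta * p ** alpha * math.log(1.0 - p)
    return loss / max(num_pos, 1)
```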

CenterNet: Keypoint Triplets
Problem of CornerNet
● Sensitive to edges (top 100)
● High false positive rate
Improvement
● Correct predictions by checking the central parts
Object as a keypoint triplet
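The central-part check can be sketched as follows: a corner-pair box is kept only if some detected center keypoint falls inside its central region. The central-region formula follows the form given in the keypoint-triplet paper; the box format `(tlx, tly, brx, bry)`, the function names, and the default `n=3` are assumptions, and class matching between corners and center is omitted.

```python
def central_region(box, n):
    """Central region of a box; larger odd n gives a smaller region."""
    tlx, tly, brx, bry = box
    ctlx = ((n + 1) * tlx + (n - 1) * brx) / (2 * n)
    ctly = ((n + 1) * tly + (n - 1) * bry) / (2 * n)
    cbrx = ((n - 1) * tlx + (n + 1) * brx) / (2 * n)
    cbry = ((n - 1) * tly + (n + 1) * bry) / (2 * n)
    return ctlx, ctly, cbrx, cbry

def keep_box(box, centers, n=3):
    """True if any center keypoint (cx, cy) lies in the box's central region."""
    x0, y0, x1, y1 = central_region(box, n)
    return any(x0 <= cx <= x1 and y0 <= cy <= y1 for cx, cy in centers)
```

With `n=3` the central region is the middle third of the box in each dimension.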

CenterNet: Keypoint Triplets
Corner Pool
Associative
Embedding
Grouping
Center Pool

FCOS
Object as a point + 4d vector
● Balance between positive & negative samples
● Ambiguous case ~ 1.4% in COCO
● Hint for center

FCOS
Backbone + FPN + Head (classical arch)

FCOS: Centerness
Important Feature
● Center-ness eliminates ambiguous samples
● Class score times center-ness score @NMS
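A minimal sketch of the center-ness target and the score used for ranking at NMS time, assuming `l`, `t`, `r`, `b` are the positive distances from a location to the left, top, right, and bottom sides of its box. The function names are hypothetical.

```python
import math

def centerness(l, t, r, b):
    """1.0 at the box center, decaying towards 0 near the box border."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

def ranking_score(class_score, l, t, r, b):
    """Score used to rank detections before NMS."""
    return class_score * centerness(l, t, r, b)
```

A location far off-center therefore gets a low ranking score even when its class score is high, so its box tends to be suppressed.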

FCOS: Improvements
● 1x and 2x mean the model is trained for 90K and 180K iterations, respectively.
● center means center sampling is used in training.
● liou means the model uses the linear IoU loss function (1 - iou).
● giou means the model uses the GIoU loss function (1 - giou).
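The two box regression losses named above can be sketched for a single pair of boxes. The example assumes `(x0, y0, x1, y1)` boxes with positive area; the function names are hypothetical.

```python
def iou_and_giou(a, b):
    """Return (IoU, GIoU) for two boxes with positive area."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, inter_w) * max(0.0, inter_h)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    iou = inter / union
    # Smallest box enclosing both a and b.
    enclose = ((max(a[2], b[2]) - min(a[0], b[0]))
               * (max(a[3], b[3]) - min(a[1], b[1])))
    giou = iou - (enclose - union) / enclose
    return iou, giou

def linear_iou_loss(a, b):
    return 1.0 - iou_and_giou(a, b)[0]

def giou_loss(a, b):
    return 1.0 - iou_and_giou(a, b)[1]
```

For disjoint boxes the linear IoU loss is stuck at 1 regardless of distance, while the GIoU loss keeps growing as the boxes move apart, which is why it still provides a gradient signal.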

Objects as Points (+2 vals)
● Simple method
○ One feature map that represents all scales
○ No bounding box matching
○ No non-maximum suppression
● Better speed-accuracy trade-off
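Dropping NMS works because detections are read directly from local maxima of the center heatmap. A rough sketch of that peak extraction, assuming a single-class heatmap as a nested list and a 3x3 neighbourhood; the function name and the `k` and `min_score` parameters are hypothetical.

```python
def heatmap_peaks(heat, k=100, min_score=0.0):
    """Return up to k (score, row, col) local maxima, highest score first."""
    h, w = len(heat), len(heat[0])
    peaks = []
    for i in range(h):
        for j in range(w):
            v = heat[i][j]
            # 3x3 window around (i, j), clipped at the image border.
            window = [heat[y][x]
                      for y in range(max(0, i - 1), min(h, i + 2))
                      for x in range(max(0, j - 1), min(w, j + 2))]
            if v > min_score and v >= max(window):
                peaks.append((v, i, j))
    peaks.sort(reverse=True)
    return peaks[:k]
```

In a tensor framework the same test is usually written as a 3x3 max pooling followed by an equality comparison with the input.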

Objects as Points: “The true CenterNet”
Hourglass
● Use DCNv2 instead of Conv
● Heatmap supports 2D, 3D, pose estimation

Objects as Points: “The true CenterNet”
● Pixel-wise regression with focal loss
● Scale map is not normalized
● Size reg. weight constant 0.1
● L1 loss (rather than Smooth L1) for the offset
● Training longer performs better (140 to 230 epochs)

CSP: Center & Scale Prediction
Prediction
● Center (Heatmap)
● Scale (Height)
Fixed aspect ratio @0.41 (according to dataset)
Object as a point + 1 scalar
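With a fixed aspect ratio, a box follows from the predicted center and height alone. A minimal decoding sketch, assuming the center and height are already in image pixels (any stride or log-scale handling is left out); the function name is hypothetical.

```python
def csp_box(cx, cy, height, aspect_ratio=0.41):
    """Decode (center, height) into an (x0, y0, x1, y1) pedestrian box."""
    width = aspect_ratio * height  # width is not predicted, only derived
    return (cx - width / 2, cy - height / 2,
            cx + width / 2, cy + height / 2)
```

Predicting one scalar instead of two removes one regression target, at the cost of assuming that all pedestrians share roughly the same width-to-height ratio.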

Why Choose Height?
Why Predict Center?

CSID: Center, Scale, Identity and Density aware
ID-Map learns two measures simultaneously
● Density of predicted center
● Identity of predicted center

CSID: Center, Scale, Identity and Density aware
ID-NMS

Algo Relations
Anchor-Free
- Single point: FCOS, CenterNet, CSP, CSID
- Multiple points: CornerNet, Triplet, ExtremeNet
How are points grouped?
● Pooling
● Associative embeddings
How is the center located?
● Centerness reg.
● Center target
● Domain constraints
Comparison
Algorithm       | CornerNet         | Triplet                        | FCOS       | CenterNet | CSP                 | CSID
# of points     | 2                 | 3                              | 1          | 1         | 1                   | 1, 1
Scale           | Backbone          | Backbone                       | FPN        | Backbone  | FPN                 | Backbone
Grouping method | Corner Pool, Loss | Center Pool, Corner Pool, Loss | -          | -         | -                   | ID Loss, Density loss
Key feature     | Pool, Embedding   | Pool                           | Centerness | Simple    | Const. aspect ratio | ID Map
Post-processing | NMS               | Soft-NMS                       | NMS        | -         | NMS                 | ID-NMS

Benchmarks: COCO
Algorithm         | Backbone              | AP   | AP@0.50 | AP@0.75 | APs  | APm  | APl  | inference time
YOLOv3            | DarkNet-53            | 33   | 57      | 34.4    | 18.3 | 25.4 | 41.9 | 20 fps
RetinaNet         | ResNeXt-101-FPN       | 40.8 | 61.1    | 44.1    | 24.1 | 44.2 | 51.2 | 5.4 fps
CornerNet         | Hourglass-104         | 40.5 | 56.5    | 43.1    | 19.4 | 42.7 | 53.9 | 4.1 fps
FCOS              | ResNet-101-FPN        | 41.5 | 60.7    | 45      | 24.4 | 44.8 | 51.6 | -
FCOS + imp        | ResNeXt-64x4d-101-FPN | 44.7 | 64.1    | 48.4    | 27.6 | 47.5 | 55.6 | -
CenterNet         | DLA-34                | 39.2 | 57.1    | 42.8    | 19.9 | 43   | 51.4 | 28 fps
CenterNet         | Hourglass-104         | 42.1 | 61.1    | 45.9    | 24.1 | 45.5 | 52.8 | 7.8 fps
CenterNet-Triplet | Hourglass-52          | 41.6 | 59.4    | 44.2    | 22.5 | 43.1 | 54.1 | 3.7 fps
CenterNet-Triplet | Hourglass-104         | 44.9 | 62.4    | 48.1    | 25.6 | 47.4 | 57.4 | 2.9 fps

Benchmarks: CityPerson
Algorithm Name | Backbone  | Reasonable | Heavy | Partial | Bare | inference time
FRCNN          | VGG-16    | 15.4       | -     | -       | -    | -
OR-CNN         | VGG-16    | 12.8       | 55.7  | 15.3    | 6.7  | -
RepLoss        | ResNet-50 | 13.2       | 56.9  | 16.8    | 7.6  | -
CSP            | ResNet-50 | 11         | 49.3  | 10.4    | 7.3  | 3 fps
Adaptive-NMS   | ResNet-50 | 10.8       | 54    | 11.4    | 6.2  | -
CSID           | DLA-34    | 8.8        | 46.6  | 8.3     | 5.8  | 6.25 fps

Training Frameworks
● Tensorflow Object Detection API
● mmdetection (CUHK)
● simpledet (TuSimple)
● Detectron, Detectron2

Conclusions
● Crowdedness is the major obstacle in person detection
● Anchor-free detectors seem flexible & extensible to other object tasks
● Center-based method + post-processing + specialized loss
○ CSID
○ CenterNet + A-NMS + RepLoss
● Trade-off between backbone & scaling level
○ ConvNet + FPN
○ DLA
● Still a challenging topic

Paper Lists: Person Detection
● CityPersons: A Diverse Dataset for Pedestrian Detection
● WiderPerson: A Diverse Dataset for Dense Pedestrian Detection in the Wild
● CrowdHuman: A Benchmark for Detecting Human in a Crowd
● CenterNet: Keypoint Triplets for Object Detection
● Objects as Points
● FoveaBox: Beyond Anchor-based Object Detector
● Feature Selective Anchor-Free Module for Single-Shot Object Detection
● FCOS: Fully Convolutional One-Stage Object Detection
● Center and Scale Prediction: A Box-free Approach for Object Detection
● Bottom-up Object Detection by Grouping Extreme and Center Points
● CSID: Center, Scale, Identity and Density-aware Pedestrian Detection in a Crowd
● Repulsion Loss: Detecting Pedestrians in a Crowd
● Adaptive NMS: Refining Pedestrian Detection in a Crowd
● Discriminative Feature Transformation for Occluded Pedestrian Detection
● PedHunter: Occlusion Robust Pedestrian Detector in Crowded Scenes
● Occlusion-aware R-CNN: Detecting Pedestrians in a Crowd
● Double Anchor R-CNN for Human Detection in a Crowd
