2. HIGHLIGHTS
METHOD OVERVIEW
Description of the proposed method and how the “limiting prompt” fits into it.
PRIOR COMPARISON
Description of the contributions and the technical differences from prior art.
LIMITING PROMPT PRE-LABELING
The part of the proposed method that serves as a teacher model or pre-labeling tool: a zero-shot object detector that combines a state-of-the-art object detector with natural language processing (NLP).
FUTURE TASKS
Pending action items based on the schedule, plus any other action items that arise during the discussion.
2024 Monthly Report 2
3. [METHOD OVERVIEW]
PURPOSE
• A method that uses a generative network to translate the environment of an image and add objects, using bounding-box-based prompts to limit the location and types of the added objects.
• An object detector that can handle a target domain using <20% of the labeled target-domain dataset, 100% of the labeled source-domain dataset, and the corresponding generated dataset, with accuracy comparable to training on 100% of the labeled target-domain dataset.
4. [METHOD OVERVIEW]
PROPOSED METHOD
• The proposed method uses a generative network to generate an image in the style of the target domain, while the content of the image comes from a prompt that is limited by the source-domain labels; this eliminates the need for manual labeling.
• The object detector is trained on the generated target-domain data, 20% of the target-domain data, and 100% of the source-domain data, and is benchmarked on the target test set against a detector trained on 100% of the target-domain data.
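The training mix described above can be sketched as follows. This is a minimal sketch, assuming each dataset is a plain list of sample identifiers; the function name `build_training_set`, the `target_fraction` parameter, and the fixed seed are hypothetical and only illustrate the 100% source / 20% target / generated split.

```python
import random

def build_training_set(source, target, generated, target_fraction=0.2, seed=0):
    """Combine all source samples, a labeled fraction of target samples,
    and all generated samples into one shuffled training list."""
    rng = random.Random(seed)
    # Number of labeled target samples to keep (20% by default).
    n_target = int(len(target) * target_fraction)
    target_subset = rng.sample(target, n_target)
    # Tag each sample with its origin so the mix can be inspected later.
    mix = [("source", s) for s in source]
    mix += [("target", t) for t in target_subset]
    mix += [("generated", g) for g in generated]
    rng.shuffle(mix)
    return mix

# Toy identifiers standing in for real images (illustrative only).
source = [f"src_{i}" for i in range(10)]
target = [f"tgt_{i}" for i in range(10)]
generated = [f"gen_{i}" for i in range(10)]
train_set = build_training_set(source, target, generated)
```

The baseline for comparison would be a detector trained on the full `target` list, evaluated on the same target test set.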
5. PRIOR COMPARISONS
| Attributes | Proposed Method | Fine-grained Feature Imitation [1] | Teacher-student Network [2] | Label Smoothing Regularization [3] |
|---|---|---|---|---|
| Training Paradigm | KD with self-supervised teacher. | KD with fine-grained feature imitation. | KD with supervised teacher. | KD with smoothing regularization. |
| Dataset | Combination of real and synthetic dataset. | Real dataset | Real dataset | Real dataset |
| Teacher Model | Unsupervised | Supervised | Weakly-Supervised | Supervised |
| Incremental Learning Adaptability | Yes | No | Yes | No |
[1] T. Wang et al., Distilling object detectors with fine-grained feature imitation, CVPR, 2019. https://arxiv.org/abs/1906.03609
[2] A. Banitalebi-Dehkordi, Revisiting knowledge distillation for object detection, 2021. https://arxiv.org/pdf/2105.10633.pdf
[3] L. Yuan et al., Revisiting knowledge distillation via label smoothing regularization, 2020. https://arxiv.org/abs/1909.11723
6. LIMITING PROMPT PRE-LABELING
A self-supervised learning framework built on a transformer-based architecture: a zero-shot detector that combines DETR [4] for visual cues and GLIP [5] for text cues.
[4] N. Carion et al. End-to-end object detection with transformers. European Conference on Computer Vision, pages 213–229. Springer, 2020. https://arxiv.org/pdf/2203.03605.pdf
[5] L.H. Li et al. Grounded language-image pre-training. arXiv, 2022. https://arxiv.org/pdf/2112.03857.pdf
7. [LIMITING PROMPT PRE-LABELING]
WHAT IS ZERO-SHOT OBJECT DETECTION?
Suppose we would like to train an object detector for specific object classes: pedestrian, traffic sign, and car. The following steps would be required prior to training:
• data collection (10%),
• data preparation (5%),
• data annotation (70%),
• data augmentation (10%), and
• data preprocessing (5%).
A zero-shot object detector is a powerful tool: it mainly consists of a large (and thus computationally inefficient) detector architecture that is capable of detecting a vast range of classes.
Sample of a zero-shot object detector [6]
[6] H. Zhang et al. DETR with improved denoising anchor boxes for end-to-end object detection, 2022. https://arxiv.org/abs/2203.03605
8. [LIMITING PROMPT PRE-LABELING]
WHY CAN’T WE USE ANY PRE-TRAINED MODEL?
• “Any pre-trained model” here refers to a conventional object detector that has been trained on a fixed dataset, for example MS-COCO [7].
• Although such a detector may provide high accuracy, it lacks robustness in terms of detecting a vast range of classes.
• For example, MS-COCO has no “traffic sign” class [7, 8], which means that the model cannot support labeling for this class, or for any other arbitrary class.
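The closed-vocabulary limitation above can be illustrated with a small check. This is a minimal sketch: `COCO_CLASSES_EXCERPT` holds only a handful of the 80 MS-COCO category names (not the full list), and `split_by_vocabulary` is a hypothetical helper, not part of any detector library.

```python
# Excerpt of MS-COCO category names; the full list has 80 entries.
COCO_CLASSES_EXCERPT = {
    "person", "bicycle", "car", "motorcycle",
    "bus", "truck", "traffic light", "stop sign",
}

def split_by_vocabulary(wanted, vocabulary):
    """Separate requested classes into those the detector can label
    and those that fall outside its fixed vocabulary."""
    supported = [c for c in wanted if c in vocabulary]
    unsupported = [c for c in wanted if c not in vocabulary]
    return supported, unsupported

# MS-COCO names the pedestrian class "person".
wanted = ["person", "traffic sign", "car"]
supported, unsupported = split_by_vocabulary(wanted, COCO_CLASSES_EXCERPT)
print("can pre-label:", supported)
print("cannot pre-label:", unsupported)
```

A detector trained on a fixed label set can only pre-label the `supported` classes; anything in `unsupported` would need manual annotation or an open-vocabulary detector.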
Sample of a detector on MS-COCO
[7] T.Y. Lin et al. Microsoft COCO: Common objects in context, 2014. https://arxiv.org/abs/1405.0312
[8] https://github.com/matlab-deep-learning/Object-Detection-Using-Pretrained-YOLO-v2/blob/main/+helper/coco-classes.txt
9. [LIMITING PROMPT PRE-LABELING]
ZERO-SHOT DETECTOR AS PRE-LABELER
• A transformer-based detector, DETR [4], with grounded (i.e. limited) pre-training.
• It can detect arbitrary objects given text-based prompts such as class names or referring expressions.
• A large network with two sub-networks: DETR as the detector and GLIP as the NLP processor; together they form a zero-shot detector covering a vast range of classes [5].
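A text prompt built from class names might look like the sketch below. The helper `build_limiting_prompt` is hypothetical, and the " . " separator is an assumption based on the convention used in the public Grounding DINO demo [5]; other models may expect a different format.

```python
def build_limiting_prompt(class_names):
    """Turn a list of class names into a single text prompt.

    Assumption: classes are lower-cased and separated by ' . ',
    with a trailing ' .', as in the Grounding DINO demo.
    """
    cleaned = [name.strip().lower() for name in class_names if name.strip()]
    # Drop duplicates while preserving the original order.
    unique = list(dict.fromkeys(cleaned))
    return " . ".join(unique) + " ." if unique else ""

prompt = build_limiting_prompt(["Pedestrian", "traffic sign", "car", "Car"])
print(prompt)
```

Restricting the prompt to the classes of interest is what limits the detector to those classes during pre-labeling.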
[4] H. Zhang et al. DETR with improved denoising anchor boxes for end-to-end object detection, 2022. https://arxiv.org/abs/2203.03605
[5] S. Liu et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection, 2023. https://arxiv.org/pdf/2303.05499.pdf
10. [LIMITING PROMPT PRE-LABELING]
HOW DOES IT WORK? (1/5)
• It is an end-to-end architecture that contains a backbone, a transformer encoder-decoder, and multiple prediction heads.
• The backbone works as a convolutional network does: it extracts features from the input image, specifically high-level, hierarchical features associated with spatial information.
• The transformer encoder processes the feature maps through transformer layers consisting of patch-wise multi-head self-attention and feed-forward neural networks, capturing contextual information between feature patches.
• The transformer decoder processes the feature vectors from the encoder through transformer layers that also consist of multi-head attention and feed-forward neural networks, turning a set of object queries into predictions.
• The loss function is a bipartite matching loss that sums localization and classification losses.
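The bipartite matching step behind the loss can be sketched as below. This is a simplified illustration, not DETR's actual implementation: the pair cost here is just (1 - class probability) plus an L1 box distance, the weights are assumed values, and the optimal assignment is found by brute force over permutations, which is only practical for a handful of boxes (real implementations use the Hungarian algorithm).

```python
from itertools import permutations

def l1_distance(box_a, box_b):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))

def pair_cost(pred, gt, class_weight=1.0, box_weight=1.0):
    """Cost of matching one prediction to one ground truth:
    a classification term plus a localization term."""
    class_cost = 1.0 - pred["probs"][gt["label"]]
    box_cost = l1_distance(pred["box"], gt["box"])
    return class_weight * class_cost + box_weight * box_cost

def bipartite_match(preds, gts):
    """Return (total_cost, assignment), where assignment[j] is the index
    of the prediction matched to gts[j]. Assumes len(preds) >= len(gts)."""
    best_cost, best_assignment = float("inf"), None
    for perm in permutations(range(len(preds)), len(gts)):
        total = sum(pair_cost(preds[i], gts[j]) for j, i in enumerate(perm))
        if total < best_cost:
            best_cost, best_assignment = total, perm
    return best_cost, best_assignment

# Toy boxes in (cx, cy, w, h) format with per-class probabilities.
preds = [
    {"box": (0.5, 0.5, 0.2, 0.2), "probs": {"car": 0.9, "pedestrian": 0.1}},
    {"box": (0.1, 0.1, 0.1, 0.3), "probs": {"car": 0.2, "pedestrian": 0.8}},
    {"box": (0.8, 0.8, 0.1, 0.1), "probs": {"car": 0.5, "pedestrian": 0.5}},
]
gts = [
    {"box": (0.1, 0.1, 0.1, 0.3), "label": "pedestrian"},
    {"box": (0.5, 0.5, 0.2, 0.2), "label": "car"},
]
total_cost, assignment = bipartite_match(preds, gts)
```

Each ground truth is matched to exactly one prediction, and the summed cost of the matched pairs plays the role of the loss; the unmatched third prediction would be treated as "no object".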
11. [LIMITING PROMPT PRE-LABELING]
HOW DOES IT WORK? (2/5)
• During bounding box matching (prior to loss calculation), incorrect matches are common because a fixed IoU threshold is used, which can produce erroneous loss values.
• DETR uses contrastive denoising training, which essentially treats the predictions and ground truths as corresponding attention masks. Therefore, instead of matching at the bounding box level, matching is done at the attention mask level (or denoised level). This is intended to resolve the erroneous loss values.
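The sensitivity of threshold-based matching can be illustrated with a small sketch. This shows only the fixed-IoU-threshold problem described above, not the denoising scheme itself; the helper names and the 0.5 threshold are assumptions chosen for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_by_threshold(preds, gts, threshold=0.5):
    """Greedy matching: each ground truth takes the unused prediction
    with the highest IoU, provided that IoU reaches the threshold."""
    matches, used = {}, set()
    for g_idx, gt in enumerate(gts):
        best_iou, best_p = 0.0, None
        for p_idx, pred in enumerate(preds):
            if p_idx in used:
                continue
            score = iou(pred, gt)
            if score >= threshold and score > best_iou:
                best_iou, best_p = score, p_idx
        if best_p is not None:
            matches[g_idx] = best_p
            used.add(best_p)
    return matches

gts = [(0, 0, 10, 10)]
# Two nearly identical predictions, shifted by 3 and 4 pixels.
matches_close = match_by_threshold([(3, 0, 13, 10)], gts)    # IoU = 70/130
matches_shifted = match_by_threshold([(4, 0, 14, 10)], gts)  # IoU = 60/140
```

A one-pixel difference flips the result from matched to unmatched, which is the kind of brittle behaviour that a fixed threshold introduces into the loss.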
12. [LIMITING PROMPT PRE-LABELING]
HOW DOES IT WORK? (3/5)
• This method turns DETR from an ordinary detector into a zero-shot detector by combining it with a parallel GLIP stream, i.e. an NLP model.
• The visual stream is processed by DETR; the text stream is processed by GLIP.
• Essentially, the text stream serves as a limiting prompt that gives the object detector cues for regressing the visual stream toward a particular class of interest.
13. [LIMITING PROMPT PRE-LABELING]
HOW DOES IT WORK? (4/5)
• The feature enhancer is a bridge between the visual-stream and text-stream features.
• To unify these features, it uses multiple feature enhancer layers, such as a deformable self-attention layer for enhancing visual-stream features and a regular self-attention layer for enhancing text-stream features.
• Deformable self-attention enhances visual features by allowing the model to focus on specific regions of interest within the image.
• Regular self-attention is applied to text features, enabling the model to capture semantic relationships and context within the text.
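The regular self-attention applied to text features can be sketched in a few lines. This is a minimal single-head sketch in plain Python, assuming identity query/key/value projections (a real layer learns these projections and uses multiple heads); deformable self-attention for the visual stream is not shown.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    Each output token is a weighted average of all input tokens, with
    weights given by the softmax of scaled dot-product similarities.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        mixed = [sum(w * value[d] for w, value in zip(weights, tokens))
                 for d in range(dim)]
        outputs.append(mixed)
    return outputs

# Three toy 2-dimensional text token features (illustrative values).
text_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
enhanced = self_attention(text_features)
```

Because every output is a mixture of all tokens, each enhanced feature carries context from the rest of the prompt.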
14. [LIMITING PROMPT PRE-LABELING]
HOW DOES IT WORK? (5/5)
• A cross-modality decoder is then used to integrate the text and image modality features.
• The cross-modality decoder processes the fused features and decoder queries through a series of attention layers and feed-forward networks.
• These layers allow the decoder to capture the relationships between visual and textual information, enabling it to refine the object detections and assign appropriate labels.
• After this step, the model performs the final object detection steps: bounding box prediction, class-specific confidence filtering, and label assignment.
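The final filtering and label assignment can be sketched as follows. This is a minimal sketch under assumptions: each prediction is taken to carry one similarity score per prompt class, and the `filter_and_label` helper and the 0.35 threshold are hypothetical values for illustration.

```python
def filter_and_label(predictions, score_threshold=0.35):
    """Keep predictions whose best class score reaches the threshold,
    and assign each kept box the class with the highest score."""
    results = []
    for pred in predictions:
        label, score = max(pred["scores"].items(), key=lambda item: item[1])
        if score >= score_threshold:
            results.append({"box": pred["box"], "label": label, "score": score})
    return results

# Toy decoder outputs: one box plus per-class scores for each query.
predictions = [
    {"box": (10, 20, 50, 80),
     "scores": {"pedestrian": 0.72, "traffic sign": 0.05, "car": 0.10}},
    {"box": (100, 40, 180, 90),
     "scores": {"pedestrian": 0.03, "traffic sign": 0.08, "car": 0.61}},
    {"box": (5, 5, 15, 15),
     "scores": {"pedestrian": 0.12, "traffic sign": 0.20, "car": 0.09}},
]
pre_labels = filter_and_label(predictions)
```

The third query is dropped because none of its class scores reaches the threshold, so only confident boxes become pre-labels.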
15. [LIMITING PROMPT PRE-LABELING]
EXPERIMENTAL RESULTS (1/2)
To test the performance of limiting prompt pre-labeling, the following sample image was used. Notice that the text prompt helps the detector decide which objects to predict, based on the semantics and context of the text.
17. [FUTURE TASKS]
PIPELINE SCHEDULE
[Gantt chart, January to December 2024. Scheduled items:]
• Preliminary studies and proposal
• Limiting prompt
• Dataset benchmark
• Generative network design
• Object detector design
• Manuscript draft
• Feedback and refinement
• arXiv submission & handover
• DAC 2025 submission
18. [FUTURE TASKS]
ACTION ITEMS
• Train a generative adversarial network on the BDD-100K dataset (U.S. dataset) with clear (1st domain) and snowy (2nd domain) images.
• Run inference with this generative adversarial network on the IDD dataset (India dataset) to produce the snowy domain.
• Use the limited labels from the pre-labeling tool to make sure that the generative adversarial network converts only the background: it must not alter any objects or classes of interest, and it must not generate any false objects.
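One possible way to check the last item is to compare pre-labels before and after translation. This is a minimal sketch under assumptions: the pre-labeling tool is assumed to return a list of `{"label", "box"}` dictionaries per image, and `compare_pre_labels` and the 0.5 IoU threshold are hypothetical choices.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def has_match(obj, candidates, iou_threshold):
    """True if some candidate has the same label and overlaps enough."""
    return any(c["label"] == obj["label"]
               and iou(c["box"], obj["box"]) >= iou_threshold
               for c in candidates)

def compare_pre_labels(before, after, iou_threshold=0.5):
    """Return (missing, spurious): objects lost or altered by the
    translation, and objects that appear only after translation."""
    missing = [o for o in before if not has_match(o, after, iou_threshold)]
    spurious = [o for o in after if not has_match(o, before, iou_threshold)]
    return missing, spurious

# Toy pre-labels for a clear image and its snowy translation.
before = [{"label": "car", "box": (10, 10, 50, 40)},
          {"label": "pedestrian", "box": (60, 15, 70, 45)}]
after = [{"label": "car", "box": (11, 10, 51, 40)},
         {"label": "car", "box": (80, 20, 120, 50)}]
missing, spurious = compare_pre_labels(before, after)
```

An empty `missing` list suggests the objects of interest were left intact, and an empty `spurious` list suggests no false objects were generated; in this toy case the pedestrian is lost and an extra car appears.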