Citation: Yuan, Y.; Wu, Y.; Zhao, L.; Pang, Y.; Liu, Y. Multiple Object Tracking in Drone Aerial Videos by a Holistic Transformer and Multiple Feature Trajectory Matching Pattern. Drones 2024, 8, 349. https://doi.org/10.3390/drones8080349

Academic Editor: Xiwang Dong

Received: 23 June 2024; Revised: 22 July 2024; Accepted: 26 July 2024; Published: 28 July 2024

Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Multiple Object Tracking in Drone Aerial Videos by a Holistic
Transformer and Multiple Feature Trajectory Matching Pattern
Yubin Yuan , Yiquan Wu *, Langyue Zhao, Yaxuan Pang and Yuqi Liu
College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics,
Nanjing 211106, China; harley_yuan@nuaa.edu.cn (Y.Y.); zlangyue@nuaa.edu.cn (L.Z.);
hins_pang@nuaa.edu.cn (Y.P.); tolyuqi@nuaa.edu.cn (Y.L.)
* Correspondence: imagestrong@nuaa.edu.cn; Tel.: +86-137-7666-7415
Abstract: Drone aerial videos have immense potential in surveillance, rescue, agriculture, and urban
planning. However, accurately tracking multiple objects in drone aerial videos faces challenges like
occlusion, scale variations, and rapid motion. Current joint detection and tracking methods often
compromise accuracy. We propose a drone multiple object tracking algorithm based on a holistic
transformer and multiple feature trajectory matching pattern to overcome these challenges. The
holistic transformer captures local and global interaction information, providing precise detection and
appearance features for tracking. The tracker includes three components: preprocessing, trajectory
prediction, and matching. Preprocessing categorizes detection boxes based on scores, with each
category adopting specific matching rules. Trajectory prediction employs the visual Gaussian mixture
probability hypothesis density method to integrate visual detection results to forecast object motion
accurately. The multiple feature pattern introduces Gaussian, Appearance, and Optimal subpattern
assignment distances for different detection box types (GAO trajectory matching pattern) in the data
association process, enhancing tracking robustness. We perform comparative validations on the
vision-meets-drone (VisDrone) and the unmanned aerial vehicle benchmark: object detection and tracking (UAVDT) datasets; the results affirm the algorithm's effectiveness: it obtained 38.8% and 61.7%
MOTA, respectively. Its potential for seamless integration into practical engineering applications
offers enhanced situational awareness and operational efficiency in drone-based missions.
Keywords: multiple object tracking; transformer; detection confidence; multiple feature matching
1. Introduction
In recent years, with the rapid development of drone technology, drone aerial videos
have become an effective means of acquiring high-resolution imagery over wide coverage areas and
hold significant potential in various applications such as surveillance, rescue operations,
agriculture, and urban planning [1]. Drone aerial videos capture a wide range of object
categories, including human activities, vehicles, buildings and infrastructure, and natural
environments, among others, providing rich data that endow drones with the capability
to monitor and track various objects in different application scenarios. In this context,
multiple object tracking (MOT) has become particularly important for processing drone
aerial videos, allowing systems to track and monitor multiple objects, thus enabling a
more comprehensive range of applications such as object tracking, behavior analysis, and
environmental monitoring. However, multi-object tracking in drone aerial videos faces
numerous challenges, including object occlusion, object variations at different scales, rapid
object motion, complex environmental conditions, and data noise. Traditional multi-object
tracking methods have limitations in addressing these issues, thus requiring more advanced
techniques to enhance tracking performance [2].
The majority of multi-object tracking methods for drone aerial videos are based on
detection. These methods initially identify objects in each frame using object detection
algorithms, then employ data association, motion estimation, and filter updating to resolve
occlusion and scale variations. Long-term tracking may involve object re-identification
to handle object loss. Maintaining object trajectory information and conducting analyses
improves the system’s robustness in complex environments. Additionally, to further
improve efficiency, some researchers synchronize detection and tracking, integrating both
technologies to address the challenges posed by the wide variety and complex appearances
of objects in drone aerial videos.
Transformer models, known for their self-attention mechanism and parallel comput-
ing capabilities, have revolutionized natural language processing and computer vision [3].
Their versatility extends from vision transformers to full transformer models, and they
enable breakthroughs in tasks like image classification, object detection, and semantic
segmentation; they even branch into action recognition, object tracking, and scene flow esti-
mation. In drone aerial video analyses, transformers offer fresh perspectives for multi-object
detection and tracking. Unlike convolutional neural networks, transformers emphasize
global context interactions alongside local contexts, enhancing understanding of spatial
relationships. However, the computational expense of fine-grained self-attention in high-
resolution images poses challenges. Recent studies explore solutions like coarse-grained
global or fine-grained local self-attention to alleviate the computational burden, albeit at
the cost of the ability to simultaneously model short- and long-distance visual dependencies [4].
Given these challenges and the transformative potential of transformer models, we are
motivated to explore and develop advanced multi-object tracking methods that leverage
the strengths of transformers. Therefore, we propose a multi-object tracking method named
GAO-Tracker, which is based on a holistic transformer and multiple feature trajectory
matching pattern, to address various challenges in drone aerial videos. Our goal is to
overcome the limitations of traditional approaches and enhance the performance and
robustness of MOT in drone aerial videos, enabling more accurate and reliable tracking in
diverse and complex environments.
The remaining sections of this paper are organized as follows: Section 2, Related
Work, reviews and discusses the latest advancements in the field of multiple object track-
ing (MOT). We analyze current mainstream and cutting-edge technologies, including
object-feature-based methods, joint detection and tracking methods, and transformer-based
methods, providing a solid theoretical foundation and practical background for this re-
search. Section 3, Methodology, details our proposed GAO-Tracker method for multi-object
tracking. We delve into the core concepts, including the use of a holistic transformer
and multiple feature trajectory matching pattern to address various challenges in drone
aerial videos. We describe the model structure, algorithm workflow, and implementation
details. Section 4, Experiments, presents extensive experiments and performance evalu-
ations of GAO-Tracker. We test the method on several public datasets and compare it
with state-of-the-art methods. The results demonstrate GAO-Tracker’s superior perfor-
mance and robustness in complex scenarios. Section 5, Discussion, provides an in-depth
analysis of the experimental results. We discuss GAO-Tracker’s performance in different
scenarios, analyze its strengths and limitations, and suggest potential improvements and
future research directions. Section 6, Conclusion, summarizes the main contributions and
findings of this paper. We reiterate GAO-Tracker’s innovations in enhancing multi-object
tracking performance in drone aerial videos and discuss its prospects and potential for
practical applications.
2. Related Work
This section aims to comprehensively review and discuss the latest research advance-
ments in the field of multiple object tracking. By deeply analyzing current mainstream and
cutting-edge technologies, we establish a solid theoretical foundation and practical back-
ground for this study. First, we focus on the basic framework and challenges of multiple
object tracking. Then, we detail several core methods: object-feature-based multi-object
tracking methods, which achieve continuous tracking by extracting and utilizing the ap-
pearance, motion, and other feature information of objects; joint detection and tracking
multi-object methods, which tightly integrate object detection and tracking tasks to en-
hance the overall performance and efficiency of the system; and finally, transformer-based
multi-object tracking methods, given the transformer model’s outstanding performance in
sequence data processing. We explore how these methods utilize attention mechanisms to
achieve precise and robust object tracking in complex scenarios. Through this review and
analysis, we not only present the latest achievements in the MOT field but also highlight
the current research gaps and shortcomings, leading to the research motivation and main
contributions of this paper. Our goal is to provide new insights and solutions for the
development of multi-object tracking technology.
2.1. Multiple Object Tracking
Multi-object tracking is a highly regarded technology, and its wide range of applica-
tions has attracted widespread interest among scholars. In the early stages of research,
researchers primarily focused on applying optimization algorithms to derive object trajec-
tories [5]. The IOUTracker, which relies solely on the bounding box intersection over union
(IOU), was the simplest early multi-object tracking method [6]. Researchers gradually
introduced motion models and Kalman filters to predict the positions of objects in the next
frame [7]. Although these improvements made multi-object tracking algorithms faster and
significantly improved their performance, the algorithms performed poorly in complex
occlusion and object loss situations. To address these challenges, researchers introduced
re-identification (ReID) features as appearance models, using visual features of objects be-
tween different frames to match objects and improve the accuracy of associations between
trajectories and detection results [8]. In addition to ReID, some studies have utilized image
segmentation techniques to identify and track objects, thereby better handling occlusion sit-
uations [9]. Furthermore, some researchers have begun to use recurrent neural networks or
attention mechanisms to model the spatiotemporal relationships between objects, thereby
improving tracking accuracy and stability. However, these methods often employ a single
matching approach, neglecting the different characteristics of different types of objects.
Moreover, introducing these different technological approaches into tracking systems can
result in suboptimal tracking results, limiting effectiveness.
2.2. Object-Feature-Based Multi-Object Tracking Methods
Benefiting from the rapid development of object detectors, object feature modeling has
become widely used in multi-object tracking algorithms from the perspective of drones. It
achieves multi-object tracking by capturing unique features of objects such as color, texture,
and optical flow. These extracted features must be distinctive in order to discriminate
different objects in the feature space effectively. Once these features are extracted, similarity
criteria can be utilized to find the most similar objects in the next frame, thus enabling multi-
object tracking. SCTrack adopts a three-stage data association method that combines object
appearance models, spatial distances, and explicit occlusion handling units. The system
relies on the motion patterns of tracked objects and considers environmental constraints,
thus exhibiting good performance in handling occluded objects [10]. To address the issue of
the subjective setting of fusion ratios between appearance and motion, which often merge
appearance similarity and motion consistency in the latest frame, the appearance similarity
between objects and surrounding objects is computed, object motion is predicted using
Social LSTM networks, and weighted appearance similarity and motion predictions are
used to generate associations between the current object and the object in the previous
frame [11]. However, handling large numbers of object detections and association computations significantly increases computational costs and false detections, particularly against cluttered drone aerial backgrounds; these methods therefore must maintain accuracy while mitigating computational costs, false detections, and association errors.
2.3. Joint Detection and Tracking Multi-Object Methods
To enhance the computational speed of the entire drone aerial multi-object tracking
system, researchers have actively explored methods that combine object detection and
feature extraction to achieve greater sharing in computation. JDE was the first attempt at
this approach and innovatively integrated the feature extraction branch into the single-stage
detector YOLOv3 [12]. In contrast, FairMOT balanced the handling between detection
and recognition tasks by adopting the anchor-free detector CenterNet to reduce anchor
ambiguities [13]. In addition to these joint detection and feature embedding methods,
several other single-stage trackers have emerged. GLOA designed global–local perception
blocks to extract scale variance feature information from input frames. Adding identity
embedding branches to the prediction heads outputs more discriminative identity informa-
tion [14]. CenterTrack [15] and Chained Tracker [16], on the other hand, use multi-frame
methods to predict bounding boxes in consecutive frames, facilitating efficient short-term
associations that eventually form long-term object trajectories. However, it is essential to
note that these technologies often generate many identity switches due to the difficulty of
capturing long-term dependencies. Additionally, these methods cannot simultaneously
consider multiple features of objects and differences in features among different categories,
resulting in the easy loss of tracking for some small objects.
2.4. Transformer-Based Multi-Object Tracking Methods
In recent years, transformer-based models have achieved significant success in the field
of computer vision, primarily excelling in the domain of object detection. This has given
rise to several transformer-based methods making strides in drone multi-object tracking.
Some methods based on DETR [17] and its derivative models, such as TransTrack [18],
TrackFormer [19], and MOTR [20], represent the front of online tracking and training
progress in the field of MOT. Swin-JDE leverages transformers and comprehensively
considers three factors—detection confidence, appearance embedding distance, and IoU
distance—to match each trajectory and the detection information. Furthermore, MOTR
achieves end-to-end object tracking by iteratively updating tracking queries, eliminating the
need for complex post-processing steps. MeMOT [21], similar to MOTR, utilizes attention
mechanisms to predict by focusing on object states. Despite pioneering new tracking
paradigms, these methods still fall short of advanced tracking algorithms. While standard
self-attention can capture fine-grained short- and long-distance interactions, executing
attention on high-resolution feature maps incurs high computational costs, leading to
explosive growth in time and memory costs. This paper addresses this issue through a
holistic self-attention module.
Therefore, we propose a multi-object tracking method named GAO-Tracker based on
a holistic transformer and multiple feature trajectory matching pattern to address various
challenges in drone aerial videos. The effectiveness of the proposed method is validated
through a series of experiments and quantitative analyses, and we compare it with excellent
methods of the same kind and provide new insights and methods for multi-object tracking
in drone applications. The main contributions are as follows:
(1) A framework named GAO-Tracker, which integrates object detection and tracking
in a joint detection and tracking framework for drone aerial videos, is proposed. The
framework employs a holistic transformer as the core model for object detection and
includes a GAO trajectory matching algorithm based on object features in drone aerial
videos to achieve efficient and precise multi-object tracking.
(2) The holistic transformer, which combines fine-grained local interactions and coarse-
grained global interactions, is proposed. The framework includes an object detector holistic
trans-detector using a joint anchor-free detection head to achieve accurate object detection
in drone aerial videos.
(3) A multi-object trajectory prediction and matching module named the GAO-trajectory
matching pattern is proposed; it comprehensively considers the appearance features, mo-
tion characteristics, and size features of objects and trajectories. It includes three matching
modes: Gaussian-IOU, Appear-IOU, and OSPA-IOU, fully exploiting various object and
trajectory information to achieve robust tracking of multiple objects in drone aerial videos.
(4) Using the prior information of the object’s position from the previous frame and
combining it with object visual features, a visual Gaussian mixture probability hypothesis
density (VGM-PHD) trajectory predictor tailored to the features of drone aerial videos is
designed to provide accurate trajectory information for trajectory matching.
3. Methodology
The proposed multi-object tracking system for drone aerial videos consists of the holis-
tic trans-detector module and the GAO-trajectory matching pattern trajectory association
module. The holistic trans-detector model is an anchor-free object detector and feature
extraction module that integrates holistic self-attention, combining fine-grained local and
coarse-grained global interactions. In this new mechanism, each token finely attends to
its nearest surrounding tokens and coarsely attends to its distant surrounding tokens,
effectively capturing short-term and long-term visual dependencies. The GAO-trajectory
matching pattern trajectory association module handles the data association process by si-
multaneously considering detection confidence, appearance embedding distance, and IOU
distance, thereby enhancing the tracking robustness of the MOT model. The framework is
illustrated in Figure 1.
Figure 1. GAO-Tracker framework.
3.1. Holistic Trans-Detector: Object Detection and Feature Extraction
To adapt to high-resolution visual tasks, the model produces high-resolution feature maps in its early stages. The entire model adopts a hierarchical design consisting of
four stages, each reducing the resolution of the input feature map and expanding the recep-
tive field layer by layer, like a CNN. The framework is shown in Figure 2. At the beginning
of the input, a patch embedding layer splits the image into non-overlapping blocks and projects each block into an embedding vector. Each stage is composed of multiple holistic transformer
layers. The specific structure of the holistic transformer layer is shown in Figure 3; it is
mainly composed of LayerNorm, MLP (multi-layer perceptron), and holistic attention.
Figure 2. Holistic trans-detector.
Figure 3. Holistic transformer.
An image with a resolution of H × W × 3 is first divided into blocks of size 4 × 4, resulting in (H/4) × (W/4) patches of dimension 4 × 4 × 3. Then, these patches are projected into features
of dimension d using a convolutional layer for which the kernel size and stride are both
equal to 4. Given this spatial feature map, it is passed through four stages of concatenated
holistic transformer layers. In each stage, the holistic transformer block consists of 2, 2, 18,
and 2 holistic transformer layers, respectively. The selected configuration aims to capture
complex features at different levels of abstraction gradually. In the initial stage, there are
two layers, each aimed at capturing low-level features. In the middle stage, 18 layers focus
on learning high-level and complex features. In the final stage, two layers refine these
features to achieve precise tracking. After each stage, a patch embedding layer is added
to reduce the spatial dimensions of the feature map by half while doubling the feature
dimension. Finally, the feature maps from all four stages are sent to the detection head,
which simultaneously outputs appearance feature vectors of the objects for multi-object
trajectory matching.
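For readers who prefer a concrete reference, the following is a minimal sketch of the hierarchical stage layout described above (4 × 4 patch embedding, stage depths of 2, 2, 18, and 2, halving spatial resolution and doubling channel width after each stage). The module names, the base width of 96, and the use of identity placeholders for the holistic transformer layers are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class PatchMerge(nn.Module):
    """Reduce spatial resolution and change channels with a strided convolution."""
    def __init__(self, dim_in, dim_out, stride):
        super().__init__()
        self.proj = nn.Conv2d(dim_in, dim_out, kernel_size=stride, stride=stride)

    def forward(self, x):                    # x: (B, C, H, W)
        return self.proj(x)

class HolisticBackboneSketch(nn.Module):
    """Illustrative 4-stage layout: depths (2, 2, 18, 2), widths doubling per stage."""
    def __init__(self, embed_dim=96, depths=(2, 2, 18, 2)):
        super().__init__()
        self.patch_embed = PatchMerge(3, embed_dim, stride=4)   # 4x4 patch embedding
        dims = [embed_dim * 2 ** i for i in range(len(depths))]
        self.stages = nn.ModuleList()
        for i, depth in enumerate(depths):
            blocks = nn.Sequential(*[
                nn.Identity()                # placeholder for a holistic transformer layer
                for _ in range(depth)
            ])
            down = (PatchMerge(dims[i], dims[i + 1], stride=2)
                    if i < len(depths) - 1 else nn.Identity())
            self.stages.append(nn.ModuleDict({"blocks": blocks, "down": down}))

    def forward(self, x):
        x = self.patch_embed(x)              # (B, 96, H/4, W/4)
        feats = []
        for stage in self.stages:
            x = stage["blocks"](x)
            feats.append(x)                  # multi-scale maps for the detection head
            x = stage["down"](x)
        return feats

feats = HolisticBackboneSketch()(torch.randn(1, 3, 256, 256))
print([f.shape for f in feats])              # strides 4, 8, 16, 32; widths 96, 192, 384, 768
```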
Traditional transformer models face high computational and memory costs with large-
scale input data due to the global self-attention mechanism, which considers all tokens in
the input sequence. A holistic transformer addresses this by partitioning the input feature
map into sub-windows and conducting attention operations on each sub-window, reducing
computation and memory usage.
For a feature map x ∈ R^{M×N×d} of spatial size M × N, we first divide it into partitions
of size 4 × 4, with each partition serving as a feature perception core in order to perform
attention perception within a localized context. Then, we locate the surrounding context
for each window instead of individual tokens. Sub-window pooling is a core component
of a holistic transformer and divides the input feature map into smaller sub-windows,
thereby reducing the number of tokens each attention operation needs to focus on. This
segmentation and pooling transforms global attention operations into local operations,
making the model more scalable and efficient. The process is illustrated in Figure 4.
Figure 4. Holistic self-attention. We initially partition the feature map into 4 × 4 grids. While the
central 4 × 4 grid serves as the query window, we extract tokens at three granularity levels of 1 × 1,
2 × 2, and 4 × 4, respectively, from surrounding regions to serve as its keys and values. This results in
tokens with dimensions of 8 × 8, 6 × 6, and 5 × 5. Ultimately, these tokens from the three levels are
concatenated to compute the keys and values for the 4 × 4 = 16 tokens (queries) within the window.
Suppose the input feature map is denoted as x ∈ R^{M×N×d}, where M × N represents the spatial dimensions and d represents the feature dimension. Sub-window pooling is performed in parallel on the feature map at three levels l ∈ {1, 2, 4}: the input feature map x is divided into grids of size l × l, and a simple linear layer f_p^l performs the spatial sub-window pooling, as shown in Equation (1).

x^l = f_p^l(\hat{x}) ∈ R^{(M/l) × (N/l) × d}    (1)

where \hat{x} = Restructure(x) ∈ R^{((M/l) × (N/l) × d) × (l × l)}. The pooled feature maps at the different levels l provide rich fine-grained and coarse-grained information.
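A minimal sketch of the sub-window pooling in Equation (1) is given below: the feature map is pooled over l × l grids at levels 1, 2, and 4 and then passed through a per-level linear layer f_p^l. The module names and the use of average pooling inside each grid are assumptions for illustration, not the authors' exact operator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubWindowPooling(nn.Module):
    """Pool the feature map over l x l grids at levels 1, 2, 4 (cf. Eq. 1)."""
    def __init__(self, dim, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in levels])  # f_p^l per level

    def forward(self, x):                        # x: (B, M, N, d)
        outs = []
        for level, proj in zip(self.levels, self.proj):
            pooled = x if level == 1 else F.avg_pool2d(
                x.permute(0, 3, 1, 2), kernel_size=level).permute(0, 2, 3, 1)
            outs.append(proj(pooled))            # x^l: (B, M/l, N/l, d)
        return outs

pool = SubWindowPooling(dim=96)
levels = pool(torch.randn(1, 64, 64, 96))
print([t.shape for t in levels])                 # shapes (1,64,64,96), (1,32,32,96), (1,16,16,96)
```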
3.1.1. Attention Computation
After obtaining the pooled feature maps at all levels, three linear projection layers f_q, f_k, and f_v are used to compute the query from the first level and the keys and values from all levels, as shown in Equations (2)–(4).

Q = f_q(x^1)    (2)

K^l = f_k(x^l)    (3)

V^l = f_v(x^l)    (4)

To perform holistic self-attention, surrounding tokens must be extracted for each query token in the feature map. For the queries within the i-th window, Q_i ∈ R^{s_p × s_p × d}, the keys K_i ∈ R^{s × d} and values V_i ∈ R^{s × d} are extracted from the K^l and V^l surrounding the window, where l denotes the level of the keys and values, and s is the total number of tokens gathered over all holistic levels, i.e., s = 8 × 8 + 6 × 6 + 5 × 5. Finally, the holistic self-attention for Q_i is computed as shown in Equation (5).

Attention(Q_i, K_i, V_i) = Softmax(Q_i K_i^T / √d + B) V_i    (5)

where B = {B^l} is a learnable relative position bias. For the first level, it is parameterized as B^1 ∈ R^{7×7}; for the other holistic levels, considering their coarser granularity towards queries, all queries within the window are treated equally, and B^l ∈ R^{s_r^l × s_r^l} represents the relative position bias between the query window and each pooled s_r^l × s_r^l region.
The relative position deviation takes into account the positional relationships between
different sub-windows. This allows the model to understand the dependencies between
different positions better, thus enabling more accurate attention computation. The intro-
duction of relative position deviation enhances the flexibility and expressive power of the
model, enabling it to adapt better to different types of input data.
Since the attention operations for each sub-window are independent, modern hard-
ware and parallel computing frameworks can be leveraged to accelerate the model’s
training and inference processes.
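As a concrete illustration of Equations (2)–(5), the sketch below computes holistic attention for a single 4 × 4 query window with keys and values pooled at levels 1, 2, and 4; the token counts (8 × 8, 6 × 6, 5 × 5) follow Figure 4. The tensor shapes, the shared key/value source, and the helper name are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def holistic_attention_window(q_win, kv_levels, bias, dim):
    """Attention for one query window (Eq. 5).

    q_win:     (16, d)  -- the 4x4 = 16 query tokens of the window.
    kv_levels: list of (s_l, d) key/value source tokens pooled at levels 1, 2, 4
               (8x8=64, 6x6=36, 5x5=25 tokens, so s = 125 in total).
    bias:      (16, s)  -- learnable relative position bias, concatenated over levels.
    """
    kv = torch.cat(kv_levels, dim=0)          # (s, d): tokens from all levels
    k, v = kv, kv                             # shared source here; separate f_k, f_v in practice
    scores = q_win @ k.t() / dim ** 0.5 + bias  # (16, s) scaled dot-product scores + bias
    attn = F.softmax(scores, dim=-1)
    return attn @ v                           # (16, d) attended outputs

# Toy usage with the token counts from Figure 4.
d = 96
q = torch.randn(16, d)
levels = [torch.randn(64, d), torch.randn(36, d), torch.randn(25, d)]
bias = torch.zeros(16, 64 + 36 + 25)
out = holistic_attention_window(q, levels, bias, d)
print(out.shape)                              # torch.Size([16, 96])
```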
3.1.2. Detection Head
We designed an anchor-free prediction head based on the CenterNet architecture and
divided it into detection and appearance branches. Through holistic transformer feature
extraction, the output feature map is provided to both branches for object detection and
appearance embedding. The detection branch consists of three heads, which are used to
predict the heatmap, the offset of the object’s center point, and the object’s size, respectively.
The heatmap head is utilized to predict the center position of the object, with an
output dimension of h × w × Cls, where h and w represent the height and width of the
input feature map, and Cls is the number of detection classes. Each class has its own
heatmap output, with each Gaussian peak in the heatmap representing the center position
of the detected object. Assuming there are N objects in the current training sample, let (c_x^i, c_y^i) denote the center position of the i-th object, i ∈ [1, N]. Then, the heatmap corresponding to the current training sample is calculated as shown in Equation (6).

M_{xy} = Σ_{i=1}^{N} exp( -((x - ⌊c_x^i/4⌋)^2 + (y - ⌊c_y^i/4⌋)^2) / (2σ_c^2) )    (6)

Here, the operator ⌊a⌋ returns the largest integer not exceeding a, and σ_c is the standard deviation parameter. M ∈ R^{h×w×Cls} represents the output of the heatmap head, and M_{xy} is the value of M at position (x, y).
The box size and center offset heads are used to predict the BBox and the offset of the object's center point, respectively. Let BBox^i = (x_lt^i, y_lt^i, x_rb^i, y_rb^i) represent the BBox of the i-th object, where (x_lt^i, y_lt^i) and (x_rb^i, y_rb^i) are the top-left and bottom-right coordinates of the object, respectively. The offset of the center point of the i-th object is defined as shown in Equation (7).

o_{xy}^i ≜ (δ_x^i, δ_y^i) = ( c_x^i/4 - ⌊c_x^i/4⌋,  c_y^i/4 - ⌊c_y^i/4⌋ )    (7)

This helps improve the accuracy of predicting the center position of the object. The term ô ∈ R^{h×w×2} represents the output of the center offset head, and ô_{xy}^i represents the offset prediction of the i-th object at position (x, y) on ô.
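A minimal sketch of how the training targets in Equations (6) and (7) can be generated is given below; the stride of 4 and the per-class heatmap layout follow the text, while the array names and the fixed σ_c are assumptions for illustration.

```python
import numpy as np

def build_targets(centers, classes, h, w, num_cls, sigma_c=2.0, stride=4):
    """Gaussian heatmap (Eq. 6) and center-offset (Eq. 7) targets for one sample.

    centers: (N, 2) array of object centers (c_x, c_y) in input-image pixels.
    classes: (N,) integer class indices in [0, num_cls).
    """
    heatmap = np.zeros((h, w, num_cls), dtype=np.float32)
    offsets = np.zeros((h, w, 2), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]                           # feature-map coordinates
    for (cx, cy), cls in zip(centers, classes):
        gx, gy = int(cx // stride), int(cy // stride)     # floor(c/4), as in Eq. (6)
        gauss = np.exp(-((xs - gx) ** 2 + (ys - gy) ** 2) / (2 * sigma_c ** 2))
        heatmap[:, :, cls] += gauss                       # Eq. (6) sums per-object Gaussians
        if 0 <= gx < w and 0 <= gy < h:
            offsets[gy, gx] = (cx / stride - gx, cy / stride - gy)  # Eq. (7)
    return heatmap, offsets

hm, off = build_targets(np.array([[100.0, 60.0]]), np.array([0]), h=128, w=160, num_cls=5)
```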
The appearance branch is responsible for generating embedding features that assist in
identifying the object. Each head consists of a 3 × 3 convolutional layer with 256 channels,
followed by a 1 × 1 convolutional layer to produce the final output. The embedding heads
of the appearance branch calculate the appearance feature vectors of the object, which are
used in the association matching operation for multi-object tracking tasks. Specifically,
these appearance feature vectors can be used for association matching to calculate the
similarity between the tracker and the detected object. A 128-dimensional vector at position
(x, y) represents the appearance feature vector of the object at that location.
3.2. GAO Trajectory Matching Pattern
Our GAO trajectory matching pattern considers detection confidence, appearance
embedding distance, and IoU distance to associate all tracking trajectories with all detection
Bboxes. Figure 5 illustrates the architecture of the module. When receiving detection results
from the detector output, we add detection Bboxes with confidence scores higher than
0.5 to the high-score detection Bbox set, and those between 0.2 and 0.5 are added to the
low-score detection Bbox set.
Figure 5. GAO trajectory matching module.
Initially, predicted trajectories are matched with high-score detection boxes using the
Appear-IOU matching method. Unmatched trajectories then undergo secondary matching
with low-score detection boxes via Gau-IOU matching, with any remaining unmatched
low-score boxes removed. Subsequently, high-score detection boxes that were not initially
matched are re-evaluated using (optimal subpattern assignment) OSPA-IOU matching with
previously unmatched trajectories from the previous frame. High-score boxes unmatched
after both attempts are considered new trajectories, while trajectories that have been
continuously unmatched for 30 frames are removed from tracking, with flexibility to adjust
based on the video frame rate.
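The following control-flow sketch summarizes the cascade described above (Appear-IOU for high-score boxes, Gau-IOU for low-score boxes against leftover trajectories, OSPA-IOU for leftover high-score boxes against inactive trajectories, and a 30-frame removal rule). The data classes and the greedy IOU matcher are placeholders for the cost models of Sections 3.2.1–3.2.3, not the released code.

```python
from dataclasses import dataclass

@dataclass
class Det:
    box: tuple          # (x1, y1, x2, y2)
    score: float

@dataclass
class Track:
    box: tuple
    missed: int = 0
    def update(self, det):
        self.box, self.missed = det.box, 0

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def greedy_match(tracks, dets, thr=0.3):
    """Placeholder matcher; the real pattern uses Appear-IOU / Gau-IOU / OSPA-IOU costs."""
    matches, used, un_tracks = [], set(), []
    for t in tracks:
        best = max(((iou(t.box, d.box), i) for i, d in enumerate(dets) if i not in used),
                   default=(0.0, -1))
        if best[0] >= thr:
            matches.append((t, dets[best[1]]))
            used.add(best[1])
        else:
            un_tracks.append(t)
    un_dets = [d for i, d in enumerate(dets) if i not in used]
    return matches, un_tracks, un_dets

def gao_associate(tracks, inactive, dets, max_missed=30):
    high = [d for d in dets if d.score > 0.5]
    low  = [d for d in dets if 0.2 <= d.score <= 0.5]
    m1, un_tracks, un_high = greedy_match(tracks, high)          # Appear-IOU step
    m2, un_tracks, _       = greedy_match(un_tracks, low)        # Gau-IOU step (low boxes)
    m3, un_inactive, un_high = greedy_match(inactive, un_high)   # OSPA-IOU step
    for t, d in m1 + m2 + m3:
        t.update(d)                                              # update with matched detections
    new_tracks = [Track(box=d.box) for d in un_high]             # leftover high boxes -> new tracks
    for t in un_tracks + un_inactive:
        t.missed += 1
    kept = [t for t in un_tracks + un_inactive if t.missed <= max_missed]
    return m1 + m2 + m3, new_tracks, kept
```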
Successful matches update tracking through the update process with matched de-
tection frames. Trajectory prediction involves modeling visual objects’ trajectories as a
random finite set, utilizing the visual Gaussian mixture probability hypothesis density
to generate prediction information for the tracker, which primes the model for the next
frame’s association matching.
The data association process employs four distance metrics, leading to the design of
three matching methods: Gau-IOU, Appear-IOU, and OSPA-IOU distance matching.
3.2.1. Appear-IOU Distance Matching
Appear-IOU trajectory matching considers the appearance and spatial location features
between the object and predicted trajectories while calculating the cosine distance and
IOU distance between all predicted trajectories and high-scored detection appearances as
metrics. The appearance vector of the object contains extensive appearance information,
which is combined with the IOU distance of the BBox to enhance the matching accuracy
between detection boxes and trajectories. The process is shown in Figure 6.
Figure 6. Appear-IOU trajectory matching.
Let (BBox_d^i, E_d^i) denote the BBox of the i-th detected object and its corresponding feature vector in the current frame, and let (BBox_t^j, E_t^j) denote the BBox of the j-th trajectory-predicted object and its corresponding feature vector from the previous frame. The first distance metric D_{ij}^I is computed from the IOU distance:

D_{ij}^I = 1 - area(BBox_d^i ∩ BBox_t^j) / area(BBox_d^i ∪ BBox_t^j)    (8)
where area(A) represents the area of the input set A, and the symbols ∩ and ∪ denote the intersection and union of two sets. The appearance distance metric D_{ij}^A is calculated from the cosine distance between the two embedding feature vectors:

D_{ij}^A = 1 - (E_d^i · E_t^j) / (∥E_d^i∥ ∥E_t^j∥)    (9)

where · denotes the dot product between two vectors, and ∥·∥ denotes the 2-norm of a vector.
Subsequently, the IOU distances and appearance distances between all detections and
trajectories are combined in a weighted manner to obtain the Appear-IOU distance:
D^{AI} = α D_{ij}^A + (1 - α) D_{ij}^I    (10)

where α represents the proportion of the cosine distance, with values ranging between 0 and 1.
Finally, all Appear-IOU distances are merged into a cost matrix, and the Hungar-
ian algorithm is employed to achieve the best match. Unmatched trajectories undergo
secondary matching with low-scored detections through the Gau-IOU matching model.
In contrast, unmatched high-scored detection boxes undergo secondary matching with
inactive trajectories through the OSPA-IOU matching model.
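A compact sketch of Equations (8)–(10) followed by Hungarian assignment is shown below; the corner-format boxes, the weight α, and the acceptance threshold are illustrative assumptions, and scipy's linear_sum_assignment stands in for the Hungarian solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_dist(boxes_d, boxes_t):
    """Pairwise IOU distance (Eq. 8); boxes are (N, 4) arrays of (x1, y1, x2, y2)."""
    x1 = np.maximum(boxes_d[:, None, 0], boxes_t[None, :, 0])
    y1 = np.maximum(boxes_d[:, None, 1], boxes_t[None, :, 1])
    x2 = np.minimum(boxes_d[:, None, 2], boxes_t[None, :, 2])
    y2 = np.minimum(boxes_d[:, None, 3], boxes_t[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_d = (boxes_d[:, 2] - boxes_d[:, 0]) * (boxes_d[:, 3] - boxes_d[:, 1])
    area_t = (boxes_t[:, 2] - boxes_t[:, 0]) * (boxes_t[:, 3] - boxes_t[:, 1])
    return 1.0 - inter / (area_d[:, None] + area_t[None, :] - inter + 1e-9)

def appear_iou_match(boxes_d, emb_d, boxes_t, emb_t, alpha=0.6, max_cost=0.8):
    """Weighted Appear-IOU cost (Eqs. 9-10) solved with the Hungarian algorithm."""
    emb_d = emb_d / (np.linalg.norm(emb_d, axis=1, keepdims=True) + 1e-9)
    emb_t = emb_t / (np.linalg.norm(emb_t, axis=1, keepdims=True) + 1e-9)
    cos_dist = 1.0 - emb_d @ emb_t.T                                 # Eq. (9)
    cost = alpha * cos_dist + (1 - alpha) * iou_dist(boxes_d, boxes_t)  # Eq. (10)
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
```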
3.2.2. Gau-IOU Distance Matching
The Gau-IOU distance matching process is illustrated in Figure 7. Low-scored detec-
tion boxes often represent small objects. In order to better extract object features, both the
low-scored detections and the trajectories to be matched are transformed into Gaussian
space. This transformation integrates the Wasserstein distance (WD) and the IOU distance
between the Gaussian distributions of trajectories and objects.
Figure 7. Gau-IOU trajectory matching.
We first transform the BBoxes of the object and the trajectory into Gaussian space using a matrix transformation. For an object box represented by (x, y, h, w), the parameters of the Gaussian distribution N(x | µ, Σ) are computed as:

µ = [x, y]^T    (11)

Σ = diag(w²/4, h²/4)    (12)
The key to matching detection boxes with trajectories is how to calculate the similarity
between the Gaussian distributions Nd(xd|µd, Σd) of the detection box and Nt(xt|µt, Σt) of
the trajectory box. We use the Wasserstein distance to compute the distance between the
two Gaussian distributions. The Wasserstein distance between two Gaussian distributions
is defined as:
D_W(N_d, N_t) = ∥µ_d - µ_t∥² + Tr(Σ_d) + Tr(Σ_t) - 2 Tr( (Σ_d^{1/2} Σ_t Σ_d^{1/2})^{1/2} )    (13)
The Wasserstein distance primarily consists of two components: the distance between
the center points, represented by (x, y), and a coupling term related to (h, w). Due to
the chain-like coupling relationship formed by these parameters, which causes them to
influence each other, the Wasserstein distance is highly advantageous for achieving high-
precision matching.
Next, the IOU distances and the WD distances between all detections and trajectories are weighted to obtain the Gau-IOU distance:

D^{GI} = β D_W + (1 - β) D_{ij}^I    (14)
where β represents the proportion of the WD distance and takes values between 0 and
1. Finally, the Hungarian algorithm is employed to achieve the best matching between
detections and trajectories based on all Gau-IOU distances. Unmatched trajectories are
converted to inactive trajectories, and unmatched low-scored detections are removed.
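A minimal sketch of the box-to-Gaussian conversion (Eqs. 11-12) and the closed-form Wasserstein distance between two such Gaussians (Eq. 13) follows; the diagonal-covariance simplification mirrors Equation (12), while the squashing of the WD term and the fusion weight β are assumed values.

```python
import numpy as np

def box_to_gaussian(box):
    """(x, y, h, w) center-format box -> (mu, Sigma) per Eqs. (11)-(12)."""
    x, y, h, w = box
    mu = np.array([x, y], dtype=float)
    sigma = np.diag([w ** 2 / 4.0, h ** 2 / 4.0])
    return mu, sigma

def wasserstein_sq(mu_d, sig_d, mu_t, sig_t):
    """Squared 2-Wasserstein distance between Gaussians (Eq. 13).

    With diagonal covariances, the matrix square roots reduce to element-wise sqrt.
    """
    center = np.sum((mu_d - mu_t) ** 2)
    sqrt_d = np.sqrt(sig_d)                              # valid because sig_d is diagonal
    cross = sqrt_d @ sig_t @ sqrt_d                      # also diagonal
    coupling = np.trace(sig_d) + np.trace(sig_t) - 2 * np.trace(np.sqrt(cross))
    return center + coupling

def gau_iou_cost(det_box, trk_box, iou_d, beta=0.5):
    """Eq. (14): weighted WD + IOU distance (iou_d comes from Eq. 8)."""
    dw = wasserstein_sq(*box_to_gaussian(det_box), *box_to_gaussian(trk_box))
    dw = dw / (1.0 + dw)                                 # squash WD to [0, 1) before mixing (assumption)
    return beta * dw + (1 - beta) * iou_d
```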
3.2.3. OSPA-IOU Distance Matching
The OSPA distance allows for considering subpattern matching of object trajectories,
enabling the model to better capture both the similarities and differences between object
trajectories. This, in turn, provides a more accurate assessment of tracking performance.
Building upon the foundation of IOU distance matching, we comprehensively consider
the OSPA distance and propose the OSPA-IOU trajectory matching model. The process is
illustrated in Figure 8.
Figure 8. OSPA-IOU trajectory matching.
Assume the object state set is X = {x_1, x_2, . . . , x_m} and the object trajectory set is Y = {y_1, y_2, . . . , y_n}, where m, n ∈ N_0 = {0, 1, 2, . . .} represent the estimated and true numbers of objects, respectively. For m ≤ n, the OSPA distance is expressed as:

D_{p,c}(X, Y) = ( (1/n) [ min_{π∈Π_n} Σ_{i=1}^{m} d_c(x_i, y_{π(i)})^p + (n - m) c^p ] )^{1/p}    (15)
where Πn represents all permutations for selecting numbers from the set {1, 2, . . . , n} . If
p = 1, the OSPA distance can be expressed as:
D_{p,c}(X, Y) = e_{p,c}^{loc}(X, Y) + e_{p,c}^{card}(X, Y)    (16)

e_{p,c}^{loc}(X, Y) = ( (1/n) min_{π∈Π_n} Σ_{i=1}^{m} d_c(x_i, y_{π(i)})^p )^{1/p}    (17)

e_{p,c}^{card}(X, Y) = ( (1/n) (n - m) c^p )^{1/p}    (18)

where e_{p,c}^{loc}(X, Y) and e_{p,c}^{card}(X, Y) represent the positional difference and cardinality difference
between the sets of estimated object states and true object states, respectively. The posi-
tional difference signifies the spatial gap, while the cardinality difference encompasses
performance metrics like the false track proportion, redundancy, and interruptions. The
truncation parameter adjusts the balance between positional and cardinality differences,
with smaller values prioritizing positional differences. Treating objects as single-element
sets and trajectories as multi-element sets, we compute the OSPA distance between them to
optimize matching between individual detections and trajectories.
Subsequently, the IOU distances and the OSPA distances between all detections and trajectories are weighted to derive the OSPA-IOU distance:

D^{OI} = λ D_{p,c} + (1 - λ) D_{ij}^I    (19)
where λ represents the proportion of the OSPA distance and takes values between 0 and 1.
Finally, the Hungarian algorithm is applied to achieve the best matching between detections
and trajectories based on all OSPA-IOU distances. Unmatched high-score detections are
converted to new inactive trajectories, and inactive trajectories that remain unmatched for
30 frames are removed.
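As an illustration of Equations (15)-(18) with p = 1, the sketch below computes the OSPA distance between a detection (treated as a single-element set) and a multi-element trajectory set; the Euclidean base distance, the cutoff value c, and the point-set representation are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ospa_p1(X, Y, c=1.0):
    """OSPA distance with p = 1 (Eqs. 16-18) between point sets X (m, 2) and Y (n, 2)."""
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    m, n = len(X), len(Y)
    if m > n:                                   # OSPA assumes m <= n; swap otherwise
        X, Y, m, n = Y, X, n, m
    if n == 0:
        return 0.0
    # Cutoff base distance d_c(x, y) = min(c, ||x - y||).
    d = np.minimum(c, np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1))
    rows, cols = linear_sum_assignment(d)       # optimal sub-pattern assignment
    e_loc = d[rows, cols].sum() / n             # Eq. (17) with p = 1
    e_card = (n - m) * c / n                    # Eq. (18) with p = 1
    return e_loc + e_card                       # Eq. (16)

# A detection as a single-element set vs. a trajectory as a multi-element set.
det = np.array([[10.0, 12.0]])
trajectory = np.array([[9.0, 11.0], [10.5, 12.5], [12.0, 14.0]])
print(ospa_p1(det, trajectory, c=5.0))
```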
3.2.4. Visual Gaussian Mixture Probability Hypothesis Density
The visual Gaussian mixture probability hypothesis density (VGM-PHD) filtering
algorithm utilizes the center positions of all trajectories as the measurement input for
the random finite set, preserving object ID and size data to reconstruct trajectories. As-
sumptions include representing both spawned and newly born object PHDs as Gaussian
mixtures, independence between object detection and survival probabilities, and modeling
state transition density and observation likelihood functions as linear Gaussian models.
Both the motion model and observation model of the VGM-PHD filtering algorithm
are set to be linear, and noise and errors follow Gaussian distributions. Using the weights,
means, and variances of the PHD Gaussian distribution, the algorithm iteratively propa-
gates the multi-object states. The specific implementation steps of the VGM-PHD filtering
algorithm are as follows. Assuming the posterior PHD at a specific time is given by the
following Gaussian sum form:
T_{k-1}(x) = Σ_{i=1}^{J_{k-1}} ω_{k-1}^i N(x; m_{k-1}^i, P_{k-1}^i)    (20)

where ω_k^i, m_k^i, and P_k^i represent the weight, mean, and covariance of the i-th Gaussian component at time k for a single object state x, and J_k represents the number of Gaussian components at time k. The function N(·) denotes a Gaussian density. The predicted intensity function at time k is given by:
T_{k|k-1}(x) = T_{S,k|k-1}(x) + T_{β,k|k-1}(x) + γ_k(x)    (21)
The three terms on the right side respectively represent the predicted PHDs of surviv-
ing objects, spawned objects, and newly born objects. The intensity function obtained from
the GM-PHD filtering algorithm update can be expressed as:
T_k(x) = (1 - P_{D,k}) T_{k|k-1}(x) + Σ_{z∈Z_k} T_{D,k}(x; z)    (22)
where the first term represents the PHD of missed objects, and the second term represents
the updated PHD of detected objects. In the VGM-PHD filtering algorithm, if the PHD at time k − 1 is a Gaussian mixture, then the prior distribution generated by the prediction at time k and the posterior distribution obtained by the filtering update can both be represented in Gaussian mixture form. The weights can be obtained through PHD filtering, while the means and
covariances are recursively obtained through Kalman filtering. During the prediction and
update of the object PHD in VGM-PHD, the predicted object numbers Nk|k−1 and updated
object numbers Nk are given by:
N_{k|k-1} = Σ_{i=1}^{J_{k|k-1}} ω_{k|k-1}^i = N_{k-1} ( P_{S,k} + Σ_{i=1}^{J_{β,k}} ω_{β,k}^i ) + Σ_{j=1}^{J_{γ,k}} ω_{γ,k}^j    (23)

N_k = Σ_{n=1}^{J_k} ω_k^n = N_{k|k-1} (1 - P_{D,k}) + Σ_{z∈Z_k} Σ_{j=1}^{J_{k|k-1}} ω_k^j(z)    (24)
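The sketch below illustrates the linear-Gaussian prediction and update recursion behind Equations (20)-(24) for the Gaussian mixture components; the constant-velocity model, the survival/detection probabilities, and the clutter intensity are assumed values, and the birth and spawn terms are omitted for brevity.

```python
import numpy as np

F = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)  # constant velocity
H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)                               # observe (x, y)
Q, R = np.eye(4) * 1.0, np.eye(2) * 4.0
P_S, P_D, kappa = 0.99, 0.9, 1e-4          # survival prob., detection prob., clutter intensity

def gmphd_step(weights, means, covs, measurements):
    """One GM-PHD prediction + update over components (w_i, m_i, P_i); birth/spawn omitted."""
    # Prediction for surviving components: scale weights by P_S, Kalman-predict mean/covariance.
    weights = [P_S * w for w in weights]
    means = [F @ m for m in means]
    covs = [F @ P @ F.T + Q for P in covs]

    # Missed-detection terms: (1 - P_D) * predicted PHD.
    new_w = [(1 - P_D) * w for w in weights]
    new_m = list(means)
    new_P = list(covs)

    # Measurement-updated terms for every measurement z in Z_k.
    for z in measurements:
        ws, ms, Ps = [], [], []
        for w, m, P in zip(weights, means, covs):
            S = H @ P @ H.T + R
            K = P @ H.T @ np.linalg.inv(S)
            innov = z - H @ m
            lik = np.exp(-0.5 * innov @ np.linalg.inv(S) @ innov) / \
                  (2 * np.pi * np.sqrt(np.linalg.det(S)))
            ws.append(P_D * w * lik)
            ms.append(m + K @ innov)
            Ps.append((np.eye(4) - K @ H) @ P)
        norm = kappa + sum(ws)                  # clutter intensity + sum over components
        new_w += [wi / norm for wi in ws]
        new_m += ms
        new_P += Ps
    return new_w, new_m, new_P                  # expected object count N_k = sum(new_w)
```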
4. Experiments
4.1. Dataset and Evaluation Metrics
The proposed algorithm undergoes comprehensive evaluations on the VisDrone
MOT [22] and UAVDT [23] datasets, which encompass diverse drone-captured scenes
and facilitate a thorough assessment of the proposed methods’ practical effectiveness. Ex-
tensive evaluations compare the algorithm with other leading multi-object trackers across
various scenarios and conditions. Established MOT evaluation metrics are utilized to
assess performance comprehensively, with the aim of gauging overall effectiveness and
pinpointing potential weaknesses in each model. The metrics include:
(1) FP (↓): Number of false positives in the entire video.
(2) FN (↓): Number of false negatives in the entire video.
(3) IDSW (↓): Number of identity switches in the entire video.
(4) FM (↓): Number of ground truth trajectories interrupted during the tracking process.
(5) IDF1 (↑): Ratio of correctly identified detections to the average number of computed detections and ground truth detections.
(6) MOTA (↑): Combines FP, FN, and IDSW, computed as follows:

MOTA = 1 - (FN + FP + IDSW) / GT    (25)

where GT is the total number of ground truth objects.
(7) MOTP (↑): Measures the alignment between ground truth and predicted results over all matches, calculated as:

MOTP = 1 - (Σ_{t,i} d_{t,i}) / (Σ_t c_t)    (26)

where d_{t,i} is the localization error of the i-th match in frame t and c_t is the number of matches in frame t.
These metrics contribute to a comprehensive assessment of MOT algorithm perfor-
mance in various aspects, providing in-depth insights into system effectiveness.
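A toy computation of Equations (25) and (26) from sequence-level counts is shown below; the input numbers are invented solely to demonstrate the formulas.

```python
import numpy as np

def mota(fn, fp, idsw, num_gt):
    """Eq. (25): 1 - (FN + FP + IDSW) / GT, accumulated over the whole sequence."""
    return 1.0 - (fn + fp + idsw) / num_gt

def motp(dist_per_match, matches_per_frame):
    """Eq. (26): 1 - (sum of match distances) / (total number of matches)."""
    return 1.0 - np.sum(dist_per_match) / np.sum(matches_per_frame)

# Invented counts, for illustration only.
print(mota(fn=120, fp=45, idsw=8, num_gt=1000))                 # 0.827
print(motp(np.array([0.21, 0.18, 0.25]), np.array([1, 1, 1])))
```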
4.2. Training Preprocessing
Existing MOT methods integrating object detection and appearance embedding often
use a single-stage training approach, where detection and appearance branches are trained
simultaneously. While this reduces training time, it can harm detection performance due
to differing learning objectives. In densely populated scenes, fully occluded objects may
still have annotated bounding boxes in the training dataset, which can introduce errors
when learning appearance embeddings and can reduce tracking accuracy. To address
this, our proposed model filters highly occluded objects from the training samples before
commencing model training. To implement this, we initially define a metric variable
Boverlap ∈ [0, 1] to gauge the overlap between two ground truth Bboxes; the metric is
defined as follows:
B_overlap = area(BBox_GT^i ∩ BBox_GT^j) / area(BBox_GT^i ∪ BBox_GT^j)    (27)

where BBox_GT^i and BBox_GT^j represent the i-th and j-th ground truth BBoxes of the input
training samples, respectively. A higher value of the variable indicates greater overlap
between the two ground-truth BBoxes. In object detection, a value Boverlap ≥ 0.75 signifies
substantial overlap between two BBoxes. Therefore, in this study, we set the threshold at
Boverlap ≥ 0.75, considering smaller BBoxes as indicative of occluded objects and excluding
them from the training dataset. We ultimately train the model using the filtered dataset.
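The preprocessing rule above can be expressed as a short filter over the ground-truth annotations, sketched below; the 0.75 threshold follows the text, while the corner box format and the rule of dropping the smaller of two heavily overlapping boxes reflect the description rather than released code.

```python
def overlap_ratio(a, b):
    """B_overlap of Eq. (27) for boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def filter_occluded(gt_boxes, thr=0.75):
    """Drop the smaller box of any ground-truth pair with B_overlap >= thr."""
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    keep = set(range(len(gt_boxes)))
    for i in range(len(gt_boxes)):
        for j in range(i + 1, len(gt_boxes)):
            if overlap_ratio(gt_boxes[i], gt_boxes[j]) >= thr:
                keep.discard(i if area(gt_boxes[i]) < area(gt_boxes[j]) else j)
    return [gt_boxes[i] for i in sorted(keep)]

boxes = [(10, 10, 60, 80), (12, 12, 58, 78), (200, 50, 260, 140)]
print(filter_occluded(boxes))   # the nearly enclosed smaller box is removed
```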
4.3. Experimental Settings
The detector is initialized with pre-existing weights obtained from training on the
COCO dataset. We train the detector using SGD with the following parameters: 150 epochs,
a batch size of 16, a learning rate of 0.02, momentum set to 0.9, and decay set to 0.0001.
We train the detector on both the VisDrone and UAVDT datasets and perform validation
using the same set of verification images. We execute the testing on hardware (NVIDIA
RTX 4090 with 24 GB of memory) and calculate the average of the top-100 most reliable
detection results.
4.4. Comparative Experiments
4.4.1. Detection Comparison
To compare the performance of our detector, we select a total of seven excellent
detectors: DETR [17], Deformable DETR [24], YOLO-S [25], Swin-JDE [18], VitDet [26],
RTD-Net [27], and DN-DETR [28]. They are trained and evaluated on the VisDrone and
UAVDT datasets using the experimental settings described in their respective papers. DETR
completely discards traditional object detection components such as anchor boxes and non-
maximum suppression and utilizes a complete attention mechanism for end-to-end object
detection. Deformable DETR is an improved version of DETR that introduces deformable
attention to enhance the model’s adaptability to changes in object shape and scale. YOLO-S
employs a small feature extractor, skip connections, cascaded skip connections, and a
reshaping pass-through layer to facilitate cross-network feature reuse, combining low-
level positional information with more meaningful high-level information. The Swin-JDE
algorithm adopts a Swin transformer based on windowed self-attention as the backbone
network to enhance feature extraction capabilities. ViTDet utilizes ViT as the backbone for
a Mask R-CNN object detection model, enhancing competitiveness by optimizing the RPN
section. RTD-Net replaces positional linear projection with convolutional projection and
uses an efficient convolutional multi-head self-attention algorithm based on convolutional
transformer blocks to improve the recognition of occluded objects by extracting contextual
information. DN-DETR introduces a novel denoising training approach to address the
instability of bipartite graph matching in the DETR decoder during training, doubling the
convergence speed and significantly improving the detection results.
The comparative results in Table 1 demonstrate the substantial advantages of our
detection performance. AP is the average accuracy, and AP@0.5 and AP@0.75 indicate
intersection-to-union ratios greater than 50% and 75%, respectively. APs, APm, and APl
are the average accuracies for small objects (with an area less than 32 × 32 pixels), medium
objects (with an area between 32 × 32 and 96 × 96 pixels), and large objects (with an area
greater than 96 × 96 pixels), respectively. The visual comparison results in Figures 9 and 10
show that our results exhibit excellent performance under various lighting conditions and
crowded environments.
Figure 9. Comparison of detection results on the VisDrone dataset. Panels: (a) DETR, (b) Deformable DETR, (c) YOLOS, (d) Swin-JDE, (e) VitDet, (f) RTD-Net, (g) DN-DETR, (h) Holistic Trans-Det.
Figure 10. Comparison of detection results on the UAVDT dataset. Panels: (a) DETR, (b) Deformable DETR, (c) YOLOS, (d) Swin-JDE, (e) VitDet, (f) RTD-Net, (g) DN-DETR, (h) Holistic Trans-Det.
Table 1. The detection results of the detectors on the datasets.
Dataset Detector AP AP@0.5 AP@0.75 APs APm APl
VisDrone
DETR [17] 34.8 63.4 32.2 12.8 38.5 55.6
Deformable DETR [24] 36.9 60.4 35.2 9.9 38.1 52.7
YOLOS [25] 36.6 63.1 38.7 15.4 39.9 54.9
Swin-JDE [18] 38.2 60.5 34.8 11.1 41.4 57.6
VitDet [26] 38.9 64.7 38.7 19.6 40.5 57.8
RTD-Net [27] 38.1 64.6 40.2 17.6 42.8 57.6
DN-DETR [28] 39.4 63.4 36.5 16.8 42.5 59.2
Holistic Trans-Det 39.6 67.9 40.8 18.6 40.3 59.4
UAVDT
DETR [17] 48.8 69.3 49.3 28.0 47.5 57.1
Deformable DETR [24] 47.2 69.2 50.3 29.0 53.2 59.4
YOLOS [25] 49.3 71.1 51.4 32.3 50.4 58.9
Swin-JDE [18] 49.6 69.9 52.8 33.9 54.8 59.7
VitDet [26] 54.6 68.9 59.5 37.5 57.9 61.0
RTD-Net [27] 52.2 71.4 55.6 36.3 57.2 60.9
DN-DETR [28] 56.7 68.6 60.2 38.7 59.8 62.9
Holistic Trans-Det 57.5 69.0 60.5 38.8 61.5 67.9
4.4.2. Tracking Comparison
We compared DeepSORT [29], ByteTrack [30], BoT-SORT [31], UAVMOT [32], DC-
MOT [33], TFAM [34], MTTJDT [35], and SimpleTrack [36] as well as transformer-based meth-
ods including TransTrack [37], TrackFormer [38], TransCenter [39], MOTR [20], MeMOT [21],
GTR [40], TR-MOT [41], GCEVT [42], STN-Track [43], and STDFormer [19]. These compar-
isons were conducted on the VisDrone MOT and UAVDT datasets.
To ensure consistent comparisons despite variations in object distributions across
datasets, we employed the holistic trans-detector to produce uniform detection results
for all tracking comparison methods. This approach mitigates evaluation bias stemming
from uneven category distributions, fostering fairer and more reliable tracking method
comparisons. To maintain detection accuracy across categories during evaluation, distinct
thresholds were applied: 0.3 for cars, 0.1 for trucks, and 0.4 for pedestrians, with a lower
threshold of 0.05 for buses, which present greater visual variability.
Tables 2 and 3 comprehensively compare GAO-Tracker with other popular trackers
on the VisDrone MOT and UAVDT datasets. The evaluation includes critical metrics such
as MOTA, MOTP, IDF1, and IDSW and comparisons with other methods. GAO-Tracker
demonstrates excellent performance by effectively utilizing position and appearance in-
formation. DeepSORT associates categories independently using positional information.
ByteTrack utilizes low-scoring detection for similarity tracking and background noise
filtering. BoT-SORT incorporates camera motion compensation for improved matching.
UAVMOT enhances object feature association with an ID feature update module. Simple-
Track merges object embedding cosine and GIOU distances to create a new association
matrix. Transformer-based methods like TransTrack employ a query–key mechanism for
existing object tracking and new object detection. TrackFormer considers position, occlu-
sion, and object recognition features simultaneously. TransCenter predicts the association’s
heatmap of object centers globally. MOTR models the entire trajectory of an object using a
tracking query. MeMOT uses information from previous frames for tracking clues. GTR
extends the window length for matching and utilizes interaction information fully. TR-
MOT achieves reliable associations using visual temporal features. STDFormer utilizes the
transformer’s remote modeling capability for intent and decision information extraction.
However, these methods apply a single matching rule for all detection classes, leading to
inaccurate tracking of various object classes and poorer performance.
Table 2. Comparison between GAO-Tracker and the latest multiple trackers tested on the Vis-
Drone dataset.
Tracker MOTA↑ MOTP↑ IDF1 (%)↑ IDSW↓ MT (%)↑ ML (%)↑ FP↓ FN↓
Motion-based
DeepSORT [29] 19.4 69.8 33.1 6387 38.8 52.2 15,181 44,830
ByteTrack [30] 25.1 72.6 40.8 4590 42.8 50.3 10,722 24,376
BoT-SORT [31] 23.0 71.6 41.4 7014 51.9 73.6 10,701 47,922
UAVMOT [32] 25.0 72.3 40.5 6644 52.6 49.6 10,134 55,630
DCMOT [33] 33.5 76.1 45.5 1139 - - 12,594 64,856
TFAM [34] 30.9 74.4 42.7 3998 - - 27,732 126,811
MTTJDT [35] 31.2 73.2 43.6 2415 - - 25,976 183,381
Transformer-based
TransTrack [37] 27.3 62.1 28.3 2523 33.5 59.7 15,028 51,396
TrackFormer [38] 24 77.3 38 4724 39 46.3 11,731 32,807
TransCenter [39] 29.9 66.6 46.8 3446 33.4 61.8 15,104 20,894
MOTR [20] 13.1 72.4 47.1 2997 52.9 72 12,216 42,186
MeMOT [21] 29.4 73 48.7 3755 46.7 47.9 9963 30,062
GTR [40] 28.1 76.8 54.5 2000 61.3 57.6 8165 10,553
TR-MOT [41] 29.9 64.3 46 1005 42.8 59.9 7593 17,352
GCEVT [42] 34.5 73.8 50.6 841 520 612 - -
STN-Track [43] 38.6 - 73.7 668 31.4 51.2 7385 76,006
STDFormer [19] 35.9 74.5 59.9 1441 52.7 60.3 8527 20,558
GAO-Tracker 38.8 76.3 54.3 972 55.9 52.4 6883 10,204
Table 3. Comparison between GAO-Tracker and the latest multiple trackers tested on the UAVDT
dataset.
Tracker MOTA↑ MOTP↑ IDF1 (%)↑ IDSW↓ MT (%)↑ ML (%)↑ FP↓ FN↓
Motion-based
DeepSORT [29] 35.9 71.5 58.3 698 43.4 25.7 50,513 59,733
ByteTrack [30] 39.1 74.3 44.7 2341 43.8 28.1 14,468 87,485
BoT-SORT [31] 37.2 72.1 53.1 1692 40.8 27.3 42,286 64,494
UAVMOT [32] 43.0 73.5 61.5 641 45.3 22.7 27,832 65,467
SimpleTrack [36] 45.3 73.9 57.1 1404 43.6 22.5 21,153 53,448
TFAM [34] 47.0 72.9 67.8 506 - - 68,282 111,959
Transformer-based
TransTrack [37] 33.2 72.4 67.6 1122 38.9 23.8 50,746 54,938
TrackFormer [38] 53.4 74.2 46.3 2247 43.7 23.3 13,719 91,061
TransCenter [39] 48.9 73.9 51.3 2287 32.6 35.1 27,995 93,013
MOTR [20] 35.6 72.5 56.1 1759 39.8 29.3 39,733 56,368
MeMOT [21] 45.6 74.6 62.8 2118 34.9 26.5 38,933 59,156
GTR [40] 46.5 75.3 61.1 1482 42.7 18.6 21,676 52,617
TR-MOT [41] 57.7 74.1 55.7 2461 33.9 21.3 32,217 50,838
GCEVT [42] 47.6 73.4 68.6 1801 618 363 - -
STN-Track [43] 60.6 - 73.1 1420 57.0 17.0 12,825 61,760
STDFormer [19] 60.6 74.8 61.7 1642 44.6 20.3 20,258 41,895
GAO-Tracker 61.7 75.2 67.9 1216 45.3 24.6 24,915 59,640
Combining the data from Tables 2 and 3, we observe that transformer-based methods outperform motion-based methods. This trend reflects the effectiveness of transformer-based methods for multi-object tracking in drone aerial videos: they better capture long-distance dependencies between objects in complex environments and better handle challenges such as object occlusion and scale changes.
Figures 11 and 12 show time-order frames with bounding boxes and different-colored
identities. In the initial images (left), bounding boxes may appear inconsistent due to occlu-
sion. However, in the final images (right), GAO-Tracker maintains consistent bounding
boxes, reducing the identity switching of pedestrians. The center images show intermediate
steps where identities might temporarily switch due to occlusions or overlaps. The final
images (right) demonstrate GAO-Tracker’s ability to preserve identities throughout the se-
quence, even in crowded scenarios. By utilizing object motion information, GAO-Tracker’s
trajectory association technology effectively solves the problems of missed detection and
incorrect detection caused by occlusion, especially in the case of short-term overlapping
objects. Compared with previous algorithms based on bounding box connections, GAO-
Tracker reduces pedestrian identity switching. The results indicate that GAO-Tracker
performs well in crowded scenarios of drone aerial videos and ensures consistent bounding
boxes and identities throughout the entire sequence.
Figure 11. Tracking results of GAO-Tracker on the VisDrone dataset.
Figure 12. Tracking results of GAO-Tracker on the UAVDT dataset.
4.5. Ablation Experiments
To demonstrate the effectiveness of the designed method, we conducted multiple sets
of ablation experiments on training preprocessing strategies, the GAO module, the sequence
of various matching strategies, and VGM-PHD on the VisDrone and UAVDT datasets.
4.5.1. Effect of Backbone
To validate the effectiveness of our holistic trans as the backbone network, we com-
pared it with ResNet50, DLA-34, ViT, and Swin-L and conducted ablation experiments.
Table 4 presents the performance evaluation results of the proposed GAO-Tracker combined
with different backbone networks. This experiment used the proposed data association
method as the post-processing module and evaluated the UAVDT and VisDrone test
datasets. Based on the results in Table 4, we have the following findings: In the evaluation
results of UAVDT, using DLA-34 as the backbone network yielded the best performance,
with MOTA, MOTP, and IDF1 scores reaching 61.9%, 75.1%, and 66.4%, respectively. Ad-
ditionally, using the holistic trans backbone network resulted in the lowest IDSW count.
In the evaluation results of VisDrone, compared to ResNet50, DLA-34, ViT, and Swin-L,
using the holistic trans backbone network achieved 38.8% MOTA, 76.3% MOTP, and 54.3%
IDF1 and a significant reduction in FP. Since VisDrone contains many congested scenes, the
experimental results indicate that using the holistic trans backbone network can improve
MOT performance in crowded scenarios. The tracking performance using the DLA-34 back-
bone network was the best on UAVDT but was significantly worse on VisDrone. In contrast,
using the holistic trans backbone network resulted in inferior tracking performance on
UAVDT but the best performance on VisDrone. The MOTA increase and the FP decrease
using the holistic trans backbone network indicate that our model significantly enhances
the detection capability of correct objects.
Table 4. Performance evaluation of the proposed GAO-Tracker model combined with different
backbone networks.
Dataset Detector Backbone MOTA↑ MOTP↑ IDF1 (%)↑ IDSW↓ MT (%)↑ ML (%)↑ FP↓ FN↓
VisDrone
ResNet-50 19.6 59.9 36.7 4287 35.3 31.3 9078 18,764
DLA-34 34.9 68.5 50.3 2198 46.3 43.5 8818 13,070
ViT 35.2 69.7 51.0 2019 48.9 45.9 8009 12,897
Swin-L 35.5 70.2 52.3 1509 51.9 47.6 6832 12,223
Holistic Trans 38.8 76.3 54.3 972 55.9 52.4 6883 10,204
UAVDT
ResNet-50 56.2 70.3 62.1 2252 40.4 22.6 32,743 72,629
DLA-34 61.9 75.1 66.4 1798 42.4 23.4 28,705 65,616
ViT 60.1 74.0 65.9 1504 42.8 23.7 26,937 62,348
Swin-L 59.6 74.4 66.0 1264 43.9 23.8 25,822 61,324
Holistic Trans 61.7 75.2 67.9 1216 45.3 24.6 24,915 59,640
Based on the observations above, it can be concluded that the backbone network signif-
icantly impacts the tracking performance of multi-object trackers depending on the density
of tracking objects in the scene. Therefore, improving the feature extraction capability
of the backbone network model is a crucial factor affecting the tracking performance of
multi-object trackers.
4.5.2. Impact of Pre-Processing and Detection Results Classification
During training of the multi-object tracking model, we removed highly overlapped objects
from the training set so that the network learns efficient and accurate appearance
embeddings for trajectory matching. We also explored the impact of classifying detection
boxes into high- and low-scoring sets. As shown in Table 5, we verified the effectiveness
of each component by enabling or disabling the training-set optimization and the
detection-score grading. “Pre” denotes training after removing highly overlapped objects,
while “Grade” denotes that the model distinguishes high- and low-scoring detection boxes
before passing them to the GAO trajectory association pattern.
Table 5. Comparison between detection and classification with or without preprocessing.
Dataset Method MOTA↑ MOTP↑ IDF1 (%)↑ IDSW↓ MT (%)↑ ML (%)↑ FP↓ FN↓
VisDrone
Baseline 36.2 70.9 52.5 1344 53.1 49.3 9117 11,987
B+Pre 37.6 71.2 52.8 1320 54.3 50.1 9135 11,499
B+Grade 37.3 74.2 52.7 1138 54.7 51.2 9627 11,060
B+Pre+Grade 38.8 76.3 54.3 972 55.9 52.4 6883 10,204
UAVDT
Baseline 57.8 72.0 64.0 1841 42.4 23.3 29,057 67,373
B+Pre 59.3 74.4 65.6 1398 43.8 23.8 25,836 62,429
B+Grade 60.4 74.7 66.1 1221 44.5 23.9 25,418 60,828
B+Pre+Grade 61.7 75.2 67.9 1216 45.3 24.6 24,915 59,640
The results indicate that removing ground-truth BBox annotations for occluded objects
reduces errors in learning appearance embeddings, thereby improving the accuracy of
tracked-object identification. Differentiating between low- and high-scoring detection
boxes effectively reduces trajectory fragmentation and IDSW, enhancing tracking
performance. With both preprocessing and detection-result classification, MOTA, MOTP,
and IDF1 improved by 2.6%, 5.4%, and 1.8% on VisDrone and by 3.9%, 3.2%, and 3.9% on
UAVDT, respectively.
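To make the “Pre” and “Grade” steps concrete, the sketch below shows one way they could be implemented. The IoU-overlap threshold used to discard occluded ground-truth boxes is an assumed value, the 0.5/0.2 confidence thresholds mirror those used by the GAO matching pattern in Section 3.2, and all function names are illustrative rather than taken from our implementation.

```python
import numpy as np

def iou_matrix(boxes_a: np.ndarray, boxes_b: np.ndarray) -> np.ndarray:
    """Pairwise IoU between two sets of [x1, y1, x2, y2] boxes."""
    x1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def drop_heavily_overlapped(gt_boxes: np.ndarray, overlap_thr: float = 0.7) -> np.ndarray:
    """'Pre': discard ground-truth boxes whose mutual IoU exceeds a threshold so
    occluded objects do not corrupt the appearance-embedding targets.
    The 0.7 threshold is an assumed value, not one reported in this paper."""
    iou = iou_matrix(gt_boxes, gt_boxes)
    np.fill_diagonal(iou, 0.0)
    return gt_boxes[iou.max(axis=1) < overlap_thr]

def grade_detections(boxes: np.ndarray, scores: np.ndarray,
                     high_thr: float = 0.5, low_thr: float = 0.2):
    """'Grade': split detections into high- and low-confidence sets using the
    0.5 / 0.2 thresholds of the GAO matching pattern."""
    high = boxes[scores > high_thr]
    low = boxes[(scores >= low_thr) & (scores <= high_thr)]
    return high, low
```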
4.5.3. Impact of Matching Strategies
We validated the individual contributions of each component by combining different
association strategies, as shown in Table 6. The baseline uses IOU matching for all
associations; we then progressively substitute Appear-IOU, Gau-IOU, and OSPA-IOU for the
corresponding matching stages. The results indicate that all three proposed association strategies effectively
enhance the accuracy of tracking associations. The baseline model shows significantly
higher FP and more IDSW, indicating a higher number of false detections introduced by
the model, resulting in poor trajectory matching quality and increased identity switching.
After replacing the high-score detection box matching strategy with Appear-IOU, MOTA and
IDF1 improved noticeably; FP rose slightly, but FN was substantially reduced. After
replacing the low-score detection box matching strategy with Gau-IOU, MOTA and MOTP
improved significantly while IDSW dropped substantially, demonstrating the benefit of
matching small, low-score detection boxes in Gaussian space. Finally, substituting the
OSPA-IOU distance for object-to-trajectory matching, the remaining high-score detection
boxes are treated as a set and matched against the set of unmatched trajectories, which
improves all metrics. Together, these results indicate that each of the proposed
strategies contributes to better overall tracking performance.
Table 6. Comparison of different association strategies.
Dataset Method MOTA↑ MOTP↑ IDF1 (%)↑ IDSW↓ MT (%)↑ ML (%)↑ FP↓ FN↓
VisDrone
Baseline 36.2 70.9 52.5 1552 53.1 49.3 9117 11,987
B+Appear-IOU 37.1 74.4 53.2 1334 53.2 49.2 9209 11,027
B+Appear-IOU+Gau-IOU 38.3 75.6 53.9 1052 53.8 49.9 7343 10,946
B+Appear-IOU+Gau-IOU+OSPA-IOU 38.8 76.3 54.3 972 55.9 52.4 6883 10,204
UAVDT
Baseline 57.8 72.0 64.0 1841 42.4 23.3 29,057 67,373
B+Appear-IOU 58.2 73.8 64.9 1536 43.0 23.7 29,133 63,781
B+Appear-IOU+Gau-IOU 60.9 74.9 66.3 1297 45.0 24.0 25,011 60,369
B+Appear-IOU+Gau-IOU+OSPA-IOU 61.7 75.2 67.9 1216 45.3 24.6 24,915 59,640
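The cascade evaluated in Table 6 can be summarized as three successive Hungarian assignments over different cost matrices. The sketch below captures that control flow under stated assumptions: the appear_iou_cost, gau_iou_cost, and ospa_iou_cost callables stand in for the Appear-IOU, Gau-IOU, and OSPA-IOU distances defined in Section 3.2, and the gating threshold of 0.8 is illustrative rather than a value reported in the paper.

```python
from typing import Callable, Sequence

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_stage(tracks: Sequence, dets: Sequence,
                cost_fn: Callable[[object, object], float], gate: float):
    """One association stage: build a cost matrix, solve it with the Hungarian
    algorithm, and reject pairs whose cost exceeds the gating threshold."""
    if len(tracks) == 0 or len(dets) == 0:
        return [], list(range(len(tracks))), list(range(len(dets)))
    cost = np.array([[cost_fn(t, d) for d in dets] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= gate]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_tracks = [i for i in range(len(tracks)) if i not in matched_t]
    unmatched_dets = [j for j in range(len(dets)) if j not in matched_d]
    return matches, unmatched_tracks, unmatched_dets

def gao_cascade(active_tracks, inactive_tracks, high_dets, low_dets,
                appear_iou_cost, gau_iou_cost, ospa_iou_cost, gate=0.8):
    """Three-stage association in the spirit of the GAO pattern; the cost
    callables stand in for the Appear-IOU, Gau-IOU, and OSPA-IOU distances."""
    # Stage 1: active trajectories vs. high-confidence detections.
    m1, um_trk, um_high = match_stage(active_tracks, high_dets, appear_iou_cost, gate)
    # Stage 2: leftover trajectories vs. low-confidence detections in Gaussian space.
    leftover_tracks = [active_tracks[i] for i in um_trk]
    m2, _, _ = match_stage(leftover_tracks, low_dets, gau_iou_cost, gate)
    # Stage 3: inactive trajectories vs. leftover high-confidence detections.
    leftover_high = [high_dets[j] for j in um_high]
    m3, _, new_candidates = match_stage(inactive_tracks, leftover_high, ospa_iou_cost, gate)
    return m1, m2, m3, new_candidates  # unmatched high-score boxes seed new tracks
```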
4.5.4. Impact of VGM-PHD
We designed ablation experiments to validate the effectiveness of the VGM-PHD
method. We compared this method against no trajectory prediction and the use of a Kalman
filter. The results, presented in Table 7, indicate that VGM-PHD achieves higher
prediction accuracy and robustness than both no trajectory prediction and the Kalman
filter across multiple scenarios. In complex environments in particular, the proposed
method overcomes the limitations of the traditional approaches and predicts the future
positions of moving objects more accurately. Moreover, the decrease in IDSW and the
increase in IDF1 indicate more stable trajectory tracking; consequently, overall tracking
performance is enhanced.
Table 7. Comparison with and without trajectory prediction.
Dataset Method MOTA↑ MOTP↑ IDF1 (%)↑ IDSW↓ MT (%)↑ ML (%)↑ FP↓ FN↓
VisDrone
No trajectory prediction 29.9 64.4 49.3 2497 42.8 42.8 8719 15,226
Kalman Filter 35.3 69.9 50.6 1727 51.4 47.5 8998 12,302
VGM-PHD 38.8 76.3 54.3 972 55.9 52.4 6883 10,204
UAVDT
No trajectory prediction 43.1 61.2 46.4 4437 32.3 17.4 49,018 99,620
Kalman Filter 52.8 68.0 56.1 3069 37.4 21.0 38,389 82,471
VGM-PHD 61.7 75.2 67.9 1216 45.3 24.6 24,915 59,640
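For context on the comparison in Table 7, the sketch below shows a generic prediction step under a constant-velocity motion model: with a single Gaussian it reduces to the Kalman filter baseline, while propagating a weighted mixture of Gaussians with a survival probability corresponds to the prediction stage of a GM-PHD-style filter. This is an illustrative simplification rather than our VGM-PHD implementation, and the process-noise and survival-probability values are assumptions.

```python
import numpy as np

def constant_velocity_model(dt: float = 1.0):
    """State = [cx, cy, vx, vy]; F is the transition matrix and Q an assumed
    process-noise covariance for a constant-velocity motion model."""
    F = np.array([[1.0, 0.0, dt, 0.0],
                  [0.0, 1.0, 0.0, dt],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    Q = 0.01 * np.eye(4)
    return F, Q

def predict_components(means, covs, weights, p_survival: float = 0.99):
    """GM-PHD-style prediction: every Gaussian component is propagated through
    the motion model and its weight is discounted by the survival probability.
    With a single component (and the weight ignored), this reduces to the
    prediction step of the Kalman filter baseline in Table 7."""
    F, Q = constant_velocity_model()
    new_means = [F @ m for m in means]
    new_covs = [F @ P @ F.T + Q for P in covs]
    new_weights = [p_survival * w for w in weights]
    return new_means, new_covs, new_weights
```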
5. Discussion
In this paper, we have integrated the strengths of joint detection and visual multi-
object tracking algorithms with transformer-based visual multi-object tracking algorithms
to address the unique challenges posed by drone aerial videos. Our proposed GAO-Tracker,
which models object motion information, has demonstrated significant improvements in
tracking performance, particularly in complex real-world scenarios.
5.1. Performance Analysis
GAO-Tracker’s performance on the VisDrone and UAVDT datasets has shown remark-
able results, surpassing existing state-of-the-art methods in terms of both accuracy and
robustness. The integration of the transformer model’s global context capturing capabilities
with the joint detection and tracking methods’ handling of occlusions and scale variations
has proven effective. The results indicate that our approach can maintain high tracking
accuracy even in challenging environments that are characterized by rapid object motion,
complex backgrounds, and varying object scales.
5.2. Strengths
(1) Enhanced accuracy: By leveraging the transformer model’s self-attention mech-
anism, GAO-Tracker effectively captures long-range dependencies and global contexts,
which are crucial for accurately tracking multiple objects in aerial videos.
(2) Robustness to occlusions and scale variations: The joint detection and tracking
methods integrated into GAO-Tracker enable it to handle occlusions and significant scale
variations efficiently, ensuring continuous and reliable tracking.
(3) Practical solutions: GAO-Tracker provides practical solutions to real-world multi-
object tracking problems, making it highly applicable in various domains such as surveil-
lance, rescue operations, and urban planning.
5.3. Limitations
(1) Computational complexity: Despite its accuracy and robustness, the computational
expense associated with the transformer model’s fine-grained self-attention mechanism
remains a challenge. This could potentially limit the real-time applicability of GAO-Tracker
in resource-constrained environments.
(2) Scalability: While GAO-Tracker performs well on benchmark datasets, its scala-
bility to handle extremely large-scale datasets or highly crowded scenes requires further
exploration and optimization.
5.4. Future Directions
To further enhance the performance and applicability of GAO-Tracker, several future
research directions are proposed:
(1) Algorithm optimization: Efforts will focus on optimizing the algorithm to reduce
computational complexity and improve real-time performance. This includes exploring
more efficient implementations of the transformer model and refining the integration with
detection and tracking components.
(2) Broader application areas: Extending the research findings to benefit more diverse
fields is a key future direction. Improvements in drone-based multi-object tracking can be
adapted for use in autonomous driving, security systems, wildlife monitoring, and other
domains requiring accurate and reliable tracking of multiple objects.
(3) Handling complex scenarios: Further research is needed to enhance GAO-Tracker’s
performance in highly dynamic and crowded environments. This includes developing
methods to better handle dense object interactions and rapidly changing scenes.
(4) Long-term tracking: Enhancing the system’s ability to maintain long-term tracking
stability and accuracy, particularly in scenarios with frequent object disappearances and
reappearances, is another important area for future work.
6. Conclusions
This paper aims to integrate the strengths of joint detection and visual multi-object
tracking algorithms with transformer-based visual multi-object tracking algorithms to
improve the performance of multi-object tracking in drone aerial videos. Additionally,
we propose a more comprehensive, robust, and efficient integrated multi-object tracking
algorithm by modeling object motion information.
By leveraging the advanced capabilities of transformer models to capture global
contexts and the strengths of joint detection and tracking methods in handling occlusions
and scale variations, our approach addresses the unique challenges posed by drone aerial
videos, such as rapid object motion, complex environmental conditions, and data noise. This
integration allows for more accurate and reliable tracking of multiple objects, enhancing the
overall performance and robustness of tracking systems in various real-world scenarios.
A series of novel results have been achieved in the drone aerial multi-object tracking
field, with GAO-Tracker demonstrating excellent results on the VisDrone and UAVDT
datasets. These datasets, which are widely used benchmarks in the field, have shown that
our method significantly outperforms existing state-of-the-art methods in terms of both
accuracy and robustness. This indicates GAO-Tracker’s strong potential for practical appli-
cations in surveillance, rescue operations, agriculture, and urban planning, among others.
The practical solutions provided by GAO-Tracker to multi-object tracking problems
in real-world scenarios offer new ideas and methods for the development of drone visual
tracking. Our approach not only contributes to the current body of knowledge but also
paves the way for future research in this area. In the future, efforts will focus on improving
and optimizing algorithms to enhance multi-object tracking performance further. This
includes refining the integration of detection and tracking components, enhancing the effi-
ciency of the transformer model, and exploring new ways to handle challenging scenarios
such as crowded environments and dynamic backgrounds.
Additionally, endeavors will be made to extend research findings to broader appli-
cation areas to benefit more diverse fields. For instance, improvements in drone-based
multi-object tracking can be adapted for use in autonomous driving, security systems,
wildlife monitoring, and other areas where real-time, accurate tracking of multiple objects
is critical. By expanding the applicability of our research, we aim to contribute to the
advancement of technology across various domains, ultimately enhancing the capabilities
and reliability of multi-object tracking systems.
Author Contributions: Conceptualization, Y.Y.; methodology, Y.Y., Y.W., and Y.L.; software, Y.P.
and Y.L.; formal analysis, Y.W.; investigation, Y.W.; resources, Y.P.; data curation, L.Z., Y.P., and
Y.L.; writing—original draft, Y.Y.; writing—review and editing, Y.W. and L.Z.; visualization, Y.L.;
supervision, L.Z.; project administration, Y.P. All authors have read and agreed to the published
version of the manuscript.
Funding: This work was supported by Funding for Outstanding Doctoral Dissertation in NUAA
under grant BCXJ24-10, the Postgraduate Research and Practice Innovation Program of Jiangsu
Province under grant KYCX24_0583, the National Natural Science Foundation of China under grant
61573183, and the Natural Science Foundation of Shaanxi Province of China under grant 2024JC-
YBQN-0695.
Data Availability Statement: The data presented in this study are available on request from the
corresponding author.
Conflicts of Interest: The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
MOT Multiple object tracking
GAO Gaussian, appearance, and optimal subpattern assignment
IOU Intersection over union
OSPA Optimal subpattern assignment
VGM-PHD Visual Gaussian mixture probability hypothesis density
MOTA Multiple object tracking accuracy
MOTP Multiple object tracking precision
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.

More Related Content

PDF
Survey Multiple Object Tracking Survey Paper
PDF
Real-time object detection and video monitoring in Drone System
PDF
A Literature Review on Vehicle Detection and Tracking in Aerial Image Sequenc...
PDF
RuiLi_CVVT2016
PDF
Person Detection in Maritime Search And Rescue Operations
PDF
Person Detection in Maritime Search And Rescue Operations
PPTX
sensor fusion presentation iit kanpur ashish
PDF
MULTIPLE OBJECTS AND ROAD DETECTION IN UNMANNED AERIAL VEHICLE
Survey Multiple Object Tracking Survey Paper
Real-time object detection and video monitoring in Drone System
A Literature Review on Vehicle Detection and Tracking in Aerial Image Sequenc...
RuiLi_CVVT2016
Person Detection in Maritime Search And Rescue Operations
Person Detection in Maritime Search And Rescue Operations
sensor fusion presentation iit kanpur ashish
MULTIPLE OBJECTS AND ROAD DETECTION IN UNMANNED AERIAL VEHICLE

Similar to Multiple Object Tracking in Drone Aerial Videos by a Holistic Transformer and Multiple Feature Trajectory Matching Pattern (20)

PDF
IRJET- Object Detection in Real Time using AI and Deep Learning
PDF
Object Detection in UAVs
PDF
Vot presentation
PPTX
[20240603_LabSeminar_Huy]TransMOT: Spatial-Temporal Graph Transformer for Mul...
PPTX
Video Multi-Object Tracking using Deep Learning
PDF
Object Detection and Tracking AI Robot
PDF
ReconTraj4Drones: A Framework for the Reconstruction and Semantic Modeling of...
PDF
IRJET- A Survey on Object Detection using Deep Learning Techniques
PDF
3d object detection and recognition : a review
PDF
[DL輪読会]Tracking Objects as Points
PDF
Object tracking final
PDF
Object tracking presentation
PDF
O180305103105
PDF
ObjectDetectionUsingMachineLearningandNeuralNetworks.pdf
PDF
Crowd Counting from UAVs (ECCV2020)
PDF
Deep sort and sort paper introduce presentation
PDF
IRJET- Comparative Analysis of Video Processing Object Detection
PDF
IRJET- Design the Surveillance Algorithm and Motion Detection of Objects for ...
PDF
Design of an effective multiple objects tracking framework for dynamic video ...
PPTX
Multiple Object Tracking
IRJET- Object Detection in Real Time using AI and Deep Learning
Object Detection in UAVs
Vot presentation
[20240603_LabSeminar_Huy]TransMOT: Spatial-Temporal Graph Transformer for Mul...
Video Multi-Object Tracking using Deep Learning
Object Detection and Tracking AI Robot
ReconTraj4Drones: A Framework for the Reconstruction and Semantic Modeling of...
IRJET- A Survey on Object Detection using Deep Learning Techniques
3d object detection and recognition : a review
[DL輪読会]Tracking Objects as Points
Object tracking final
Object tracking presentation
O180305103105
ObjectDetectionUsingMachineLearningandNeuralNetworks.pdf
Crowd Counting from UAVs (ECCV2020)
Deep sort and sort paper introduce presentation
IRJET- Comparative Analysis of Video Processing Object Detection
IRJET- Design the Surveillance Algorithm and Motion Detection of Objects for ...
Design of an effective multiple objects tracking framework for dynamic video ...
Multiple Object Tracking
Ad

Recently uploaded (20)

PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
DOCX
573137875-Attendance-Management-System-original
PPTX
Construction Project Organization Group 2.pptx
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PDF
composite construction of structures.pdf
PPTX
Sustainable Sites - Green Building Construction
PPTX
additive manufacturing of ss316l using mig welding
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
PDF
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PPTX
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PDF
Digital Logic Computer Design lecture notes
PPT
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
PDF
Well-logging-methods_new................
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PPTX
Internet of Things (IOT) - A guide to understanding
PPTX
UNIT 4 Total Quality Management .pptx
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
573137875-Attendance-Management-System-original
Construction Project Organization Group 2.pptx
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
composite construction of structures.pdf
Sustainable Sites - Green Building Construction
additive manufacturing of ss316l using mig welding
Model Code of Practice - Construction Work - 21102022 .pdf
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
CYBER-CRIMES AND SECURITY A guide to understanding
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
Digital Logic Computer Design lecture notes
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
Well-logging-methods_new................
Operating System & Kernel Study Guide-1 - converted.pdf
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
Internet of Things (IOT) - A guide to understanding
UNIT 4 Total Quality Management .pptx
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
Ad

Multiple Object Tracking in Drone Aerial Videos by a Holistic Transformer and Multiple Feature Trajectory Matching Pattern

  • 1. Citation: Yuan, Y.; Wu, Y.; Zhao, L.; Pang, Y.; Liu, Y. Multiple Object Tracking in Drone Aerial Videos by a Holistic Transformer and Multiple Feature Trajectory Matching Pattern. Drones 2024, 8, 349. https://doi.org/ 10.3390/drones8080349 Academic Editor: Xiwang Dong Received: 23 June 2024 Revised: 22 July 2024 Accepted: 26 July 2024 Published: 28 July 2024 Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/). drones Article Multiple Object Tracking in Drone Aerial Videos by a Holistic Transformer and Multiple Feature Trajectory Matching Pattern Yubin Yuan , Yiquan Wu *, Langyue Zhao, Yaxuan Pang and Yuqi Liu College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China; harley_yuan@nuaa.edu.cn (Y.Y.); zlangyue@nuaa.edu.cn (L.Z.); hins_pang@nuaa.edu.cn(Y.P.); tolyuqi@nuaa.edu.cn (Y.L.) * Correspondence: imagestrong@nuaa.edu.cn; Tel.: +86-137-7666-7415 Abstract: Drone aerial videos have immense potential in surveillance, rescue, agriculture, and urban planning. However, accurately tracking multiple objects in drone aerial videos faces challenges like occlusion, scale variations, and rapid motion. Current joint detection and tracking methods often compromise accuracy. We propose a drone multiple object tracking algorithm based on a holistic transformer and multiple feature trajectory matching pattern to overcome these challenges. The holistic transformer captures local and global interaction information, providing precise detection and appearance features for tracking. The tracker includes three components: preprocessing, trajectory prediction, and matching. Preprocessing categorizes detection boxes based on scores, with each category adopting specific matching rules. Trajectory prediction employs the visual Gaussian mixture probability hypothesis density method to integrate visual detection results to forecast object motion accurately. The multiple feature pattern introduces Gaussian, Appearance, and Optimal subpattern assignment distances for different detection box types (GAO trajectory matching pattern) in the data association process, enhancing tracking robustness. We perform comparative validations on the vision-meets-drone (VisDrone) and the unmanned aerial vehicle benchmarks; the object detection and tracking (UAVDT) datasets affirm the algorithm’s effectiveness: it obtained 38.8% and 61.7% MOTA, respectively. Its potential for seamless integration into practical engineering applications offers enhanced situational awareness and operational efficiency in drone-based missions. Keywords: multiple object tracking; transformer; detection confidence; multiple feature matching 1. Introduction In recent years, with the rapid development of drone technology, drone aerial videos have become an effective means of acquiring high-resolution, wide-coverage areas and hold significant potential in various applications such as surveillance, rescue operations, agriculture, and urban planning [1]. Drone aerial videos capture a wide range of object categories, including human activities, vehicles, buildings and infrastructure, and natural environments, among others, providing rich data that endow drones with the capability to monitor and track various objects in different application scenarios. 
In this context, multiple object tracking (MOT) has become particularly important for processing drone aerial videos, allowing systems to track and monitor multiple objects, thus enabling a more comprehensive range of applications such as object tracking, behavior analysis, and environmental monitoring. However, multi-object tracking in drone aerial videos faces numerous challenges, including object occlusion, object variations at different scales, rapid object motion, complex environmental conditions, and data noise. Traditional multi-object tracking methods have limitations in addressing these issues, thus requiring more advanced techniques to enhance tracking performance [2]. The majority of multi-object tracking methods for drone aerial videos are based on detection. These methods initially identify objects in each frame using object detection Drones 2024, 8, 349. https://doi.org/10.3390/drones8080349 https://www.mdpi.com/journal/drones
  • 2. Drones 2024, 8, 349 2 of 27 algorithms, then employ data association, motion estimation, and filter updating to resolve occlusion and scale variations. Long-term tracking may involve object re-identification to handle object loss. Maintaining object trajectory information and conducting analyses improves the system’s robustness in complex environments. Additionally, to further improve efficiency, some researchers synchronize detection and tracking, integrating both technologies to address the challenges posed by the wide variety and complex appearances of objects in drone aerial videos. Transformer models, known for their self-attention mechanism and parallel comput- ing capabilities, have revolutionized natural language processing and computer vision [3]. Their versatility extends from vision transformers to full transformer models, and they enable breakthroughs in tasks like image classification, object detection, and semantic segmentation; they even branch into action recognition, object tracking, and scene flow esti- mation. In drone aerial video analyses, transformers offer fresh perspectives for multi-object detection and tracking. Unlike convolutional neural networks, transformers emphasize global context interactions alongside local contexts, enhancing understanding of spatial relationships. However, the computational expense of fine-grained self-attention in high- resolution images poses challenges. Recent studies explore solutions like coarse-grained global or fine-grained local self-attention to alleviate the computational burden, albeit at the cost of simultaneously modeling short- and long-distance visual dependencies [4]. Given these challenges and the transformative potential of transformer models, we are motivated to explore and develop advanced multi-object tracking methods that leverage the strengths of transformers. Therefore, we propose a multi-object tracking method named GAO-Tracker, which is based on a holistic transformer and multiple feature trajectory matching pattern, to address various challenges in drone aerial videos. Our goal is to overcome the limitations of traditional approaches and enhance the performance and robustness of MOT in drone aerial videos, enabling more accurate and reliable tracking in diverse and complex environments. The remaining sections of this paper are organized as follows: Section 2, Related Work, reviews and discusses the latest advancements in the field of multiple object track- ing (MOT). We analyze current mainstream and cutting-edge technologies, including object-feature-based methods, joint detection and tracking methods, and transformer-based methods, providing a solid theoretical foundation and practical background for this re- search. Section 3, Methodology, details our proposed GAO-Tracker method for multi-object tracking. We delve into the core concepts, including the use of a holistic transformer and multiple feature trajectory matching pattern to address various challenges in drone aerial videos. We describe the model structure, algorithm workflow, and implementation details. Section 4, Experiments, presents extensive experiments and performance evalu- ations of GAO-Tracker. We test the method on several public datasets and compare it with state-of-the-art methods. The results demonstrate GAO-Tracker’s superior perfor- mance and robustness in complex scenarios. Section 5, Discussion, provides an in-depth analysis of the experimental results. 
We discuss GAO-Tracker’s performance in different scenarios, analyze its strengths and limitations, and suggest potential improvements and future research directions. Section 6, Conclusion, summarizes the main contributions and findings of this paper. We reiterate GAO-Tracker’s innovations in enhancing multi-object tracking performance in drone aerial videos and discuss its prospects and potential for practical applications. 2. Related Work This section aims to comprehensively review and discuss the latest research advance- ments in the field of multiple object tracking. By deeply analyzing current mainstream and cutting-edge technologies, we establish a solid theoretical foundation and practical back- ground for this study. First, we focus on the basic framework and challenges of multiple object tracking. Then, we detail several core methods: object-feature-based multi-object tracking methods, which achieve continuous tracking by extracting and utilizing the ap-
  • 3. Drones 2024, 8, 349 3 of 27 pearance, motion, and other feature information of objects; joint detection and tracking multi-object methods, which tightly integrate object detection and tracking tasks to en- hance the overall performance and efficiency of the system; and finally, transformer-based multi-object tracking methods, given the transformer model’s outstanding performance in sequence data processing. We explore how these methods utilize attention mechanisms to achieve precise and robust object tracking in complex scenarios. Through this review and analysis, we not only present the latest achievements in the MOT field but also highlight the current research gaps and shortcomings, leading to the research motivation and main contributions of this paper. Our goal is to provide new insights and solutions for the development of multi-object tracking technology. 2.1. Multiple Object Tracking Multi-object tracking is a highly regarded technology, and its wide range of applica- tions has attracted widespread interest among scholars. In the early stages of research, researchers primarily focused on applying optimization algorithms to derive object trajec- tories [5]. The IOUTracker, which relies solely on the bounding box intersection over union (IOU), was the simplest early multi-object tracking method [6]. Researchers gradually introduced motion models and Kalman filters to predict the positions of objects in the next frame [7]. Although these improvements made multi-object tracking algorithms faster and significantly improved their performance, the algorithms performed poorly in complex occlusion and object loss situations. To address these challenges, researchers introduced re-identification (ReID) features as appearance models, using visual features of objects be- tween different frames to match objects and improve the accuracy of associations between trajectories and detection results [8]. In addition to ReID, some studies have utilized image segmentation techniques to identify and track objects, thereby better handling occlusion sit- uations [9]. Furthermore, some researchers have begun to use recurrent neural networks or attention mechanisms to model the spatiotemporal relationships between objects, thereby improving tracking accuracy and stability. However, these methods often employ a single matching approach, neglecting the different characteristics of different types of objects. Moreover, introducing these different technological approaches into tracking systems can result in suboptimal tracking results, limiting effectiveness. 2.2. Object-Feature-Based Multi-Object Tracking Methods Benefiting from the rapid development of object detectors, object feature modeling has become widely used in multi-object tracking algorithms from the perspective of drones. It achieves multi-object tracking by capturing unique features of objects such as color, texture, and optical flow. These extracted features must be distinctive in order to discriminate different objects in the feature space effectively. Once these features are extracted, similarity criteria can be utilized to find the most similar objects in the next frame, thus enabling multi- object tracking. SCTrack adopts a three-stage data association method that combines object appearance models, spatial distances, and explicit occlusion handling units. The system relies on the motion patterns of tracked objects and considers environmental constraints, thus exhibiting good performance in handling occluded objects [10]. 
To address the issue of the subjective setting of fusion ratios between appearance and motion, which often merge appearance similarity and motion consistency in the latest frame, the appearance similarity between objects and surrounding objects is computed, object motion is predicted using Social LSTM networks, and weighted appearance similarity and motion predictions are used to generate associations between the current object and the object in the previous frame [11]. However, due to the significant increase in computational costs, false detections, drone aerial backgrounds, and other issues associated with handling large numbers of object detections and association computations, these methods need to overcome various challenges in maintaining accuracy while mitigating computational costs, false detections, object associations, and so on.
  • 4. Drones 2024, 8, 349 4 of 27 2.3. Joint Detection and Tracking Multi-Object Methods To enhance the computational speed of the entire drone aerial multi-object tracking system, researchers have actively explored methods that combine object detection and feature extraction to achieve greater sharing in computation. JDE was the first attempt at this approach and innovatively integrated the feature extraction branch into the single-stage detector YOLOv3 [12]. Conversely, Fairmont balanced the handling between detection and recognition tasks by adopting the anchor-free detector CenterNet to reduce anchor ambiguities [13]. In addition to these joint detection and feature embedding methods, several other single-stage trackers have emerged. GLOA designed global–local perception blocks to extract scale variance feature information from input frames. Adding identity embedding branches to the prediction heads outputs more discriminative identity informa- tion [14]. CenterTrack [15] and Chained Tracker [16], on the other hand, use multi-frame methods to predict bounding boxes in consecutive frames, facilitating efficient short-term associations that eventually form long-term object trajectories. However, it is essential to note that these technologies often generate many identity switches due to the difficulty of capturing long-term dependencies. Additionally, these methods cannot simultaneously consider multiple features of objects and differences in features among different categories, resulting in the easy loss of tracking for some small objects. 2.4. Transformer-Based Multi-Object Tracking Methods In recent years, transformer-based models have achieved significant success in the field of computer vision, primarily excelling in the domain of object detection. This has given rise to several transformer-based methods making strides in drone multi-object tracking. Some methods based on DETR [17] and its derivative models, such as TransTrack [18], TrackFormer [19], and MOTR [20], represent the front of online tracking and training progress in the field of MOT. Swin-JDE leverages transformers and comprehensively considers three factors—detection confidence, appearance embedding distance, and IoU distance—to match each trajectory and the detection information. Furthermore, MOTR achieves end-to-end object tracking by iteratively updating tracking queries, eliminating the need for complex post-processing steps. MeMOT [21], similar to MOTR, utilizes attention mechanisms to predict by focusing on object states. Despite pioneering new tracking paradigms, these methods still fall short of advanced tracking algorithms. While standard self-attention can capture fine-grained short- and long-distance interactions, executing attention on high-resolution feature maps incurs high computational costs, leading to explosive growth in time and memory costs. This paper addresses this issue through a holistic self-attention module. Therefore, we proposes a multi-object tracking method named GAO-Tracker based on a holistic transformer and multiple feature trajectory matching pattern to address various challenges in drone aerial videos. The effectiveness of the proposed method is validated through a series of experiments and quantitative analyses, and we compare it with excellent methods of the same kind and provide new insights and methods for multi-object tracking in drone applications. 
The main contributions are as follows: (1) A framework named GAO-Tracker, which integrates object detection and tracking in a joint detection and tracking framework for drone aerial videos, is proposed. The framework employs a holistic transformer as the core model for object detection and includes a GAO trajectory matching algorithm based on object features in drone aerial videos to achieve efficient and precise multi-object tracking. (2) The holistic transformer, which combines fine-grained local interactions and coarse- grained global interactions, is proposed. The framework includes an object detector holistic trans-detector using a joint anchor-free detection head to achieve accurate object detection in drone aerial videos. (3) A multi-object trajectory prediction and matching module named the GAO-trajectory matching pattern is proposed; it comprehensively considers the appearance features, mo- tion characteristics, and size features of objects and trajectories. It includes three matching
  • 5. Drones 2024, 8, 349 5 of 27 modes: Gaussian-IOU, Appear-IOU, and OSPA-IOU, fully exploiting various object and trajectory information to achieve robust tracking of multiple objects in drone aerial videos. (4) Using the prior information of the object’s position from the previous frame and combining it with object visual features, a visual Gaussian mixture probability hypothesis density (VGM-PHD) trajectory predictor tailored to the features of drone aerial videos is designed to provide accurate trajectory information for trajectory matching. 3. Methodology The proposed multi-object tracking system for drone aerial videos consists of the holis- tic trans-detector module and the GAO-trajectory matching pattern trajectory association module. The holistic trans-detector model is an anchor-free object detector and feature extraction module that integrates holistic self-attention, combining fine-grained local and coarse-grained global interactions. In this new mechanism, each token finely attends to its nearest surrounding tokens and coarsely attends to its distant surrounding tokens, effectively capturing short-term and long-term visual dependencies. The GAO-trajectory matching pattern trajectory association module handles the data association process by si- multaneously considering detection confidence, appearance embedding distance, and IOU distance, thereby enhancing the tracking robustness of the MOT model. The framework is illustrated in Figure 1. Figure 1. GAO-Tracker framework. 3.1. Holistic Trans-Detector: Object Detection and Feature Extraction In order to adapt to high-resolution visual tasks, high-resolution feature maps can be obtained in the early stages. The entire model adopts a hierarchical design consisting of four stages, each reducing the resolution of the input feature map and expanding the recep- tive field layer by layer, like a CNN. The framework is shown in Figure 2. At the beginning of the input, patch embedding is done, which cuts the image into individual blocks and embeds them into the embedding. Each stage is composed of multiple holistic transformer layers. The specific structure of the holistic transformer layer is shown in Figure 3; it is mainly composed of LayerNorm, MLP (multi-layer perceptron), and holistic attention.
  • 6. Drones 2024, 8, 349 6 of 27 Figure 2. Holistic trans-detector. Figure 3. Holistic transformer. An image with a resolution of H × W × 3 is first divided into blocks of size 4 × 4, resulting in H 4 × W 4 × (4 × 4 × 3) patches. Then, these patches are projected into features of dimension d using a convolutional layer for which the kernel size and stride are both equal to 4. Given this spatial feature map, it is passed through four stages of concatenated holistic transformer layers. In each stage, the holistic transformer block consists of 2, 2, 18, and 2 holistic transformer layers, respectively. The selected configuration aims to capture complex features at different levels of abstraction gradually. In the initial stage, there are two layers, each aimed at capturing low-level features. In the middle stage, 18 layers focus on learning high-level and complex features. In the final stage, two layers refine these features to achieve precise tracking. After each stage, a patch embedding layer is added to reduce the spatial dimensions of the feature map by half while doubling the feature dimension. Finally, the feature maps from all four stages are sent to the detection head, which simultaneously outputs appearance feature vectors of the objects for multi-object trajectory matching. Traditional transformer models face high computational and memory costs with large- scale input data due to the global self-attention mechanism, which considers all tokens in the input sequence. A holistic transformer addresses this by partitioning the input feature map into sub-windows and conducting attention operations on each sub-window, reducing computation and memory usage. For a feature map of size M × N for x ∈ RM×N×d, we first divide it into partitions of size 4 × 4, with each partition serving as a feature perception core in order to perform attention perception within a localized context. Then, we locate the surrounding context for each window instead of individual tokens. Sub-window pooling is a core component of a holistic transformer and divides the input feature map into smaller sub-windows, thereby reducing the number of tokens each attention operation needs to focus on. This segmentation and pooling transforms global attention operations into local operations, making the model more scalable and efficient. The process is illustrated in Figure 4.
  • 7. Drones 2024, 8, 349 7 of 27 Figure 4. Holistic self-attention. We initially partition the feature map into 4 × 4 grids. While the central 4 × 4 grid serves as the query window, we extract tokens at three granularity levels of 1 × 1, 2 × 2, and 4 × 4, respectively, from surrounding regions to serve as its keys and values. This results in tokens with dimensions of 8 × 8, 6 × 6, and 5 × 5. Ultimately, these tokens from the three levels are concatenated to compute the keys and values for the 4 × 4 = 16 tokens (queries) within the window. Suppose the input feature map is denoted as x ∈ RM×N×d, where M × N represents the spatial dimensions, and d represents the feature dimensions. Sub-window pooling is performed in parallel on the feature map at three levels l ∈ {1, 2, 4}, dividing the input feature map x into grids of size l × l for spatial sub-window pooling, followed by a simple linear layer f l p to perform spatial sub-window pooling, as shown in Equation (1). xl = f l p(x̂) ∈ R M l × N l ×d (1) where x̂ = Restructure(x) ∈ R( M l × N l ×d)×(l×l) . The pooled feature maps at different levels l provide rich fine-grained and coarse-grained information. 3.1.1. Attention Computation After obtaining the pooled feature maps at all levels, three linear projection layers fq, fk, and fv are used to compute the query for the first layer and the key and value for all layers, as shown in Equations (2)–(4). Q = fq xl (2) Kl = fk xl (3)
  • 8. Drones 2024, 8, 349 8 of 27 Vl = fv xl (4) To perform holistic self-attention, extracting surrounding tokens for each query token in the feature map is necessary. For the queries within the i-th window Qi ∈ Rsp×sp×d , keys Ki ∈ Rs×d and values Vi ∈ Rs×d are extracted from the surrounding Kl and Vl of the window, where l represents the size of the keys and values, and s is the sum of all holistic regions from all levels, i.e., s = 8 × 8 + 6 × 6 + 5 × 5. Finally, the holistic self-attention for Qi is computed as shown in Equation (5). Attention(Qi, Ki, Vi) = So f tmax QiKT i √ d + B ! Vi (5) where B = {Bl} is a learnable relative position bias. For the first layer, it is parameterized as Bl ∈ R7×7, while for other holistic levels, considering their different granularities towards queries, all queries within the window are treated equally. Bl ∈ Rsl r×sl r is then used to represent the relative position deviation tokens between the query window and each pooled sl r × sl r. The relative position deviation takes into account the positional relationships between different sub-windows. This allows the model to understand the dependencies between different positions better, thus enabling more accurate attention computation. The intro- duction of relative position deviation enhances the flexibility and expressive power of the model, enabling it to adapt better to different types of input data. Since the attention operations for each sub-window are independent, modern hard- ware and parallel computing frameworks can be leveraged to accelerate the model’s training and inference processes. 3.1.2. Detection Head We designed an anchor-free prediction head based on the CenterNet architecture and divided it into detection and appearance branches. Through holistic transformer feature extraction, the output feature map is provided to both branches for object detection and appearance embedding. The detection branch consists of three heads, which are used to predict the heatmap, the offset of the object’s center point, and the object’s size, respectively. The heatmap head is utilized to predict the center position of the object, with an output dimension of h × w × Cls, where h and w represent the height and width of the input feature map, and Cls is the number of detection classes. Each class has its own heatmap output, with each Gaussian peak in the heatmap representing the center position of the detected object. Assuming there are N objects in the current training sample, let ci x, ci y represent the center position of the i-th object in i ∈ [1, N]. Then, the heatmap corresponding to the current training sample is calculated as shown in Equation (6). Mxy = N ∑ i=1 exp  − (x − ⌊ci x 4 ⌋)2 + (y − ⌊ ci y 4 ⌋)2 2σ2 c   f (6) Here, the operator ⌊a⌋ returns the nearest and smallest integer to a, while σc is the standard deviation parameter. M ∈ Rh×w×Cls represents the output of the heatmap head, and Mxy serves as the value of M at position (x, y). The box size and center offset heads are used to predict the BBox and the offset of the object’s center point, respectively. Let BBoxi = xi lt, yi lt, xi rb, yi rb represent the BBox of the i-th object, where xi lt, yi lt and xi rb, yi rb represent the top-left and bottom-right coordinates of the object, respectively. Simultaneously, the offset of the center point of the i-th object is defined as shown in Equation (7).
  • 9. Drones 2024, 8, 349 9 of 27 oi xy ≜ δi x, δi y = ci x 4 − ⌊ ci x 4 ⌋, ci y 4 − ⌊ ci y 4 ⌋ ! (7) This helps improve the accuracy of predicting the center position of the object. The term ô ∈ Rh×w×2 represents the output of the center offset head, and ôi xy represents the offset prediction of the i-th object at position (x, y) on ô. The appearance branch is responsible for generating embedding features that assist in identifying the object. Each head consists of a 3 × 3 convolutional layer with 256 channels, followed by a 1 × 1 convolutional layer to produce the final output. The embedding heads of the appearance branch calculate the appearance feature vectors of the object, which are used in the association matching operation for multi-object tracking tasks. Specifically, these appearance feature vectors can be used for association matching to calculate the similarity between the tracker and the detected object. A 128-dimensional vector at position (x, y) represents the appearance feature vector of the object at that location. 3.2. GAO Trajectory Matching Pattern Our GAO trajectory matching pattern considers detection confidence, appearance embedding distance, and IoU distance to associate all tracking trajectories with all detection Bboxes. Figure 5 illustrates the architecture of the module. When receiving detection results from the detector output, we add detection Bboxes with confidence scores higher than 0.5 to the high-score detection Bbox set, and those between 0.2 and 0.5 are added to the low-score detection Bbox set. Figure 5. GAO trajectory matching module. Initially, predicted trajectories are matched with high-score detection boxes using the Appear-IOU matching method. Unmatched trajectories then undergo secondary matching with low-score detection boxes via Gau-IOU matching, with any remaining unmatched low-score boxes removed. Subsequently, high-score detection boxes that were not initially matched are re-evaluated using (optimal subpattern assignment) OSPA-IOU matching with previously unmatched trajectories from the previous frame. High-score boxes unmatched after both attempts are considered new trajectories, while trajectories that have been continuously unmatched for 30 frames are removed from tracking, with flexibility to adjust based on the video frame rate. Successful matches update tracking through the update process with matched de- tection frames. Trajectory prediction involves modeling visual objects’ trajectories as a random finite set, utilizing the visual Gaussian mixture probability hypothesis density
Trajectory prediction models the trajectories of visual objects as a random finite set and uses the visual Gaussian mixture probability hypothesis density filter to generate prediction information for the tracker, which primes the model for the next frame's association matching. The data association process employs four distance metrics, leading to the design of three matching methods: Gau-IOU, Appear-IOU, and OSPA-IOU distance matching.

3.2.1. Appear-IOU Distance Matching

Appear-IOU trajectory matching considers both the appearance and the spatial location of objects and predicted trajectories: it uses as metrics the cosine distance between appearance embeddings and the IOU distance between boxes, computed for all pairs of predicted trajectories and high-score detections. The appearance vector of an object carries rich appearance information, which is combined with the IOU distance of the BBox to improve the matching accuracy between detection boxes and trajectories. The process is shown in Figure 6.

Figure 6. Appear-IOU trajectory matching.

Let $(BBox_d^i, E_d^i)$ denote the detection BBox of the i-th detected object in the current frame and its corresponding feature vector, and let $(BBox_t^j, E_t^j)$ denote the BBox predicted for the j-th trajectory from the previous frame and its corresponding feature vector. The first distance metric, $D_{ij}^{I}$, is computed from the IOU distance:

$D_{ij}^{I} = 1 - \frac{\mathrm{area}(BBox_d^i \cap BBox_t^j)}{\mathrm{area}(BBox_d^i \cup BBox_t^j)}$  (8)

where $\mathrm{area}(A)$ is the area of the set $A$, and $\cap$ and $\cup$ denote the intersection and union of two sets. The appearance distance metric $D_{ij}^{A}$ is the cosine distance between the two embedding feature vectors:

$D_{ij}^{A} = 1 - \frac{E_d^i \cdot E_t^j}{\lVert E_d^i \rVert\, \lVert E_t^j \rVert}$  (9)

where $\cdot$ denotes the dot product of two vectors and $\lVert \cdot \rVert$ the 2-norm. The IOU distances and appearance distances between all detections and trajectories are then combined in a weighted manner to obtain the Appear-IOU distance:

$D^{AI} = \alpha D_{ij}^{A} + (1 - \alpha) D_{ij}^{I}$  (10)

where $\alpha$ is the weight of the cosine distance and takes values between 0 and 1.
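A minimal sketch of Equations (8)-(10) follows, together with the Hungarian assignment used in the next step. It assumes boxes in (x1, y1, x2, y2) form and row-wise embedding matrices; the function names and the illustrative gating threshold are ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(det_boxes, trk_boxes):
    """Pairwise IoU between (N, 4) detection and (M, 4) trajectory boxes."""
    d = det_boxes[:, None, :]          # (N, 1, 4)
    t = trk_boxes[None, :, :]          # (1, M, 4)
    ix1 = np.maximum(d[..., 0], t[..., 0]); iy1 = np.maximum(d[..., 1], t[..., 1])
    ix2 = np.minimum(d[..., 2], t[..., 2]); iy2 = np.minimum(d[..., 3], t[..., 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_d = (d[..., 2] - d[..., 0]) * (d[..., 3] - d[..., 1])
    area_t = (t[..., 2] - t[..., 0]) * (t[..., 3] - t[..., 1])
    return inter / (area_d + area_t - inter + 1e-9)

def appear_iou_cost(det_boxes, det_embs, trk_boxes, trk_embs, alpha=0.5):
    """Weighted Appear-IOU cost of Equations (8)-(10); alpha weights the cosine distance."""
    d_iou = 1.0 - iou_matrix(det_boxes, trk_boxes)                       # Equation (8)
    det_n = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
    trk_n = trk_embs / np.linalg.norm(trk_embs, axis=1, keepdims=True)
    d_app = 1.0 - det_n @ trk_n.T                                        # Equation (9)
    return alpha * d_app + (1.0 - alpha) * d_iou                         # Equation (10)

def hungarian_match(cost, max_cost=0.8):
    """Solve the assignment; max_cost is an illustrative gate, not a value from the paper."""
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
```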
Finally, all Appear-IOU distances are assembled into a cost matrix, and the Hungarian algorithm is employed to obtain the best match. Unmatched trajectories undergo secondary matching with low-score detections through the Gau-IOU matching model, while unmatched high-score detection boxes undergo secondary matching with inactive trajectories through the OSPA-IOU matching model.

3.2.2. Gau-IOU Distance Matching

The Gau-IOU distance matching process is illustrated in Figure 7. Low-score detection boxes often correspond to small objects. To better capture their features, both the low-score detections and the trajectories to be matched are transformed into Gaussian space, and the Wasserstein distance (WD) between the resulting Gaussian distributions is combined with the IOU distance.

Figure 7. Gau-IOU trajectory matching.

We first transform the BBoxes of the object and the trajectory into Gaussian space. For a box represented by $(x, y, h, w)$, the parameters of the Gaussian distribution $N(x \mid \mu, \Sigma)$ are computed as:

$\mu = [x, y]^{T}$  (11)

$\Sigma = \begin{bmatrix} w^2/4 & 0 \\ 0 & h^2/4 \end{bmatrix}$  (12)

The key to matching detection boxes with trajectories is how to measure the similarity between the Gaussian distribution $N_d(x_d \mid \mu_d, \Sigma_d)$ of the detection box and $N_t(x_t \mid \mu_t, \Sigma_t)$ of the trajectory box. We use the Wasserstein distance between the two Gaussian distributions, which is defined as:

$D_W(N_d, N_t) = \lVert \mu_d - \mu_t \rVert_2^2 + \mathrm{Tr}(\Sigma_d) + \mathrm{Tr}(\Sigma_t) - 2\,\mathrm{Tr}\!\left(\left(\Sigma_d^{1/2} \Sigma_t \Sigma_d^{1/2}\right)^{1/2}\right)$  (13)

The Wasserstein distance consists of two components: the distance between the center points, determined by $(x, y)$, and a coupling term related to $(h, w)$. Because these parameters form a chain-like coupling relationship and influence one another, the Wasserstein distance is highly advantageous for high-precision matching. Next, the IOU distance and the WD between all detections and trajectories are weighted to obtain the Gau-IOU distance:

$D^{GI} = \beta D_W + (1 - \beta) D_{ij}^{I}$  (14)

where $\beta$ is the weight of the WD and takes values between 0 and 1. Finally, the Hungarian algorithm is employed to obtain the best matching between detections and trajectories based on all Gau-IOU distances. Unmatched trajectories are converted to inactive trajectories, and unmatched low-score detections are removed.
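The sketch below illustrates the Gaussian conversion of Equations (11)-(12), the Wasserstein distance of Equation (13), and the fusion of Equation (14). It assumes centre-format (x, y, w, h) boxes, and in practice the Wasserstein term is usually normalized to [0, 1] before being mixed with the IoU distance; all names are ours.

```python
import numpy as np
from scipy.linalg import sqrtm

def box_to_gaussian(box):
    """Map a centre-format (x, y, w, h) box to (mu, Sigma), cf. Equations (11)-(12)."""
    x, y, w, h = box
    mu = np.array([x, y], dtype=np.float64)
    sigma = np.diag([w ** 2 / 4.0, h ** 2 / 4.0])
    return mu, sigma

def wasserstein2(mu_d, sig_d, mu_t, sig_t):
    """Squared 2-Wasserstein distance between two Gaussians, cf. Equation (13)."""
    root_d = sqrtm(sig_d)
    cross = sqrtm(root_d @ sig_t @ root_d)
    return (np.sum((mu_d - mu_t) ** 2)
            + np.trace(sig_d) + np.trace(sig_t)
            - 2.0 * np.trace(np.real(cross)))

def gau_iou_cost(d_wd, d_iou, beta=0.5):
    """Fuse the (suitably normalized) Wasserstein and IoU distances, cf. Equation (14)."""
    return beta * d_wd + (1.0 - beta) * d_iou

# Example: distance between a low-score detection and a predicted trajectory box.
mu_d, sig_d = box_to_gaussian((100.0, 50.0, 20.0, 40.0))
mu_t, sig_t = box_to_gaussian((104.0, 53.0, 22.0, 38.0))
d_wd = wasserstein2(mu_d, sig_d, mu_t, sig_t)
```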
3.2.3. OSPA-IOU Distance Matching

The OSPA distance takes subpattern matching of object trajectories into account, enabling the model to better capture both the similarities and the differences between object trajectories and, in turn, providing a more accurate assessment of tracking performance. Building on IOU distance matching, we comprehensively consider the OSPA distance and propose the OSPA-IOU trajectory matching model. The process is illustrated in Figure 8.

Figure 8. OSPA-IOU trajectory matching.

Assume the object state set is $X = \{x_1, x_2, \ldots, x_m\}$ and the object trajectory set is $Y = \{y_1, y_2, \ldots, y_n\}$, where $m, n \in \mathbb{N}_0 = \{0, 1, 2, \ldots\}$ represent the estimated and true numbers of objects, respectively. The OSPA distance is expressed as:

$D_{p,c}(X, Y) = \left[\frac{1}{n}\left(\min_{\pi \in \Pi_n} \sum_{i=1}^{m} d_c\big(x_i, y_{\pi(i)}\big)^p + (n - m)\,c^p\right)\right]^{1/p}$  (15)

where $\Pi_n$ is the set of all permutations that select $m$ indices from $\{1, 2, \ldots, n\}$, and $d_c(x, y) = \min(d(x, y), c)$ is the base distance truncated at the cut-off $c$. If $p = 1$, the OSPA distance can be decomposed as:

$D_{p,c}(X, Y) = e_{p,c}^{loc}(X, Y) + e_{p,c}^{card}(X, Y)$  (16)

$e_{p,c}^{loc}(X, Y) = \left[\frac{1}{n}\min_{\pi \in \Pi_n} \sum_{i=1}^{m} d_c\big(x_i, y_{\pi(i)}\big)^p\right]^{1/p}$  (17)

$e_{p,c}^{card}(X, Y) = \left[\frac{1}{n}(n - m)\,c^p\right]^{1/p}$  (18)

where $e_{p,c}^{loc}(X, Y)$ and $e_{p,c}^{card}(X, Y)$ represent the positional difference and the cardinality difference between the sets of estimated and true object states, respectively. The positional difference reflects the spatial gap, while the cardinality difference captures performance aspects such as the false-track proportion, redundancy, and interruptions. The truncation parameter $c$ adjusts the balance between positional and cardinality differences, with smaller values prioritizing positional differences.

Treating each detection as a single-element set and each trajectory as a multi-element set, we compute the OSPA distance between them to optimize the matching between individual detections and trajectories. Subsequently, the IOU distance and the OSPA distance between all detections and trajectories are weighted to derive the OSPA-IOU distance:

$D^{OI} = \lambda D_{p,c} + (1 - \lambda) D_{ij}^{I}$  (19)

where $\lambda$ is the weight of the OSPA distance and takes values between 0 and 1. Finally, the Hungarian algorithm is applied to obtain the best matching between detections and trajectories based on all OSPA-IOU distances. Unmatched high-score detections are initialized as new trajectories, and inactive trajectories that remain unmatched for 30 frames are removed.
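For reference, here is a small sketch of the OSPA distance for p = 1 (Equations (16)-(18)), using Euclidean base distances between point states and the Hungarian algorithm for the optimal permutation. The function name and the handling of the m > n case by symmetry are our assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ospa_p1(X, Y, c=1.0):
    """OSPA distance for p = 1 between two point sets (cf. Equations (16)-(18)).

    X : (m, d) estimated states, Y : (n, d) trajectory states; c is the cut-off
    that also penalises the cardinality mismatch.
    """
    X = np.asarray(X, dtype=np.float64)
    Y = np.asarray(Y, dtype=np.float64)
    m, n = len(X), len(Y)
    if m == 0 and n == 0:
        return 0.0
    if m > n:                       # the metric is symmetric, so swap to keep m <= n
        X, Y, m, n = Y, X, n, m
    if m == 0:                      # one set empty: pure cardinality penalty
        return c
    d = np.minimum(np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1), c)
    rows, cols = linear_sum_assignment(d)          # optimal permutation
    loc = d[rows, cols].sum()                      # localisation term
    card = (n - m) * c                             # cardinality term
    return (loc + card) / n
```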
3.2.4. Visual Gaussian Mixture Probability Hypothesis Density

The visual Gaussian mixture probability hypothesis density (VGM-PHD) filtering algorithm uses the center positions of all trajectories as the measurement input of the random finite set, while preserving object ID and size data to reconstruct trajectories. Its assumptions are that the PHDs of both spawned and newly born objects are Gaussian mixtures, that object detection and survival probabilities are independent, and that the state transition density and observation likelihood functions are linear Gaussian models. Both the motion model and the observation model of the VGM-PHD filter are therefore linear, and noise and errors follow Gaussian distributions. Using the weights, means, and covariances of the PHD Gaussian components, the algorithm iteratively propagates the multi-object state. The specific implementation steps of the VGM-PHD filtering algorithm are as follows.

Assume the posterior PHD at time $k-1$ is given by the Gaussian sum

$T_{k-1}(x) = \sum_{i=1}^{J_{k-1}} \omega_{k-1}^{i}\, N\big(x;\, m_{k-1}^{i}, P_{k-1}^{i}\big)$  (20)

where $\omega_k^i$, $m_k^i$, and $P_k^i$ are the weight, mean, and covariance of the i-th Gaussian component at time $k$ for a single object state $x$, $J_k$ is the number of Gaussian components at time $k$, and $N(\cdot)$ denotes a Gaussian density. The predicted intensity function at time $k$ is given by:

$T_{k|k-1}(x) = T_{S,k|k-1}(x) + T_{\beta,k|k-1}(x) + \gamma_k(x)$  (21)

The three terms on the right-hand side represent the predicted PHDs of surviving objects, spawned objects, and newly born objects, respectively. The intensity function obtained from the GM-PHD update can be expressed as:

$T_k(x) = (1 - P_{D,k})\, T_{k|k-1}(x) + \sum_{z \in Z_k} T_{D,k}(x; z)$  (22)

where the first term is the PHD of missed objects and the second term is the updated PHD of detected objects. In the VGM-PHD filter, if the PHD at time $k-1$ is a Gaussian mixture, then both the prior produced by the prediction at time $k$ and the posterior obtained by the update can also be represented in Gaussian mixture form. The weights are obtained through PHD filtering, while the means and covariances are propagated recursively by Kalman filtering. During the prediction and update of the object PHD in VGM-PHD, the predicted number of objects $N_{k|k-1}$ and the updated number of objects $N_k$ are given by:

$N_{k|k-1} = \sum_{i=1}^{J_{k|k-1}} \omega_{k|k-1}^{i} = N_{k-1} P_{S,k} + \sum_{i=1}^{J_{\beta,k}} \omega_{\beta,k}^{i} + \sum_{j=1}^{J_{\gamma,k}} \omega_{\gamma,k}^{j}$  (23)

$N_k = \sum_{n=1}^{J_k} \omega_k^{n} = N_{k|k-1}(1 - P_{D,k}) + \sum_{z \in Z_k} \sum_{j=1}^{J_{k|k-1}} \omega_k^{j}(z)$  (24)

4. Experiments

4.1. Dataset and Evaluation Metrics

The proposed algorithm undergoes comprehensive evaluations on the VisDrone MOT [22] and UAVDT [23] datasets, which encompass diverse drone-captured scenes and facilitate a thorough assessment of the proposed method's practical effectiveness. Extensive evaluations compare the algorithm with other leading multi-object trackers across various scenarios and conditions. Established MOT evaluation metrics are used to assess performance comprehensively, with the aim of gauging overall effectiveness and pinpointing potential weaknesses in each model. The metrics include:

(1) FP (↓): number of false positives in the entire video.
(2) FN (↓): number of false negatives in the entire video.
(3) IDSW (↓): number of identity switches in the entire video.
(4) FM (↓): number of ground-truth trajectories interrupted during tracking.
(5) IDF1 (↑): ratio of correctly identified detections to the computed detections and ground truth.
(6) MOTA (↑): combines FP, FN, and IDSW:

$\mathrm{MOTA} = 1 - \frac{FN + FP + IDSW}{GT}$  (25)

where $GT$ is the total number of ground-truth objects.

(7) MOTP (↑): measures the mismatch between the ground truth and the predicted results:

$\mathrm{MOTP} = 1 - \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t}$  (26)

where $d_{t,i}$ is the localization error of the i-th matched pair in frame $t$ and $c_t$ is the number of matches in frame $t$.

These metrics contribute to a comprehensive assessment of MOT algorithm performance in various aspects, providing in-depth insight into system effectiveness.

4.2. Training Preprocessing

Existing MOT methods that integrate object detection and appearance embedding often use a single-stage training approach in which the detection and appearance branches are trained simultaneously. While this reduces training time, it can harm detection performance because the two branches have different learning objectives. In densely populated scenes, fully occluded objects may still have annotated bounding boxes in the training dataset, which introduces errors when learning appearance embeddings and reduces tracking accuracy. To address this, our model filters highly occluded objects out of the training samples before model training begins.

To implement this, we first define a metric $B_{overlap} \in [0, 1]$ to gauge the overlap between two ground-truth BBoxes:

$B_{overlap} = \frac{\mathrm{area}\big(BBox_{GT}^{i} \cap BBox_{GT}^{j}\big)}{\mathrm{area}\big(BBox_{GT}^{i} \cup BBox_{GT}^{j}\big)}$  (27)

where $BBox_{GT}^{i}$ and $BBox_{GT}^{j}$ are the i-th and j-th ground-truth BBoxes of an input training sample. A higher value indicates greater overlap between the two ground-truth BBoxes. In object detection, $B_{overlap} \ge 0.75$ signifies substantial overlap between two BBoxes. We therefore set the threshold at $B_{overlap} \ge 0.75$, treat the smaller of the two BBoxes as an occluded object, and exclude it from the training dataset. The model is ultimately trained on the filtered dataset.
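A minimal sketch of this occlusion filter under Equation (27): pairs of ground-truth boxes whose overlap reaches 0.75 have the smaller, presumably occluded, box dropped before training. Corner-format boxes and the function name are our assumptions.

```python
import numpy as np

def filter_occluded(gt_boxes, thr=0.75):
    """Drop the smaller box of any ground-truth pair whose overlap
    reaches the threshold (Equation (27) with B_overlap >= 0.75)."""
    boxes = np.asarray(gt_boxes, dtype=np.float64)   # (N, 4) as (x1, y1, x2, y2)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = np.ones(len(boxes), dtype=bool)
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            x1 = max(boxes[i, 0], boxes[j, 0]); y1 = max(boxes[i, 1], boxes[j, 1])
            x2 = min(boxes[i, 2], boxes[j, 2]); y2 = min(boxes[i, 3], boxes[j, 3])
            inter = max(x2 - x1, 0.0) * max(y2 - y1, 0.0)
            overlap = inter / (areas[i] + areas[j] - inter + 1e-9)
            if overlap >= thr:
                # treat the smaller of the two boxes as the occluded one
                keep[j if areas[j] < areas[i] else i] = False
    return boxes[keep]
```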
4.3. Experimental Settings

The detector is initialized with weights pre-trained on the COCO dataset. We train it using SGD for 150 epochs with a batch size of 16, a learning rate of 0.02, momentum of 0.9, and weight decay of 0.0001. The detector is trained on both the VisDrone and UAVDT datasets, and validation is performed on the same set of verification images. Testing is executed on an NVIDIA RTX 4090 GPU with 24 GB of memory, and we calculate the average of the top-100 most reliable detection results.

4.4. Comparative Experiments

4.4.1. Detection Comparison

To compare the performance of our detector, we select seven strong detectors: DETR [17], Deformable DETR [24], YOLOS [25], Swin-JDE [18], ViTDet [26], RTD-Net [27], and DN-DETR [28]. They are trained and evaluated on the VisDrone and UAVDT datasets using the experimental settings described in their respective papers. DETR completely discards traditional object detection components such as anchor boxes and non-maximum suppression and uses a pure attention mechanism for end-to-end object detection. Deformable DETR is an improved version of DETR that introduces deformable attention to enhance the model's adaptability to changes in object shape and scale. YOLOS employs a small feature extractor, skip connections, cascaded skip connections, and a reshaping pass-through layer to facilitate cross-network feature reuse, combining low-level positional information with more meaningful high-level information. Swin-JDE adopts a Swin transformer based on windowed self-attention as the backbone network to enhance feature extraction. ViTDet uses ViT as the backbone of a Mask R-CNN object detection model and improves competitiveness by optimizing the RPN. RTD-Net replaces positional linear projection with convolutional projection and uses an efficient convolutional multi-head self-attention mechanism based on convolutional transformer blocks to improve the recognition of occluded objects by extracting contextual information. DN-DETR introduces a denoising training approach that addresses the instability of bipartite graph matching in the DETR decoder during training, doubling the convergence speed and significantly improving detection results. The comparative results in Table 1 demonstrate the substantial advantages of our detector.
AP is the average precision; AP@0.5 and AP@0.75 are computed at intersection-over-union thresholds of 0.5 and 0.75, respectively. APs, APm, and APl are the average precisions for small objects (area smaller than 32 × 32 pixels), medium objects (area between 32 × 32 and 96 × 96 pixels), and large objects (area larger than 96 × 96 pixels), respectively.
The visual comparison results in Figures 9 and 10 show that our method performs well under various lighting conditions and in crowded environments.

Figure 9. Comparison of detection results on the VisDrone dataset: (a) DETR, (b) Deformable DETR, (c) YOLOS, (d) Swin-JDE, (e) ViTDet, (f) RTD-Net, (g) DN-DETR, (h) Holistic Trans-Det.
Figure 10. Comparison of detection results on the UAVDT dataset: (a) DETR, (b) Deformable DETR, (c) YOLOS, (d) Swin-JDE, (e) ViTDet, (f) RTD-Net, (g) DN-DETR, (h) Holistic Trans-Det.
Table 1. Detection results of the compared detectors on the two datasets.

Dataset  | Detector             | AP   | AP@0.5 | AP@0.75 | APs  | APm  | APl
---------|----------------------|------|--------|---------|------|------|-----
VisDrone | DETR [17]            | 34.8 | 63.4   | 32.2    | 12.8 | 38.5 | 55.6
VisDrone | Deformable DETR [24] | 36.9 | 60.4   | 35.2    | 9.9  | 38.1 | 52.7
VisDrone | YOLOS [25]           | 36.6 | 63.1   | 38.7    | 15.4 | 39.9 | 54.9
VisDrone | Swin-JDE [18]        | 38.2 | 60.5   | 34.8    | 11.1 | 41.4 | 57.6
VisDrone | ViTDet [26]          | 38.9 | 64.7   | 38.7    | 19.6 | 40.5 | 57.8
VisDrone | RTD-Net [27]         | 38.1 | 64.6   | 40.2    | 17.6 | 42.8 | 57.6
VisDrone | DN-DETR [28]         | 39.4 | 63.4   | 36.5    | 16.8 | 42.5 | 59.2
VisDrone | Holistic Trans-Det   | 39.6 | 67.9   | 40.8    | 18.6 | 40.3 | 59.4
UAVDT    | DETR [17]            | 48.8 | 69.3   | 49.3    | 28.0 | 47.5 | 57.1
UAVDT    | Deformable DETR [24] | 47.2 | 69.2   | 50.3    | 29.0 | 53.2 | 59.4
UAVDT    | YOLOS [25]           | 49.3 | 71.1   | 51.4    | 32.3 | 50.4 | 58.9
UAVDT    | Swin-JDE [18]        | 49.6 | 69.9   | 52.8    | 33.9 | 54.8 | 59.7
UAVDT    | ViTDet [26]          | 54.6 | 68.9   | 59.5    | 37.5 | 57.9 | 61.0
UAVDT    | RTD-Net [27]         | 52.2 | 71.4   | 55.6    | 36.3 | 57.2 | 60.9
UAVDT    | DN-DETR [28]         | 56.7 | 68.6   | 60.2    | 38.7 | 59.8 | 62.9
UAVDT    | Holistic Trans-Det   | 57.5 | 69.0   | 60.5    | 38.8 | 61.5 | 67.9

4.4.2. Tracking Comparison

We compared GAO-Tracker with motion-based methods, namely DeepSORT [29], ByteTrack [30], BoT-SORT [31], UAVMOT [32], DCMOT [33], TFAM [34], MTTJDT [35], and SimpleTrack [36], as well as with transformer-based methods, including TransTrack [37], TrackFormer [38], TransCenter [39], MOTR [20], MeMOT [21], GTR [40], TR-MOT [41], GCEVT [42], STN-Track [43], and STDFormer [19]. These comparisons were conducted on the VisDrone MOT and UAVDT datasets.

To ensure consistent comparisons despite variations in object distributions across datasets, we employed the holistic trans-detector to produce uniform detection results for all tracking comparison methods. This mitigates evaluation bias stemming from uneven category distributions, fostering fairer and more reliable comparisons of the tracking methods. To maintain detection accuracy across categories during evaluation, distinct score thresholds were applied: 0.3 for cars, 0.1 for trucks, and 0.4 for pedestrians, with a lower threshold of 0.05 for buses, which present greater visual variability.

Tables 2 and 3 comprehensively compare GAO-Tracker with other popular trackers on the VisDrone MOT and UAVDT datasets. The evaluation covers critical metrics such as MOTA, MOTP, IDF1, and IDSW. GAO-Tracker demonstrates excellent performance by effectively utilizing position and appearance information. DeepSORT associates categories independently using positional information. ByteTrack utilizes low-score detections for similarity tracking and background-noise filtering. BoT-SORT incorporates camera motion compensation for improved matching. UAVMOT enhances object feature association with an ID feature update module. SimpleTrack merges object embedding cosine and GIOU distances to create a new association matrix. Among the transformer-based methods, TransTrack employs a query-key mechanism for tracking existing objects and detecting new ones. TrackFormer considers position, occlusion, and object identity features simultaneously. TransCenter predicts the association heatmap of object centers globally. MOTR models the entire trajectory of an object with a tracking query. MeMOT uses information from previous frames as tracking cues. GTR extends the matching window length and fully utilizes interaction information. TR-MOT achieves reliable associations using visual temporal features. STDFormer uses the transformer's long-range modeling capability to extract intent and decision information. However, these methods apply a single matching rule to all detection classes, which leads to inaccurate tracking of some object classes and poorer overall performance.
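To make the evaluation protocol above concrete, a tiny sketch of the per-class score filtering is given here; the thresholds are those quoted in the text, while the dictionary and function names are our own.

```python
# Hypothetical per-class score filtering used when preparing detections for the
# tracking comparison; thresholds follow the values quoted above.
CLASS_THRESHOLDS = {"car": 0.3, "truck": 0.1, "pedestrian": 0.4, "bus": 0.05}

def filter_by_class(dets, default_thr=0.3):
    """Keep a detection only if its score clears the threshold of its class."""
    return [d for d in dets
            if d["score"] >= CLASS_THRESHOLDS.get(d["class"], default_thr)]
```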
Table 2. Comparison between GAO-Tracker and recent multiple object trackers on the VisDrone dataset.

Type              | Tracker          | MOTA↑ | MOTP↑ | IDF1 (%)↑ | IDSW↓ | MT (%)↑ | ML (%)↑ | FP↓    | FN↓
------------------|------------------|-------|-------|-----------|-------|---------|---------|--------|--------
Motion-based      | DeepSORT [29]    | 19.4  | 69.8  | 33.1      | 6387  | 38.8    | 52.2    | 15,181 | 44,830
Motion-based      | ByteTrack [30]   | 25.1  | 72.6  | 40.8      | 4590  | 42.8    | 50.3    | 10,722 | 24,376
Motion-based      | BoT-SORT [31]    | 23.0  | 71.6  | 41.4      | 7014  | 51.9    | 73.6    | 10,701 | 47,922
Motion-based      | UAVMOT [32]      | 25.0  | 72.3  | 40.5      | 6644  | 52.6    | 49.6    | 10,134 | 55,630
Motion-based      | DCMOT [33]       | 33.5  | 76.1  | 45.5      | 1139  | -       | -       | 12,594 | 64,856
Motion-based      | TFAM [34]        | 30.9  | 74.4  | 42.7      | 3998  | -       | -       | 27,732 | 126,811
Motion-based      | MTTJDT [35]      | 31.2  | 73.2  | 43.6      | 2415  | -       | -       | 25,976 | 183,381
Transformer-based | TransTrack [37]  | 27.3  | 62.1  | 28.3      | 2523  | 33.5    | 59.7    | 15,028 | 51,396
Transformer-based | TrackFormer [38] | 24.0  | 77.3  | 38.0      | 4724  | 39.0    | 46.3    | 11,731 | 32,807
Transformer-based | TransCenter [39] | 29.9  | 66.6  | 46.8      | 3446  | 33.4    | 61.8    | 15,104 | 20,894
Transformer-based | MOTR [20]        | 13.1  | 72.4  | 47.1      | 2997  | 52.9    | 72.0    | 12,216 | 42,186
Transformer-based | MeMOT [21]       | 29.4  | 73.0  | 48.7      | 3755  | 46.7    | 47.9    | 9963   | 30,062
Transformer-based | GTR [40]         | 28.1  | 76.8  | 54.5      | 2000  | 61.3    | 57.6    | 8165   | 10,553
Transformer-based | TR-MOT [41]      | 29.9  | 64.3  | 46.0      | 1005  | 42.8    | 59.9    | 7593   | 17,352
Transformer-based | GCEVT [42]       | 34.5  | 73.8  | 50.6      | 841   | 520     | 612     | -      | -
Transformer-based | STN-Track [43]   | 38.6  | -     | 73.7      | 668   | 31.4    | 51.2    | 7385   | 76,006
Transformer-based | STDFormer [19]   | 35.9  | 74.5  | 59.9      | 1441  | 52.7    | 60.3    | 8527   | 20,558
Ours              | GAO-Tracker      | 38.8  | 76.3  | 54.3      | 972   | 55.9    | 52.4    | 6883   | 10,204

Table 3. Comparison between GAO-Tracker and recent multiple object trackers on the UAVDT dataset.

Type              | Tracker          | MOTA↑ | MOTP↑ | IDF1 (%)↑ | IDSW↓ | MT (%)↑ | ML (%)↑ | FP↓    | FN↓
------------------|------------------|-------|-------|-----------|-------|---------|---------|--------|--------
Motion-based      | DeepSORT [29]    | 35.9  | 71.5  | 58.3      | 698   | 43.4    | 25.7    | 50,513 | 59,733
Motion-based      | ByteTrack [30]   | 39.1  | 74.3  | 44.7      | 2341  | 43.8    | 28.1    | 14,468 | 87,485
Motion-based      | BoT-SORT [31]    | 37.2  | 72.1  | 53.1      | 1692  | 40.8    | 27.3    | 42,286 | 64,494
Motion-based      | UAVMOT [32]      | 43.0  | 73.5  | 61.5      | 641   | 45.3    | 22.7    | 27,832 | 65,467
Motion-based      | SimpleTrack [36] | 45.3  | 73.9  | 57.1      | 1404  | 43.6    | 22.5    | 21,153 | 53,448
Motion-based      | TFAM [34]        | 47.0  | 72.9  | 67.8      | 506   | -       | -       | 68,282 | 111,959
Transformer-based | TransTrack [37]  | 33.2  | 72.4  | 67.6      | 1122  | 38.9    | 23.8    | 50,746 | 54,938
Transformer-based | TrackFormer [38] | 53.4  | 74.2  | 46.3      | 2247  | 43.7    | 23.3    | 13,719 | 91,061
Transformer-based | TransCenter [39] | 48.9  | 73.9  | 51.3      | 2287  | 32.6    | 35.1    | 27,995 | 93,013
Transformer-based | MOTR [20]        | 35.6  | 72.5  | 56.1      | 1759  | 39.8    | 29.3    | 39,733 | 56,368
Transformer-based | MeMOT [21]       | 45.6  | 74.6  | 62.8      | 2118  | 34.9    | 26.5    | 38,933 | 59,156
Transformer-based | GTR [40]         | 46.5  | 75.3  | 61.1      | 1482  | 42.7    | 18.6    | 21,676 | 52,617
Transformer-based | TR-MOT [41]      | 57.7  | 74.1  | 55.7      | 2461  | 33.9    | 21.3    | 32,217 | 50,838
Transformer-based | GCEVT [42]       | 47.6  | 73.4  | 68.6      | 1801  | 618     | 363     | -      | -
Transformer-based | STN-Track [43]   | 60.6  | -     | 73.1      | 1420  | 57.0    | 17.0    | 12,825 | 61,760
Transformer-based | STDFormer [19]   | 60.6  | 74.8  | 61.7      | 1642  | 44.6    | 20.3    | 20,258 | 41,895
Ours              | GAO-Tracker      | 61.7  | 75.2  | 67.9      | 1216  | 45.3    | 24.6    | 24,915 | 59,640

Combining the data in Tables 2 and 3, we observe that transformer-based methods outperform motion-based methods. This trend reflects the effectiveness and superiority of transformer-based methods for multi-object tracking in drone aerial videos: they better capture long-distance dependencies between objects in complex environments and better handle challenges such as object occlusion and scale changes.
Figures 11 and 12 show time-ordered frames with bounding boxes and identities in different colors. In the initial frames (left), bounding boxes may appear inconsistent due to occlusion. In the final frames (right), however, GAO-Tracker maintains consistent bounding boxes, reducing pedestrian identity switching. The center frames show intermediate steps in which identities might temporarily switch because of occlusions or overlaps. The final frames demonstrate GAO-Tracker's ability to preserve identities throughout the sequence, even in crowded scenes. By utilizing object motion information, GAO-Tracker's trajectory association effectively mitigates the missed and incorrect detections caused by occlusion, especially for objects that overlap briefly. Compared with previous algorithms based on bounding-box connections, GAO-Tracker reduces pedestrian identity switching. The results indicate that GAO-Tracker performs well in crowded drone aerial scenes and maintains consistent bounding boxes and identities throughout the entire sequence.

Figure 11. Tracking results of GAO-Tracker on the VisDrone dataset.

Figure 12. Tracking results of GAO-Tracker on the UAVDT dataset.

4.5. Ablation Experiments

To demonstrate the effectiveness of the designed method, we conducted multiple sets of ablation experiments on the VisDrone and UAVDT datasets, covering the training preprocessing strategy, the GAO module, the sequence of matching strategies, and VGM-PHD.
4.5.1. Effect of Backbone

To validate the effectiveness of our holistic transformer as the backbone network, we compared it with ResNet-50, DLA-34, ViT, and Swin-L in ablation experiments. Table 4 presents the performance of the proposed GAO-Tracker combined with different backbone networks. This experiment used the proposed data association method as the post-processing module and evaluated the UAVDT and VisDrone test sets. Based on the results in Table 4, we have the following findings. On UAVDT, using DLA-34 as the backbone yielded the best performance, with MOTA, MOTP, and IDF1 reaching 61.9%, 75.1%, and 66.4%, respectively, while the holistic transformer backbone produced the lowest IDSW count. On VisDrone, compared to ResNet-50, DLA-34, ViT, and Swin-L, the holistic transformer backbone achieved 38.8% MOTA, 76.3% MOTP, and 54.3% IDF1 with a significant reduction in FP. Since VisDrone contains many congested scenes, these results indicate that the holistic transformer backbone improves MOT performance in crowded scenarios. Tracking with the DLA-34 backbone was best on UAVDT but markedly worse on VisDrone; conversely, the holistic transformer backbone was slightly inferior on UAVDT but best on VisDrone. The increase in MOTA and the decrease in FP with the holistic transformer backbone indicate that our model significantly enhances the ability to detect correct objects.

Table 4. Performance evaluation of the proposed GAO-Tracker model combined with different backbone networks.

Dataset  | Backbone       | MOTA↑ | MOTP↑ | IDF1 (%)↑ | IDSW↓ | MT (%)↑ | ML (%)↑ | FP↓    | FN↓
---------|----------------|-------|-------|-----------|-------|---------|---------|--------|--------
VisDrone | ResNet-50      | 19.6  | 59.9  | 36.7      | 4287  | 35.3    | 31.3    | 9078   | 18,764
VisDrone | DLA-34         | 34.9  | 68.5  | 50.3      | 2198  | 46.3    | 43.5    | 8818   | 13,070
VisDrone | ViT            | 35.2  | 69.7  | 51.0      | 2019  | 48.9    | 45.9    | 8009   | 12,897
VisDrone | Swin-L         | 35.5  | 70.2  | 52.3      | 1509  | 51.9    | 47.6    | 6832   | 12,223
VisDrone | Holistic Trans | 38.8  | 76.3  | 54.3      | 972   | 55.9    | 52.4    | 6883   | 10,204
UAVDT    | ResNet-50      | 56.2  | 70.3  | 62.1      | 2252  | 40.4    | 22.6    | 32,743 | 72,629
UAVDT    | DLA-34         | 61.9  | 75.1  | 66.4      | 1798  | 42.4    | 23.4    | 28,705 | 65,616
UAVDT    | ViT            | 60.1  | 74.0  | 65.9      | 1504  | 42.8    | 23.7    | 26,937 | 62,348
UAVDT    | Swin-L         | 59.6  | 74.4  | 66.0      | 1264  | 43.9    | 23.8    | 25,822 | 61,324
UAVDT    | Holistic Trans | 61.7  | 75.2  | 67.9      | 1216  | 45.3    | 24.6    | 24,915 | 59,640

Based on the observations above, we conclude that the backbone network significantly impacts the tracking performance of multi-object trackers, depending on the density of tracked objects in the scene. Therefore, improving the feature extraction capability of the backbone network is a crucial factor affecting multi-object tracking performance.

4.5.2. Impact of Pre-Processing and Detection Result Classification

During the training of the multi-object tracking model, we removed highly overlapped objects to provide efficient and accurate appearance embedding information for association matching. We also explored the impact of classifying detection boxes into high- and low-score sets. As shown in Table 5, we verified the effectiveness of adding or omitting the training set optimization and the detection score branch. "Pre" indicates training with highly overlapped objects removed, while "Grade" indicates that the model distinguished between high- and low-score detection boxes before input into the GAO trajectory association pattern.
Table 5. Comparison of detection classification with and without preprocessing.

Dataset  | Method      | MOTA↑ | MOTP↑ | IDF1 (%)↑ | IDSW↓ | MT (%)↑ | ML (%)↑ | FP↓    | FN↓
---------|-------------|-------|-------|-----------|-------|---------|---------|--------|--------
VisDrone | Baseline    | 36.2  | 70.9  | 52.5      | 1344  | 53.1    | 49.3    | 9117   | 11,987
VisDrone | B+Pre       | 37.6  | 71.2  | 52.8      | 1320  | 54.3    | 50.1    | 9135   | 11,499
VisDrone | B+Grade     | 37.3  | 74.2  | 52.7      | 1138  | 54.7    | 51.2    | 9627   | 11,060
VisDrone | B+Pre+Grade | 38.8  | 76.3  | 54.3      | 972   | 55.9    | 52.4    | 6883   | 10,204
UAVDT    | Baseline    | 57.8  | 72.0  | 64.0      | 1841  | 42.4    | 23.3    | 29,057 | 67,373
UAVDT    | B+Pre       | 59.3  | 74.4  | 65.6      | 1398  | 43.8    | 23.8    | 25,836 | 62,429
UAVDT    | B+Grade     | 60.4  | 74.7  | 66.1      | 1221  | 44.5    | 23.9    | 25,418 | 60,828
UAVDT    | B+Pre+Grade | 61.7  | 75.2  | 67.9      | 1216  | 45.3    | 24.6    | 24,915 | 59,640

The results indicate that removing the ground-truth BBox annotations of occluded objects reduces errors in learning appearance embeddings, thereby improving the accuracy of tracked object identification. Differentiating between low- and high-score detection boxes effectively reduces trajectory fragmentation and IDSW, enhancing the effectiveness and performance of object tracking. With both preprocessing and detection result classification, MOTA, MOTP, and IDF1 improved by 2.6%, 5.4%, and 1.8% on VisDrone and by 3.9%, 3.2%, and 3.9% on UAVDT, respectively.

4.5.3. Impact of Matching Strategies

We validated the individual contribution of each component by combining different association strategies, as shown in Table 6. The baseline uses IOU matching for all associations, and we gradually replace it with Appear-IOU, Gau-IOU, and OSPA-IOU on top of the baseline. The results indicate that all three proposed association strategies effectively enhance association accuracy. The baseline model shows significantly higher FP and more IDSW, indicating that it introduces more false detections, which degrades trajectory matching quality and increases identity switching. After replacing the high-score detection box matching strategy with Appear-IOU, MOTA and IDF1 improved noticeably; FP increased slightly, while the stronger use of detections significantly reduced FN. After replacing the low-score detection box matching strategy with Gau-IOU, MOTA and MOTP improved significantly and IDSW decreased substantially, demonstrating the effectiveness of matching small, low-score detection boxes in Gaussian space. After substituting the OSPA-IOU distance-based object-to-trajectory matching, in which each high-score detection box is treated as a single-element set matched against the trajectory sets, all metrics improved. These results indicate that the proposed strategies jointly contribute to better overall tracking performance.

Table 6. Comparison of different association strategies.

Dataset  | Method                        | MOTA↑ | MOTP↑ | IDF1 (%)↑ | IDSW↓ | MT (%)↑ | ML (%)↑ | FP↓    | FN↓
---------|-------------------------------|-------|-------|-----------|-------|---------|---------|--------|--------
VisDrone | Baseline                      | 36.2  | 70.9  | 52.5      | 1552  | 53.1    | 49.3    | 9117   | 11,987
VisDrone | B+Appear-IOU                  | 37.1  | 74.4  | 53.2      | 1334  | 53.2    | 49.2    | 9209   | 11,027
VisDrone | B+Appear-IOU+Gau-IOU          | 38.3  | 75.6  | 53.9      | 1052  | 53.8    | 49.9    | 7343   | 10,946
VisDrone | B+Appear-IOU+Gau-IOU+OSPA-IOU | 38.8  | 76.3  | 54.3      | 972   | 55.9    | 52.4    | 6883   | 10,204
UAVDT    | Baseline                      | 57.8  | 72.0  | 64.0      | 1841  | 42.4    | 23.3    | 29,057 | 67,373
UAVDT    | B+Appear-IOU                  | 58.2  | 73.8  | 64.9      | 1536  | 43.0    | 23.7    | 29,133 | 63,781
UAVDT    | B+Appear-IOU+Gau-IOU          | 60.9  | 74.9  | 66.3      | 1297  | 45.0    | 24.0    | 25,011 | 60,369
UAVDT    | B+Appear-IOU+Gau-IOU+OSPA-IOU | 61.7  | 75.2  | 67.9      | 1216  | 45.3    | 24.6    | 24,915 | 59,640
4.5.4. Impact of VGM-PHD

We designed ablation experiments to validate the effectiveness of the VGM-PHD method, comparing it against no trajectory prediction and against a Kalman filter. The results are presented in Table 7. The findings indicate that VGM-PHD offers higher prediction accuracy and robustness than both alternatives across multiple scenarios. In complex environments in particular, it overcomes the limitations of the traditional approaches and improves the accuracy of predicting the future positions of moving objects. Moreover, the decrease in IDSW and the increase in IDF1 indicate improved stability of trajectory tracking. Consequently, overall tracking performance is enhanced.

Table 7. Comparison with and without trajectory prediction.

Dataset  | Method                   | MOTA↑ | MOTP↑ | IDF1 (%)↑ | IDSW↓ | MT (%)↑ | ML (%)↑ | FP↓    | FN↓
---------|--------------------------|-------|-------|-----------|-------|---------|---------|--------|--------
VisDrone | No trajectory prediction | 29.9  | 64.4  | 49.3      | 2497  | 42.8    | 42.8    | 8719   | 15,226
VisDrone | Kalman filter            | 35.3  | 69.9  | 50.6      | 1727  | 51.4    | 47.5    | 8998   | 12,302
VisDrone | VGM-PHD                  | 38.8  | 76.3  | 54.3      | 972   | 55.9    | 52.4    | 6883   | 10,204
UAVDT    | No trajectory prediction | 43.1  | 61.2  | 46.4      | 4437  | 32.3    | 17.4    | 49,018 | 99,620
UAVDT    | Kalman filter            | 52.8  | 68.0  | 56.1      | 3069  | 37.4    | 21.0    | 38,389 | 82,471
UAVDT    | VGM-PHD                  | 61.7  | 75.2  | 67.9      | 1216  | 45.3    | 24.6    | 24,915 | 59,640

5. Discussion

In this paper, we have integrated the strengths of joint detection and visual multi-object tracking algorithms with transformer-based visual multi-object tracking algorithms to address the unique challenges posed by drone aerial videos. Our proposed GAO-Tracker, which models object motion information, has demonstrated significant improvements in tracking performance, particularly in complex real-world scenarios.

5.1. Performance Analysis

GAO-Tracker achieves remarkable results on the VisDrone and UAVDT datasets, surpassing existing state-of-the-art methods in both accuracy and robustness. Integrating the transformer model's ability to capture global context with the joint detection and tracking methods' handling of occlusions and scale variations has proven effective. The results indicate that our approach can maintain high tracking accuracy even in challenging environments characterized by rapid object motion, complex backgrounds, and varying object scales.

5.2. Strengths

(1) Enhanced accuracy: By leveraging the transformer model's self-attention mechanism, GAO-Tracker effectively captures long-range dependencies and global contexts, which are crucial for accurately tracking multiple objects in aerial videos.

(2) Robustness to occlusions and scale variations: The joint detection and tracking methods integrated into GAO-Tracker enable it to handle occlusions and significant scale variations efficiently, ensuring continuous and reliable tracking.

(3) Practical solutions: GAO-Tracker provides practical solutions to real-world multi-object tracking problems, making it highly applicable in various domains such as surveillance, rescue operations, and urban planning.

5.3. Limitations

(1) Computational complexity: Despite its accuracy and robustness, the computational expense associated with the transformer model's fine-grained self-attention mechanism
remains a challenge. This could limit the real-time applicability of GAO-Tracker in resource-constrained environments.

(2) Scalability: While GAO-Tracker performs well on benchmark datasets, its scalability to extremely large-scale datasets or highly crowded scenes requires further exploration and optimization.

5.4. Future Directions

To further enhance the performance and applicability of GAO-Tracker, several future research directions are proposed:

(1) Algorithm optimization: Efforts will focus on optimizing the algorithm to reduce computational complexity and improve real-time performance. This includes exploring more efficient implementations of the transformer model and refining its integration with the detection and tracking components.

(2) Broader application areas: Extending the research findings to benefit more diverse fields is a key future direction. Improvements in drone-based multi-object tracking can be adapted for use in autonomous driving, security systems, wildlife monitoring, and other domains requiring accurate and reliable tracking of multiple objects.

(3) Handling complex scenarios: Further research is needed to enhance GAO-Tracker's performance in highly dynamic and crowded environments, including methods that better handle dense object interactions and rapidly changing scenes.

(4) Long-term tracking: Enhancing the system's ability to maintain long-term tracking stability and accuracy, particularly in scenarios with frequent object disappearances and reappearances, is another important area for future work.

6. Conclusions

This paper aims to integrate the strengths of joint detection and visual multi-object tracking algorithms with transformer-based visual multi-object tracking algorithms to improve the performance of multi-object tracking in drone aerial videos. Additionally, we propose a more comprehensive, robust, and efficient integrated multi-object tracking algorithm by modeling object motion information. By leveraging the advanced capabilities of transformer models to capture global contexts and the strengths of joint detection and tracking methods in handling occlusions and scale variations, our approach addresses the unique challenges posed by drone aerial videos, such as rapid object motion, complex environmental conditions, and data noise. This integration allows for more accurate and reliable tracking of multiple objects, enhancing the overall performance and robustness of tracking systems in various real-world scenarios.

A series of novel results has been achieved in drone aerial multi-object tracking, with GAO-Tracker demonstrating excellent results on the VisDrone and UAVDT datasets. These widely used benchmarks show that our method significantly outperforms existing state-of-the-art methods in both accuracy and robustness, indicating GAO-Tracker's strong potential for practical applications in surveillance, rescue operations, agriculture, and urban planning, among others. The practical solutions provided by GAO-Tracker to multi-object tracking problems in real-world scenarios offer new ideas and methods for the development of drone visual tracking. Our approach not only contributes to the current body of knowledge but also paves the way for future research in this area. In the future, efforts will focus on improving and optimizing algorithms to further enhance multi-object tracking performance.
This includes refining the integration of detection and tracking components, enhancing the efficiency of the transformer model, and exploring new ways to handle challenging scenarios such as crowded environments and dynamic backgrounds. Additionally, endeavors will be made to extend the research findings to broader application areas to benefit more diverse fields. For instance, improvements in drone-based multi-object tracking can be adapted for use in autonomous driving, security systems, wildlife monitoring, and other areas where real-time, accurate tracking of multiple objects
is critical. By expanding the applicability of our research, we aim to contribute to the advancement of technology across various domains, ultimately enhancing the capabilities and reliability of multi-object tracking systems.

Author Contributions: Conceptualization, Y.Y.; methodology, Y.Y., Y.W., and Y.L.; software, Y.P. and Y.L.; formal analysis, Y.W.; investigation, Y.W.; resources, Y.P.; data curation, L.Z., Y.P., and Y.L.; writing—original draft, Y.Y.; writing—review and editing, Y.W. and L.Z.; visualization, Y.L.; supervision, L.Z.; project administration, Y.P. All authors have read and agreed to the published version of the manuscript.

Funding: This work was supported by the Funding for Outstanding Doctoral Dissertation in NUAA under grant BCXJ24-10, the Postgraduate Research and Practice Innovation Program of Jiangsu Province under grant KYCX24_0583, the National Natural Science Foundation of China under grant 61573183, and the Natural Science Foundation of Shaanxi Province of China under grant 2024JC-YBQN-0695.

Data Availability Statement: The data presented in this study are available on request from the corresponding author.

Conflicts of Interest: The authors declare no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript:
MOT      Multiple object tracking
GAO      Gaussian, appearance, and optimal subpattern assignment
IOU      Intersection over union
OSPA     Optimal subpattern assignment
VGM-PHD  Visual Gaussian mixture probability hypothesis density
MOTA     Multiple object tracking accuracy
MOTP     Multiple object tracking precision

References
1. Wu, X.; Li, W.; Hong, D.; Tao, R.; Du, Q. Deep learning for unmanned aerial vehicle-based object detection and tracking: A survey. IEEE Geosci. Remote Sens. Mag. 2021, 10, 91–124.
2. Li, Y.; Zhang, H.; Yang, Y.; Liu, H.; Yuan, D. RISTrack: Learning Response Interference Suppression Correlation Filters for UAV Tracking. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5.
3. Dai, M.; Hu, J.; Zhuang, J.; Zheng, E. A transformer-based feature segmentation and region alignment method for UAV-view geo-localization. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 4376–4389.
4. Yi, S.; Liu, X.; Li, J.; Chen, L. UAVformer: A composite transformer network for urban scene segmentation of UAV images. Pattern Recogn. 2023, 133, 109019.
5. Yongqiang, X.; Zhongbo, L.; Jin, Q.; Zhang, K.; Zhang, B.; Feng, Q. Optimal video communication strategy for intelligent video analysis in unmanned aerial vehicle applications. Chin. J. Aeronaut. 2020, 33, 2921–2929.
6. Bochinski, E.; Eiselein, V.; Sikora, T. High-speed tracking-by-detection without using image information. In Proceedings of the 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–6.
7. Chen, G.; Wang, W.; He, Zh.; Wang, L.; Yuan, Y.; Zhang, D.; Zhang, J.; Zhu, P.; Van G.; Han, J.; Hoi, S.; Hu, Q.; Liu, M. VisDrone-MOT2021: The Vision Meets Drone Multiple Object Tracking Challenge Results. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 2839–2846.
8. Bisio, I.; Garibotto, C.; Haleem, H.; Lavagetto, F.; Sciarrone, A. Vehicular/Non-Vehicular Multi-Class Multi-Object Tracking in Drone-Based Aerial Scenes. IEEE Trans. Veh. Technol. 2023, 73, 4961–4977.
9. Lin, Y.; Wang, M.; Chen, W.; Gao, W.; Li, L.; Liu, Y. Multiple Object Tracking of Drone Videos by a Temporal-Association Network with Separated-Tasks Structure. Remote Sens. 2022, 14, 3862.
10. Al-Shakarji, N.; Bunyak, F.; Seetharaman, G.; Palaniappan, K. Multi-object tracking cascade with multi-step data association and occlusion handling. In Proceedings of the 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, 27–30 November 2018; pp. 1–6.
11. Yu, H.; Li, G.; Zhang, W.; Yao, H.; Huang, Q. Self-balance motion and appearance model for multi-object tracking in UAV. In Proceedings of the 1st ACM International Conference on Multimedia in Asia, Beijing, China, 15–18 December 2019; pp. 1–6.
12. Wang, Z.; Zheng, L.; Liu, Y.; Li, Y.; Wang, S. Towards real-time multi-object tracking. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 107–122.
13. Wu, H.; Nie, J.; He, Z.; Zhu, Z.; Gao, M. One-shot multiple object tracking in UAV videos using task-specific fine-grained features. Remote Sens. 2022, 14, 3853.
14. Shi, L.; Zhang, Q.; Pan, B.; Zhang, J.; Su, Y. Global-Local and Occlusion Awareness Network for Object Tracking in UAVs. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 8834–8844.
15. Zhou, X.; Koltun, V.; Krähenbühl, P. Tracking objects as points. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 474–490.
16. Peng, J.; Wang, C.; Wan, F.; Wu, Y.; Wang, Y.; Tai, Y.; Wang, C.; Li, J.; Huang, F.; Fu, Y. Chained-tracker: Chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 145–161.
17. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229.
18. Tsai, C.; Shen, G.; Nisar, H. Swin-JDE: Joint detection and embedding multi-object tracking in crowded scenes based on Swin-Transformer. Eng. Appl. Artif. Intel. 2023, 119, 105770.
19. Hu, M.; Zhu, X.; Wang, H.; Cao, S.; Liu, C.; Song, Q. STDFormer: Spatial-Temporal Motion Transformer for Multiple Object Tracking. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 6571–6594.
20. Zeng, F.; Dong, B.; Zhang, Y.; Wang, T.; Zhang, X.; Wei, Y. MOTR: End-to-end multiple-object tracking with transformer. In Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 659–675.
21. Cai, J.; Xu, M.; Li, W.; Xiong, Y.; Xia, W.; Tu, Z.; Soatto, S. MeMOT: Multi-object tracking with memory. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8090–8100.
22. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Hu, Q.; Ling, H. Vision meets drones: Past, present and future. arXiv 2020, arXiv:2001.06303.
23. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386.
24. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
25. Fang, Y.; Liao, B.; Wang, X.; Fang, J.; Qi, J.; Wu, R.; Niu, J.; Liu, W. You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection. Adv. Neural Inf. Process. Syst. 2021, 34, 26183–26197.
26. Li, Y.; Mao, H.; Girshick, R.; He, K. Exploring plain vision transformer backbones for object detection. In Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 280–296.
27. Ye, T.; Qin, W.; Zhao, Z.; Gao, X.; Deng, X.; Ouyang, Y. Real-Time Object Detection Network in UAV-Vision Based on CNN and Transformer. IEEE Trans. Instrum. Meas. 2023, 72, 2505713.
28. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.; Zhang, L. DN-DETR: Accelerate DETR training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13619–13627.
29. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649.
30. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-object tracking by associating every detection box. In Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 1–21.
31. Aharon, N.; Orfaig, R.; Bobrovsky, B. BoT-SORT: Robust associations multi-pedestrian tracking. arXiv 2022, arXiv:2206.14651.
32. Liu, S.; Li, X.; Lu, H.; He, Y. Multi-Object Tracking Meets Moving UAV. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8876–8885.
33. Deng, K.; Zhang, C.; Chen, Z.; Hu, W.; Li, B.; Lu, F. Jointing Recurrent Across-Channel and Spatial Attention for Multi-Object Tracking With Block-Erasing Data Augmentation. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4054–4069.
34. Xiao, C.; Cao, Q.; Zhong, Y.; Lan, L.; Zhang, X.; Cai, H.; Luo, Z. Enhancing Online UAV Multi-Object Tracking with Temporal Context and Spatial Topological Relationships. Drones 2023, 7, 389.
35. Keawboontan, T.; Thammawichai, M. Toward Real-Time UAV Multi-Target Tracking Using Joint Detection and Tracking. IEEE Access 2023, 11, 65238–65254.
36. Li, J.; Ding, Y.; Wei, H.; Zhang, Y.; Lin, W. SimpleTrack: Rethinking and improving the JDE approach for multi-object tracking. Sensors 2022, 22, 5863.
37. Sun, P.; Cao, J.; Jiang, Y.; Zhang, R.; Xie, E.; Yuan, Z.; Wang, C.; Luo, P. TransTrack: Multiple object tracking with transformer. arXiv 2020, arXiv:2012.15460.
38. Meinhardt, T.; Kirillov, A.; Leal-Taixe, L.; Feichtenhofer, C. TrackFormer: Multi-object tracking with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8844–8854.
39. Xu, Y.; Ban, Y.; Delorme, G.; Gan, C.; Rus, D.; Alameda-Pineda, X. TransCenter: Transformers with dense representations for multiple-object tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 7820–7835.
40. Zhou, X.; Yin, T.; Koltun, V.; Krähenbühl, P. Global Tracking Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8771–8780.
41. Chen, M.; Liao, Y.; Liu, S.; Wang, F.; Hwang, J. TR-MOT: Multi-Object Tracking by Reference. arXiv 2022, arXiv:2203.16621.
42. Wu, H.; He, Z.; Gao, M. GCEVT: Learning Global Context Embedding for Vehicle Tracking in Unmanned Aerial Vehicle Videos. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5.
43. Xu, X.; Feng, Z.; Cao, C.; Yu, C.; Li, M.; Wu, Z.; Ye, S.; Shang, Y. STN-Track: Multiobject Tracking of Unmanned Aerial Vehicles by Swin Transformer Neck and New Data Association Method. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 8734–8743.

Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.