Citation: Yuan, Y.; Wu, Y.; Zhao, L.; Pang, Y.; Liu, Y. Multiple Object Tracking in Drone Aerial Videos by a Holistic Transformer and Multiple Feature Trajectory Matching Pattern. Drones 2024, 8, 349. https://doi.org/10.3390/drones8080349

Academic Editor: Xiwang Dong

Received: 23 June 2024; Revised: 22 July 2024; Accepted: 26 July 2024; Published: 28 July 2024

Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Multiple Object Tracking in Drone Aerial Videos by a Holistic
Transformer and Multiple Feature Trajectory Matching Pattern
Yubin Yuan , Yiquan Wu *, Langyue Zhao, Yaxuan Pang and Yuqi Liu
College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics,
Nanjing 211106, China; harley_yuan@nuaa.edu.cn (Y.Y.); zlangyue@nuaa.edu.cn (L.Z.);
hins_pang@nuaa.edu.cn (Y.P.); tolyuqi@nuaa.edu.cn (Y.L.)
* Correspondence: imagestrong@nuaa.edu.cn; Tel.: +86-137-7666-7415
Abstract: Drone aerial videos have immense potential in surveillance, rescue, agriculture, and urban
planning. However, accurately tracking multiple objects in drone aerial videos faces challenges like
occlusion, scale variations, and rapid motion. Current joint detection and tracking methods often
compromise accuracy. We propose a drone multiple object tracking algorithm based on a holistic
transformer and multiple feature trajectory matching pattern to overcome these challenges. The
holistic transformer captures local and global interaction information, providing precise detection and
appearance features for tracking. The tracker includes three components: preprocessing, trajectory
prediction, and matching. Preprocessing categorizes detection boxes based on scores, with each
category adopting specific matching rules. Trajectory prediction employs the visual Gaussian mixture
probability hypothesis density method to integrate visual detection results to forecast object motion
accurately. The multiple feature pattern introduces Gaussian, Appearance, and Optimal subpattern
assignment distances for different detection box types (GAO trajectory matching pattern) in the data
association process, enhancing tracking robustness. We perform comparative validations on the
vision-meets-drone (VisDrone) and the unmanned aerial vehicle benchmark: object detection and tracking (UAVDT) datasets; the results affirm the algorithm's effectiveness: it obtained 38.8% and 61.7%
MOTA, respectively. Its potential for seamless integration into practical engineering applications
offers enhanced situational awareness and operational efficiency in drone-based missions.
Keywords: multiple object tracking; transformer; detection confidence; multiple feature matching
1. Introduction
In recent years, with the rapid development of drone technology, drone aerial videos
have become an effective means of acquiring high-resolution imagery over wide coverage areas and
hold significant potential in various applications such as surveillance, rescue operations,
agriculture, and urban planning [1]. Drone aerial videos capture a wide range of object
categories, including human activities, vehicles, buildings and infrastructure, and natural
environments, among others, providing rich data that endow drones with the capability
to monitor and track various objects in different application scenarios. In this context,
multiple object tracking (MOT) has become particularly important for processing drone
aerial videos, allowing systems to track and monitor multiple objects, thus enabling a
more comprehensive range of applications such as object tracking, behavior analysis, and
environmental monitoring. However, multi-object tracking in drone aerial videos faces
numerous challenges, including object occlusion, object variations at different scales, rapid
object motion, complex environmental conditions, and data noise. Traditional multi-object
tracking methods have limitations in addressing these issues, thus requiring more advanced
techniques to enhance tracking performance [2].
The majority of multi-object tracking methods for drone aerial videos are based on
detection. These methods initially identify objects in each frame using object detection
algorithms, then employ data association, motion estimation, and filter updating to resolve
occlusion and scale variations. Long-term tracking may involve object re-identification
to handle object loss. Maintaining object trajectory information and conducting analyses
improves the system’s robustness in complex environments. Additionally, to further
improve efficiency, some researchers synchronize detection and tracking, integrating both
technologies to address the challenges posed by the wide variety and complex appearances
of objects in drone aerial videos.
Transformer models, known for their self-attention mechanism and parallel comput-
ing capabilities, have revolutionized natural language processing and computer vision [3].
Their versatility extends from vision transformers to full transformer models, and they
enable breakthroughs in tasks like image classification, object detection, and semantic
segmentation; they even branch into action recognition, object tracking, and scene flow esti-
mation. In drone aerial video analyses, transformers offer fresh perspectives for multi-object
detection and tracking. Unlike convolutional neural networks, transformers emphasize
global context interactions alongside local contexts, enhancing understanding of spatial
relationships. However, the computational expense of fine-grained self-attention in high-
resolution images poses challenges. Recent studies explore solutions like coarse-grained
global or fine-grained local self-attention to alleviate the computational burden, albeit at
the cost of the ability to simultaneously model short- and long-distance visual dependencies [4].
Given these challenges and the transformative potential of transformer models, we are
motivated to explore and develop advanced multi-object tracking methods that leverage
the strengths of transformers. Therefore, we propose a multi-object tracking method named
GAO-Tracker, which is based on a holistic transformer and multiple feature trajectory
matching pattern, to address various challenges in drone aerial videos. Our goal is to
overcome the limitations of traditional approaches and enhance the performance and
robustness of MOT in drone aerial videos, enabling more accurate and reliable tracking in
diverse and complex environments.
The remaining sections of this paper are organized as follows: Section 2, Related
Work, reviews and discusses the latest advancements in the field of multiple object track-
ing (MOT). We analyze current mainstream and cutting-edge technologies, including
object-feature-based methods, joint detection and tracking methods, and transformer-based
methods, providing a solid theoretical foundation and practical background for this re-
search. Section 3, Methodology, details our proposed GAO-Tracker method for multi-object
tracking. We delve into the core concepts, including the use of a holistic transformer
and multiple feature trajectory matching pattern to address various challenges in drone
aerial videos. We describe the model structure, algorithm workflow, and implementation
details. Section 4, Experiments, presents extensive experiments and performance evalu-
ations of GAO-Tracker. We test the method on several public datasets and compare it
with state-of-the-art methods. The results demonstrate GAO-Tracker’s superior perfor-
mance and robustness in complex scenarios. Section 5, Discussion, provides an in-depth
analysis of the experimental results. We discuss GAO-Tracker’s performance in different
scenarios, analyze its strengths and limitations, and suggest potential improvements and
future research directions. Section 6, Conclusion, summarizes the main contributions and
findings of this paper. We reiterate GAO-Tracker’s innovations in enhancing multi-object
tracking performance in drone aerial videos and discuss its prospects and potential for
practical applications.
2. Related Work
This section aims to comprehensively review and discuss the latest research advance-
ments in the field of multiple object tracking. By deeply analyzing current mainstream and
cutting-edge technologies, we establish a solid theoretical foundation and practical back-
ground for this study. First, we focus on the basic framework and challenges of multiple
object tracking. Then, we detail several core methods: object-feature-based multi-object
tracking methods, which achieve continuous tracking by extracting and utilizing the ap-
pearance, motion, and other feature information of objects; joint detection and tracking
multi-object methods, which tightly integrate object detection and tracking tasks to en-
hance the overall performance and efficiency of the system; and finally, transformer-based
multi-object tracking methods, given the transformer model’s outstanding performance in
sequence data processing. We explore how these methods utilize attention mechanisms to
achieve precise and robust object tracking in complex scenarios. Through this review and
analysis, we not only present the latest achievements in the MOT field but also highlight
the current research gaps and shortcomings, leading to the research motivation and main
contributions of this paper. Our goal is to provide new insights and solutions for the
development of multi-object tracking technology.
2.1. Multiple Object Tracking
Multi-object tracking is a highly regarded technology, and its wide range of applica-
tions has attracted widespread interest among scholars. In the early stages of research,
researchers primarily focused on applying optimization algorithms to derive object trajec-
tories [5]. The IOUTracker, which relies solely on the bounding box intersection over union
(IOU), was the simplest early multi-object tracking method [6]. Researchers gradually
introduced motion models and Kalman filters to predict the positions of objects in the next
frame [7]. Although these improvements made multi-object tracking algorithms faster and
significantly improved their performance, the algorithms performed poorly in complex
occlusion and object loss situations. To address these challenges, researchers introduced
re-identification (ReID) features as appearance models, using visual features of objects be-
tween different frames to match objects and improve the accuracy of associations between
trajectories and detection results [8]. In addition to ReID, some studies have utilized image
segmentation techniques to identify and track objects, thereby better handling occlusion sit-
uations [9]. Furthermore, some researchers have begun to use recurrent neural networks or
attention mechanisms to model the spatiotemporal relationships between objects, thereby
improving tracking accuracy and stability. However, these methods often employ a single
matching approach, neglecting the different characteristics of different types of objects.
Moreover, introducing these different technological approaches into tracking systems can
result in suboptimal tracking results, limiting effectiveness.
2.2. Object-Feature-Based Multi-Object Tracking Methods
Benefiting from the rapid development of object detectors, object feature modeling has
become widely used in multi-object tracking algorithms from the perspective of drones. It
achieves multi-object tracking by capturing unique features of objects such as color, texture,
and optical flow. These extracted features must be distinctive in order to discriminate
different objects in the feature space effectively. Once these features are extracted, similarity
criteria can be utilized to find the most similar objects in the next frame, thus enabling multi-
object tracking. SCTrack adopts a three-stage data association method that combines object
appearance models, spatial distances, and explicit occlusion handling units. The system
relies on the motion patterns of tracked objects and considers environmental constraints,
thus exhibiting good performance in handling occluded objects [10]. To address the issue of
the subjective setting of fusion ratios between appearance and motion, which often merge
appearance similarity and motion consistency in the latest frame, the appearance similarity
between objects and surrounding objects is computed, object motion is predicted using
Social LSTM networks, and weighted appearance similarity and motion predictions are
used to generate associations between the current object and the object in the previous
frame [11]. However, handling large numbers of object detections and association computations significantly increases computational costs and false detections, particularly against cluttered drone aerial backgrounds; these methods therefore must maintain accuracy while mitigating computational costs, false detections, and association errors.
2.3. Joint Detection and Tracking Multi-Object Methods
To enhance the computational speed of the entire drone aerial multi-object tracking
system, researchers have actively explored methods that combine object detection and
feature extraction to achieve greater sharing in computation. JDE was the first attempt at
this approach and innovatively integrated the feature extraction branch into the single-stage
detector YOLOv3 [12]. In contrast, FairMOT balanced the handling between detection
and recognition tasks by adopting the anchor-free detector CenterNet to reduce anchor
ambiguities [13]. In addition to these joint detection and feature embedding methods,
several other single-stage trackers have emerged. GLOA designed global–local perception
blocks to extract scale variance feature information from input frames. Adding identity
embedding branches to the prediction heads outputs more discriminative identity informa-
tion [14]. CenterTrack [15] and Chained Tracker [16], on the other hand, use multi-frame
methods to predict bounding boxes in consecutive frames, facilitating efficient short-term
associations that eventually form long-term object trajectories. However, it is essential to
note that these technologies often generate many identity switches due to the difficulty of
capturing long-term dependencies. Additionally, these methods cannot simultaneously
consider multiple features of objects and differences in features among different categories,
resulting in the easy loss of tracking for some small objects.
2.4. Transformer-Based Multi-Object Tracking Methods
In recent years, transformer-based models have achieved significant success in the field
of computer vision, primarily excelling in the domain of object detection. This has given
rise to several transformer-based methods making strides in drone multi-object tracking.
Some methods based on DETR [17] and its derivative models, such as TransTrack [18],
TrackFormer [19], and MOTR [20], represent the front of online tracking and training
progress in the field of MOT. Swin-JDE leverages transformers and comprehensively
considers three factors—detection confidence, appearance embedding distance, and IoU
distance—to match each trajectory and the detection information. Furthermore, MOTR
achieves end-to-end object tracking by iteratively updating tracking queries, eliminating the
need for complex post-processing steps. MeMOT [21], similar to MOTR, utilizes attention
mechanisms to predict by focusing on object states. Despite pioneering new tracking
paradigms, these methods still fall short of advanced tracking algorithms. While standard
self-attention can capture fine-grained short- and long-distance interactions, executing
attention on high-resolution feature maps incurs high computational costs, leading to
explosive growth in time and memory costs. This paper addresses this issue through a
holistic self-attention module.
Therefore, we propose a multi-object tracking method named GAO-Tracker based on
a holistic transformer and multiple feature trajectory matching pattern to address various
challenges in drone aerial videos. The effectiveness of the proposed method is validated
through a series of experiments and quantitative analyses, and we compare it with excellent
methods of the same kind and provide new insights and methods for multi-object tracking
in drone applications. The main contributions are as follows:
(1) A framework named GAO-Tracker, which integrates object detection and tracking
in a joint detection and tracking framework for drone aerial videos, is proposed. The
framework employs a holistic transformer as the core model for object detection and
includes a GAO trajectory matching algorithm based on object features in drone aerial
videos to achieve efficient and precise multi-object tracking.
(2) The holistic transformer, which combines fine-grained local interactions and coarse-
grained global interactions, is proposed. The framework includes an object detector holistic
trans-detector using a joint anchor-free detection head to achieve accurate object detection
in drone aerial videos.
(3) A multi-object trajectory prediction and matching module named the GAO-trajectory
matching pattern is proposed; it comprehensively considers the appearance features, mo-
tion characteristics, and size features of objects and trajectories. It includes three matching
modes: Gaussian-IOU, Appear-IOU, and OSPA-IOU, fully exploiting various object and
trajectory information to achieve robust tracking of multiple objects in drone aerial videos.
(4) Using the prior information of the object’s position from the previous frame and
combining it with object visual features, a visual Gaussian mixture probability hypothesis
density (VGM-PHD) trajectory predictor tailored to the features of drone aerial videos is
designed to provide accurate trajectory information for trajectory matching.
3. Methodology
The proposed multi-object tracking system for drone aerial videos consists of the holis-
tic trans-detector module and the GAO-trajectory matching pattern trajectory association
module. The holistic trans-detector model is an anchor-free object detector and feature
extraction module that integrates holistic self-attention, combining fine-grained local and
coarse-grained global interactions. In this new mechanism, each token finely attends to
its nearest surrounding tokens and coarsely attends to its distant surrounding tokens,
effectively capturing short-term and long-term visual dependencies. The GAO-trajectory
matching pattern trajectory association module handles the data association process by si-
multaneously considering detection confidence, appearance embedding distance, and IOU
distance, thereby enhancing the tracking robustness of the MOT model. The framework is
illustrated in Figure 1.
Figure 1. GAO-Tracker framework.
3.1. Holistic Trans-Detector: Object Detection and Feature Extraction
To adapt to high-resolution visual tasks, the model produces high-resolution feature maps in its early stages. The entire model adopts a hierarchical design consisting of
four stages, each reducing the resolution of the input feature map and expanding the recep-
tive field layer by layer, like a CNN. The framework is shown in Figure 2. At the beginning
of the input, a patch embedding layer splits the image into non-overlapping blocks and projects each block into an embedding vector. Each stage is composed of multiple holistic transformer
layers. The specific structure of the holistic transformer layer is shown in Figure 3; it is
mainly composed of LayerNorm, MLP (multi-layer perceptron), and holistic attention.
Figure 2. Holistic trans-detector.
Figure 3. Holistic transformer.
An image with a resolution of H × W × 3 is first divided into blocks of size 4 × 4, resulting in (H/4) × (W/4) patches of dimension 4 × 4 × 3. Then, these patches are projected into features
of dimension d using a convolutional layer for which the kernel size and stride are both
equal to 4. Given this spatial feature map, it is passed through four stages of concatenated
holistic transformer layers. In each stage, the holistic transformer block consists of 2, 2, 18,
and 2 holistic transformer layers, respectively. The selected configuration aims to capture
complex features at different levels of abstraction gradually. In the initial stage, there are
two layers, each aimed at capturing low-level features. In the middle stage, 18 layers focus
on learning high-level and complex features. In the final stage, two layers refine these
features to achieve precise tracking. After each stage, a patch embedding layer is added
to reduce the spatial dimensions of the feature map by half while doubling the feature
dimension. Finally, the feature maps from all four stages are sent to the detection head,
which simultaneously outputs appearance feature vectors of the objects for multi-object
trajectory matching.
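For readers who prefer a concrete reference, the following is a minimal sketch of the hierarchical stage layout described above (4 × 4 patch embedding, stage depths of 2, 2, 18, and 2, halving spatial resolution and doubling channel width after each stage). The module names, the base width of 96, and the use of identity placeholders for the holistic transformer layers are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class PatchMerge(nn.Module):
    """Reduce spatial resolution and change channels with a strided convolution."""
    def __init__(self, dim_in, dim_out, stride):
        super().__init__()
        self.proj = nn.Conv2d(dim_in, dim_out, kernel_size=stride, stride=stride)

    def forward(self, x):                    # x: (B, C, H, W)
        return self.proj(x)

class HolisticBackboneSketch(nn.Module):
    """Illustrative 4-stage layout: depths (2, 2, 18, 2), widths doubling per stage."""
    def __init__(self, embed_dim=96, depths=(2, 2, 18, 2)):
        super().__init__()
        self.patch_embed = PatchMerge(3, embed_dim, stride=4)   # 4x4 patch embedding
        dims = [embed_dim * 2 ** i for i in range(len(depths))]
        self.stages = nn.ModuleList()
        for i, depth in enumerate(depths):
            blocks = nn.Sequential(*[
                nn.Identity()                # placeholder for a holistic transformer layer
                for _ in range(depth)
            ])
            down = (PatchMerge(dims[i], dims[i + 1], stride=2)
                    if i < len(depths) - 1 else nn.Identity())
            self.stages.append(nn.ModuleDict({"blocks": blocks, "down": down}))

    def forward(self, x):
        x = self.patch_embed(x)              # (B, 96, H/4, W/4)
        feats = []
        for stage in self.stages:
            x = stage["blocks"](x)
            feats.append(x)                  # multi-scale maps for the detection head
            x = stage["down"](x)
        return feats

feats = HolisticBackboneSketch()(torch.randn(1, 3, 256, 256))
print([f.shape for f in feats])              # strides 4, 8, 16, 32; widths 96, 192, 384, 768
```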
Traditional transformer models face high computational and memory costs with large-
scale input data due to the global self-attention mechanism, which considers all tokens in
the input sequence. A holistic transformer addresses this by partitioning the input feature
map into sub-windows and conducting attention operations on each sub-window, reducing
computation and memory usage.
For a feature map x ∈ R^{M×N×d} of spatial size M × N, we first divide it into partitions
of size 4 × 4, with each partition serving as a feature perception core in order to perform
attention perception within a localized context. Then, we locate the surrounding context
for each window instead of individual tokens. Sub-window pooling is a core component
of a holistic transformer and divides the input feature map into smaller sub-windows,
thereby reducing the number of tokens each attention operation needs to focus on. This
segmentation and pooling transforms global attention operations into local operations,
making the model more scalable and efficient. The process is illustrated in Figure 4.
Figure 4. Holistic self-attention. We initially partition the feature map into 4 × 4 grids. While the
central 4 × 4 grid serves as the query window, we extract tokens at three granularity levels of 1 × 1,
2 × 2, and 4 × 4, respectively, from surrounding regions to serve as its keys and values. This results in
tokens with dimensions of 8 × 8, 6 × 6, and 5 × 5. Ultimately, these tokens from the three levels are
concatenated to compute the keys and values for the 4 × 4 = 16 tokens (queries) within the window.
Suppose the input feature map is denoted as x ∈ R^{M×N×d}, where M × N represents the spatial dimensions and d represents the feature dimension. Sub-window pooling is performed in parallel on the feature map at three levels l ∈ {1, 2, 4}: the input feature map x is divided into grids of size l × l, and a simple linear layer f_p^l performs the spatial sub-window pooling, as shown in Equation (1).

x^l = f_p^l(\hat{x}) ∈ R^{(M/l) × (N/l) × d}    (1)

where \hat{x} = Restructure(x) ∈ R^{((M/l) × (N/l) × d) × (l × l)}. The pooled feature maps at the different levels l provide rich fine-grained and coarse-grained information.
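A minimal sketch of the sub-window pooling in Equation (1) is given below: the feature map is pooled over l × l grids at levels 1, 2, and 4 and then passed through a per-level linear layer f_p^l. The module names and the use of average pooling inside each grid are assumptions for illustration, not the authors' exact operator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubWindowPooling(nn.Module):
    """Pool the feature map over l x l grids at levels 1, 2, 4 (cf. Eq. 1)."""
    def __init__(self, dim, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in levels])  # f_p^l per level

    def forward(self, x):                        # x: (B, M, N, d)
        outs = []
        for level, proj in zip(self.levels, self.proj):
            pooled = x if level == 1 else F.avg_pool2d(
                x.permute(0, 3, 1, 2), kernel_size=level).permute(0, 2, 3, 1)
            outs.append(proj(pooled))            # x^l: (B, M/l, N/l, d)
        return outs

pool = SubWindowPooling(dim=96)
levels = pool(torch.randn(1, 64, 64, 96))
print([t.shape for t in levels])                 # shapes (1,64,64,96), (1,32,32,96), (1,16,16,96)
```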
3.1.1. Attention Computation
After obtaining the pooled feature maps at all levels, three linear projection layers f_q, f_k, and f_v are used to compute the query from the first level and the keys and values from all levels, as shown in Equations (2)–(4).

Q = f_q(x^1)    (2)

K^l = f_k(x^l)    (3)

V^l = f_v(x^l)    (4)

To perform holistic self-attention, surrounding tokens must be extracted for each query token in the feature map. For the queries within the i-th window, Q_i ∈ R^{s_p × s_p × d}, the keys K_i ∈ R^{s × d} and values V_i ∈ R^{s × d} are extracted from the K^l and V^l surrounding the window, where l denotes the level of the keys and values, and s is the total number of tokens gathered over all holistic levels, i.e., s = 8 × 8 + 6 × 6 + 5 × 5. Finally, the holistic self-attention for Q_i is computed as shown in Equation (5).

Attention(Q_i, K_i, V_i) = Softmax(Q_i K_i^T / √d + B) V_i    (5)

where B = {B^l} is a learnable relative position bias. For the first level, it is parameterized as B^1 ∈ R^{7×7}; for the other holistic levels, considering their coarser granularity towards queries, all queries within the window are treated equally, and B^l ∈ R^{s_r^l × s_r^l} represents the relative position bias between the query window and each pooled s_r^l × s_r^l region.
The relative position deviation takes into account the positional relationships between
different sub-windows. This allows the model to understand the dependencies between
different positions better, thus enabling more accurate attention computation. The intro-
duction of relative position deviation enhances the flexibility and expressive power of the
model, enabling it to adapt better to different types of input data.
Since the attention operations for each sub-window are independent, modern hard-
ware and parallel computing frameworks can be leveraged to accelerate the model’s
training and inference processes.
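As a concrete illustration of Equations (2)–(5), the sketch below computes holistic attention for a single 4 × 4 query window with keys and values pooled at levels 1, 2, and 4; the token counts (8 × 8, 6 × 6, 5 × 5) follow Figure 4. The tensor shapes, the shared key/value source, and the helper name are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def holistic_attention_window(q_win, kv_levels, bias, dim):
    """Attention for one query window (Eq. 5).

    q_win:     (16, d)  -- the 4x4 = 16 query tokens of the window.
    kv_levels: list of (s_l, d) key/value source tokens pooled at levels 1, 2, 4
               (8x8=64, 6x6=36, 5x5=25 tokens, so s = 125 in total).
    bias:      (16, s)  -- learnable relative position bias, concatenated over levels.
    """
    kv = torch.cat(kv_levels, dim=0)          # (s, d): tokens from all levels
    k, v = kv, kv                             # shared source here; separate f_k, f_v in practice
    scores = q_win @ k.t() / dim ** 0.5 + bias  # (16, s) scaled dot-product scores + bias
    attn = F.softmax(scores, dim=-1)
    return attn @ v                           # (16, d) attended outputs

# Toy usage with the token counts from Figure 4.
d = 96
q = torch.randn(16, d)
levels = [torch.randn(64, d), torch.randn(36, d), torch.randn(25, d)]
bias = torch.zeros(16, 64 + 36 + 25)
out = holistic_attention_window(q, levels, bias, d)
print(out.shape)                              # torch.Size([16, 96])
```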
3.1.2. Detection Head
We designed an anchor-free prediction head based on the CenterNet architecture and
divided it into detection and appearance branches. Through holistic transformer feature
extraction, the output feature map is provided to both branches for object detection and
appearance embedding. The detection branch consists of three heads, which are used to
predict the heatmap, the offset of the object’s center point, and the object’s size, respectively.
The heatmap head is utilized to predict the center position of the object, with an
output dimension of h × w × Cls, where h and w represent the height and width of the
input feature map, and Cls is the number of detection classes. Each class has its own
heatmap output, with each Gaussian peak in the heatmap representing the center position
of the detected object. Assuming there are N objects in the current training sample, let (c_x^i, c_y^i) denote the center position of the i-th object, i ∈ [1, N]. Then, the heatmap corresponding to the current training sample is calculated as shown in Equation (6).

M_{xy} = Σ_{i=1}^{N} exp( -((x - ⌊c_x^i/4⌋)^2 + (y - ⌊c_y^i/4⌋)^2) / (2σ_c^2) )    (6)

Here, the operator ⌊a⌋ returns the largest integer not exceeding a, and σ_c is the standard deviation parameter. M ∈ R^{h×w×Cls} represents the output of the heatmap head, and M_{xy} is the value of M at position (x, y).
The box size and center offset heads are used to predict the BBox and the offset of the object's center point, respectively. Let BBox^i = (x_lt^i, y_lt^i, x_rb^i, y_rb^i) represent the BBox of the i-th object, where (x_lt^i, y_lt^i) and (x_rb^i, y_rb^i) are the top-left and bottom-right coordinates of the object, respectively. The offset of the center point of the i-th object is defined as shown in Equation (7).

o_{xy}^i ≜ (δ_x^i, δ_y^i) = ( c_x^i/4 - ⌊c_x^i/4⌋,  c_y^i/4 - ⌊c_y^i/4⌋ )    (7)

This helps improve the accuracy of predicting the center position of the object. The term ô ∈ R^{h×w×2} represents the output of the center offset head, and ô_{xy}^i represents the offset prediction of the i-th object at position (x, y) on ô.
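A minimal sketch of how the training targets in Equations (6) and (7) can be generated is given below; the stride of 4 and the per-class heatmap layout follow the text, while the array names and the fixed σ_c are assumptions for illustration.

```python
import numpy as np

def build_targets(centers, classes, h, w, num_cls, sigma_c=2.0, stride=4):
    """Gaussian heatmap (Eq. 6) and center-offset (Eq. 7) targets for one sample.

    centers: (N, 2) array of object centers (c_x, c_y) in input-image pixels.
    classes: (N,) integer class indices in [0, num_cls).
    """
    heatmap = np.zeros((h, w, num_cls), dtype=np.float32)
    offsets = np.zeros((h, w, 2), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]                           # feature-map coordinates
    for (cx, cy), cls in zip(centers, classes):
        gx, gy = int(cx // stride), int(cy // stride)     # floor(c/4), as in Eq. (6)
        gauss = np.exp(-((xs - gx) ** 2 + (ys - gy) ** 2) / (2 * sigma_c ** 2))
        heatmap[:, :, cls] += gauss                       # Eq. (6) sums per-object Gaussians
        if 0 <= gx < w and 0 <= gy < h:
            offsets[gy, gx] = (cx / stride - gx, cy / stride - gy)  # Eq. (7)
    return heatmap, offsets

hm, off = build_targets(np.array([[100.0, 60.0]]), np.array([0]), h=128, w=160, num_cls=5)
```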
The appearance branch is responsible for generating embedding features that assist in
identifying the object. Each head consists of a 3 × 3 convolutional layer with 256 channels,
followed by a 1 × 1 convolutional layer to produce the final output. The embedding heads
of the appearance branch calculate the appearance feature vectors of the object, which are
used in the association matching operation for multi-object tracking tasks. Specifically,
these appearance feature vectors can be used for association matching to calculate the
similarity between the tracker and the detected object. A 128-dimensional vector at position
(x, y) represents the appearance feature vector of the object at that location.
3.2. GAO Trajectory Matching Pattern
Our GAO trajectory matching pattern considers detection confidence, appearance
embedding distance, and IoU distance to associate all tracking trajectories with all detection
Bboxes. Figure 5 illustrates the architecture of the module. When receiving detection results
from the detector output, we add detection Bboxes with confidence scores higher than
0.5 to the high-score detection Bbox set, and those between 0.2 and 0.5 are added to the
low-score detection Bbox set.
Figure 5. GAO trajectory matching module.
Initially, predicted trajectories are matched with high-score detection boxes using the
Appear-IOU matching method. Unmatched trajectories then undergo secondary matching
with low-score detection boxes via Gau-IOU matching, with any remaining unmatched
low-score boxes removed. Subsequently, high-score detection boxes that were not initially
matched are re-evaluated using (optimal subpattern assignment) OSPA-IOU matching with
previously unmatched trajectories from the previous frame. High-score boxes unmatched
after both attempts are considered new trajectories, while trajectories that have been
continuously unmatched for 30 frames are removed from tracking, with flexibility to adjust
based on the video frame rate.
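The following control-flow sketch summarizes the cascade described above (Appear-IOU for high-score boxes, Gau-IOU for low-score boxes against leftover trajectories, OSPA-IOU for leftover high-score boxes against inactive trajectories, and a 30-frame removal rule). The data classes and the greedy IOU matcher are placeholders for the cost models of Sections 3.2.1–3.2.3, not the released code.

```python
from dataclasses import dataclass

@dataclass
class Det:
    box: tuple          # (x1, y1, x2, y2)
    score: float

@dataclass
class Track:
    box: tuple
    missed: int = 0
    def update(self, det):
        self.box, self.missed = det.box, 0

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def greedy_match(tracks, dets, thr=0.3):
    """Placeholder matcher; the real pattern uses Appear-IOU / Gau-IOU / OSPA-IOU costs."""
    matches, used, un_tracks = [], set(), []
    for t in tracks:
        best = max(((iou(t.box, d.box), i) for i, d in enumerate(dets) if i not in used),
                   default=(0.0, -1))
        if best[0] >= thr:
            matches.append((t, dets[best[1]]))
            used.add(best[1])
        else:
            un_tracks.append(t)
    un_dets = [d for i, d in enumerate(dets) if i not in used]
    return matches, un_tracks, un_dets

def gao_associate(tracks, inactive, dets, max_missed=30):
    high = [d for d in dets if d.score > 0.5]
    low  = [d for d in dets if 0.2 <= d.score <= 0.5]
    m1, un_tracks, un_high = greedy_match(tracks, high)          # Appear-IOU step
    m2, un_tracks, _       = greedy_match(un_tracks, low)        # Gau-IOU step (low boxes)
    m3, un_inactive, un_high = greedy_match(inactive, un_high)   # OSPA-IOU step
    for t, d in m1 + m2 + m3:
        t.update(d)                                              # update with matched detections
    new_tracks = [Track(box=d.box) for d in un_high]             # leftover high boxes -> new tracks
    for t in un_tracks + un_inactive:
        t.missed += 1
    kept = [t for t in un_tracks + un_inactive if t.missed <= max_missed]
    return m1 + m2 + m3, new_tracks, kept
```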
Successful matches update tracking through the update process with matched de-
tection frames. Trajectory prediction involves modeling visual objects’ trajectories as a
random finite set, utilizing the visual Gaussian mixture probability hypothesis density
to generate prediction information for the tracker, which primes the model for the next
frame’s association matching.
The data association process employs four distance metrics, leading to the design of
three matching methods: Gau-IOU, Appear-IOU, and OSPA-IOU distance matching.
3.2.1. Appear-IOU Distance Matching
Appear-IOU trajectory matching considers the appearance and spatial location features
between the object and predicted trajectories while calculating the cosine distance and
IOU distance between all predicted trajectories and high-scored detection appearances as
metrics. The appearance vector of the object contains extensive appearance information,
which is combined with the IOU distance of the BBox to enhance the matching accuracy
between detection boxes and trajectories. The process is shown in Figure 6.
Figure 6. Appear-IOU trajectory matching.
Let (BBox_d^i, E_d^i) denote the BBox of the i-th detected object and its corresponding feature vector in the current frame, and let (BBox_t^j, E_t^j) denote the BBox of the j-th trajectory-predicted object and its corresponding feature vector from the previous frame. The first distance metric D_{ij}^I is computed from the IOU distance:

D_{ij}^I = 1 - area(BBox_d^i ∩ BBox_t^j) / area(BBox_d^i ∪ BBox_t^j)    (8)
where area(A) represents the area of the input set A, and the symbols ∩ and ∪ denote the intersection and union of two sets. The appearance distance metric D_{ij}^A is calculated from the cosine distance between the two embedding feature vectors:

D_{ij}^A = 1 - (E_d^i · E_t^j) / (∥E_d^i∥ ∥E_t^j∥)    (9)

where · denotes the dot product between two vectors, and ∥·∥ denotes the 2-norm of a vector.
Subsequently, the IOU distances and appearance distances between all detections and
trajectories are combined in a weighted manner to obtain the Appear-IOU distance:
D^{AI} = α D_{ij}^A + (1 - α) D_{ij}^I    (10)

where α represents the proportion of the cosine distance, with values ranging between 0 and 1.
Finally, all Appear-IOU distances are merged into a cost matrix, and the Hungar-
ian algorithm is employed to achieve the best match. Unmatched trajectories undergo
secondary matching with low-scored detections through the Gau-IOU matching model.
In contrast, unmatched high-scored detection boxes undergo secondary matching with
inactive trajectories through the OSPA-IOU matching model.
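A compact sketch of Equations (8)–(10) followed by Hungarian assignment is shown below; the corner-format boxes, the weight α, and the acceptance threshold are illustrative assumptions, and scipy's linear_sum_assignment stands in for the Hungarian solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_dist(boxes_d, boxes_t):
    """Pairwise IOU distance (Eq. 8); boxes are (N, 4) arrays of (x1, y1, x2, y2)."""
    x1 = np.maximum(boxes_d[:, None, 0], boxes_t[None, :, 0])
    y1 = np.maximum(boxes_d[:, None, 1], boxes_t[None, :, 1])
    x2 = np.minimum(boxes_d[:, None, 2], boxes_t[None, :, 2])
    y2 = np.minimum(boxes_d[:, None, 3], boxes_t[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_d = (boxes_d[:, 2] - boxes_d[:, 0]) * (boxes_d[:, 3] - boxes_d[:, 1])
    area_t = (boxes_t[:, 2] - boxes_t[:, 0]) * (boxes_t[:, 3] - boxes_t[:, 1])
    return 1.0 - inter / (area_d[:, None] + area_t[None, :] - inter + 1e-9)

def appear_iou_match(boxes_d, emb_d, boxes_t, emb_t, alpha=0.6, max_cost=0.8):
    """Weighted Appear-IOU cost (Eqs. 9-10) solved with the Hungarian algorithm."""
    emb_d = emb_d / (np.linalg.norm(emb_d, axis=1, keepdims=True) + 1e-9)
    emb_t = emb_t / (np.linalg.norm(emb_t, axis=1, keepdims=True) + 1e-9)
    cos_dist = 1.0 - emb_d @ emb_t.T                                 # Eq. (9)
    cost = alpha * cos_dist + (1 - alpha) * iou_dist(boxes_d, boxes_t)  # Eq. (10)
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
```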
3.2.2. Gau-IOU Distance Matching
The Gau-IOU distance matching process is illustrated in Figure 7. Low-scored detec-
tion boxes often represent small objects. In order to better extract object features, both the
low-scored detections and the trajectories to be matched are transformed into Gaussian
space. This transformation integrates the Wasserstein distance (WD) and the IOU distance
between the Gaussian distributions of trajectories and objects.
Figure 7. Gau-IOU trajectory matching.
We first transform the BBoxes of the object and the trajectory into Gaussian space using a matrix transformation. For an object box represented by (x, y, h, w), the parameters of the Gaussian distribution N(x | µ, Σ) are computed as:

µ = [x, y]^T    (11)

Σ = diag(w²/4, h²/4)    (12)
The key to matching detection boxes with trajectories is how to calculate the similarity
between the Gaussian distributions Nd(xd|µd, Σd) of the detection box and Nt(xt|µt, Σt) of
the trajectory box. We use the Wasserstein distance to compute the distance between the
two Gaussian distributions. The Wasserstein distance between two Gaussian distributions
is defined as:
D_W(N_d, N_t) = ∥µ_d - µ_t∥² + Tr(Σ_d) + Tr(Σ_t) - 2 Tr( (Σ_d^{1/2} Σ_t Σ_d^{1/2})^{1/2} )    (13)
The Wasserstein distance primarily consists of two components: the distance between
the center points, represented by (x, y), and a coupling term related to (h, w). Due to
the chain-like coupling relationship formed by these parameters, which causes them to
influence each other, the Wasserstein distance is highly advantageous for achieving high-
precision matching.
Next, the IOU distances and the WD distances between all detections and trajectories are weighted to obtain the Gau-IOU distance:

D^{GI} = β D_W + (1 - β) D_{ij}^I    (14)
where β represents the proportion of the WD distance and takes values between 0 and
1. Finally, the Hungarian algorithm is employed to achieve the best matching between
detections and trajectories based on all Gau-IOU distances. Unmatched trajectories are
converted to inactive trajectories, and unmatched low-scored detections are removed.
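A minimal sketch of the box-to-Gaussian conversion (Eqs. 11-12) and the closed-form Wasserstein distance between two such Gaussians (Eq. 13) follows; the diagonal-covariance simplification mirrors Equation (12), while the squashing of the WD term and the fusion weight β are assumed values.

```python
import numpy as np

def box_to_gaussian(box):
    """(x, y, h, w) center-format box -> (mu, Sigma) per Eqs. (11)-(12)."""
    x, y, h, w = box
    mu = np.array([x, y], dtype=float)
    sigma = np.diag([w ** 2 / 4.0, h ** 2 / 4.0])
    return mu, sigma

def wasserstein_sq(mu_d, sig_d, mu_t, sig_t):
    """Squared 2-Wasserstein distance between Gaussians (Eq. 13).

    With diagonal covariances, the matrix square roots reduce to element-wise sqrt.
    """
    center = np.sum((mu_d - mu_t) ** 2)
    sqrt_d = np.sqrt(sig_d)                              # valid because sig_d is diagonal
    cross = sqrt_d @ sig_t @ sqrt_d                      # also diagonal
    coupling = np.trace(sig_d) + np.trace(sig_t) - 2 * np.trace(np.sqrt(cross))
    return center + coupling

def gau_iou_cost(det_box, trk_box, iou_d, beta=0.5):
    """Eq. (14): weighted WD + IOU distance (iou_d comes from Eq. 8)."""
    dw = wasserstein_sq(*box_to_gaussian(det_box), *box_to_gaussian(trk_box))
    dw = dw / (1.0 + dw)                                 # squash WD to [0, 1) before mixing (assumption)
    return beta * dw + (1 - beta) * iou_d
```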
3.2.3. OSPA-IOU Distance Matching
The OSPA distance allows for considering subpattern matching of object trajectories,
enabling the model to better capture both the similarities and differences between object
trajectories. This, in turn, provides a more accurate assessment of tracking performance.
Building upon the foundation of IOU distance matching, we comprehensively consider
the OSPA distance and propose the OSPA-IOU trajectory matching model. The process is
illustrated in Figure 8.
Figure 8. OSPA-IOU trajectory matching.
Assume the object state set is X = {x_1, x_2, . . . , x_m} and the object trajectory set is Y = {y_1, y_2, . . . , y_n}, where m, n ∈ N_0 = {0, 1, 2, . . .} represent the estimated and true numbers of objects, respectively. For m ≤ n, the OSPA distance is expressed as:

D_{p,c}(X, Y) = ( (1/n) [ min_{π∈Π_n} Σ_{i=1}^{m} d_c(x_i, y_{π(i)})^p + (n - m) c^p ] )^{1/p}    (15)
where Πn represents all permutations for selecting numbers from the set {1, 2, . . . , n} . If
p = 1, the OSPA distance can be expressed as:
D_{p,c}(X, Y) = e_{p,c}^{loc}(X, Y) + e_{p,c}^{card}(X, Y)    (16)

e_{p,c}^{loc}(X, Y) = ( (1/n) min_{π∈Π_n} Σ_{i=1}^{m} d_c(x_i, y_{π(i)})^p )^{1/p}    (17)

e_{p,c}^{card}(X, Y) = ( (1/n) (n - m) c^p )^{1/p}    (18)

where e_{p,c}^{loc}(X, Y) and e_{p,c}^{card}(X, Y) represent the positional difference and cardinality difference
between the sets of estimated object states and true object states, respectively. The posi-
tional difference signifies the spatial gap, while the cardinality difference encompasses
performance metrics like the false track proportion, redundancy, and interruptions. The
truncation parameter adjusts the balance between positional and cardinality differences,
with smaller values prioritizing positional differences. Treating objects as single-element
sets and trajectories as multi-element sets, we compute the OSPA distance between them to
optimize matching between individual detections and trajectories.
Subsequently, the IOU distances and the OSPA distances between all detections and trajectories are weighted to derive the OSPA-IOU distance:

D^{OI} = λ D_{p,c} + (1 - λ) D_{ij}^I    (19)
where λ represents the proportion of the OSPA distance and takes values between 0 and 1.
Finally, the Hungarian algorithm is applied to achieve the best matching between detections
and trajectories based on all OSPA-IOU distances. Unmatched high-score detections are
converted to new inactive trajectories, and inactive trajectories that remain unmatched for
30 frames are removed.
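As an illustration of Equations (15)-(18) with p = 1, the sketch below computes the OSPA distance between a detection (treated as a single-element set) and a multi-element trajectory set; the Euclidean base distance, the cutoff value c, and the point-set representation are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ospa_p1(X, Y, c=1.0):
    """OSPA distance with p = 1 (Eqs. 16-18) between point sets X (m, 2) and Y (n, 2)."""
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    m, n = len(X), len(Y)
    if m > n:                                   # OSPA assumes m <= n; swap otherwise
        X, Y, m, n = Y, X, n, m
    if n == 0:
        return 0.0
    # Cutoff base distance d_c(x, y) = min(c, ||x - y||).
    d = np.minimum(c, np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1))
    rows, cols = linear_sum_assignment(d)       # optimal sub-pattern assignment
    e_loc = d[rows, cols].sum() / n             # Eq. (17) with p = 1
    e_card = (n - m) * c / n                    # Eq. (18) with p = 1
    return e_loc + e_card                       # Eq. (16)

# A detection as a single-element set vs. a trajectory as a multi-element set.
det = np.array([[10.0, 12.0]])
trajectory = np.array([[9.0, 11.0], [10.5, 12.5], [12.0, 14.0]])
print(ospa_p1(det, trajectory, c=5.0))
```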
3.2.4. Visual Gaussian Mixture Probability Hypothesis Density
The visual Gaussian mixture probability hypothesis density (VGM-PHD) filtering
algorithm utilizes the center positions of all trajectories as the measurement input for
the random finite set, preserving object ID and size data to reconstruct trajectories. As-
sumptions include representing both spawned and newly born object PHDs as Gaussian
mixtures, independence between object detection and survival probabilities, and modeling
state transition density and observation likelihood functions as linear Gaussian models.
Both the motion model and observation model of the VGM-PHD filtering algorithm
are set to be linear, and noise and errors follow Gaussian distributions. Using the weights,
means, and variances of the PHD Gaussian distribution, the algorithm iteratively propa-
gates the multi-object states. The specific implementation steps of the VGM-PHD filtering
algorithm are as follows. Assuming the posterior PHD at a specific time is given by the
following Gaussian sum form:
T_{k-1}(x) = Σ_{i=1}^{J_{k-1}} ω_{k-1}^i N(x; m_{k-1}^i, P_{k-1}^i)    (20)

where ω_k^i, m_k^i, and P_k^i represent the weight, mean, and covariance of the i-th Gaussian component at time k for a single object state x, and J_k represents the number of Gaussian components at time k. The function N(·) denotes a Gaussian density. The predicted intensity function at time k is given by:
T_{k|k-1}(x) = T_{S,k|k-1}(x) + T_{β,k|k-1}(x) + γ_k(x)    (21)
The three terms on the right side respectively represent the predicted PHDs of surviv-
ing objects, spawned objects, and newly born objects. The intensity function obtained from
the GM-PHD filtering algorithm update can be expressed as:
T_k(x) = (1 - P_{D,k}) T_{k|k-1}(x) + Σ_{z∈Z_k} T_{D,k}(x; z)    (22)
where the first term represents the PHD of missed objects, and the second term represents
the updated PHD of detected objects. In the VGM-PHD filtering algorithm, if the PHD at time k − 1 is a Gaussian mixture, then the prior distribution generated by the prediction at time k and the posterior distribution obtained by the filtering update can both be represented in Gaussian mixture form. The weights can be obtained through PHD filtering, while the means and
covariances are recursively obtained through Kalman filtering. During the prediction and
update of the object PHD in VGM-PHD, the predicted object numbers Nk|k−1 and updated
object numbers Nk are given by:
N_{k|k-1} = Σ_{i=1}^{J_{k|k-1}} ω_{k|k-1}^i = N_{k-1} ( P_{S,k} + Σ_{i=1}^{J_{β,k}} ω_{β,k}^i ) + Σ_{j=1}^{J_{γ,k}} ω_{γ,k}^j    (23)

N_k = Σ_{n=1}^{J_k} ω_k^n = N_{k|k-1} (1 - P_{D,k}) + Σ_{z∈Z_k} Σ_{j=1}^{J_{k|k-1}} ω_k^j(z)    (24)
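The sketch below illustrates the linear-Gaussian prediction and update recursion behind Equations (20)-(24) for the Gaussian mixture components; the constant-velocity model, the survival/detection probabilities, and the clutter intensity are assumed values, and the birth and spawn terms are omitted for brevity.

```python
import numpy as np

F = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)  # constant velocity
H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)                               # observe (x, y)
Q, R = np.eye(4) * 1.0, np.eye(2) * 4.0
P_S, P_D, kappa = 0.99, 0.9, 1e-4          # survival prob., detection prob., clutter intensity

def gmphd_step(weights, means, covs, measurements):
    """One GM-PHD prediction + update over components (w_i, m_i, P_i); birth/spawn omitted."""
    # Prediction for surviving components: scale weights by P_S, Kalman-predict mean/covariance.
    weights = [P_S * w for w in weights]
    means = [F @ m for m in means]
    covs = [F @ P @ F.T + Q for P in covs]

    # Missed-detection terms: (1 - P_D) * predicted PHD.
    new_w = [(1 - P_D) * w for w in weights]
    new_m = list(means)
    new_P = list(covs)

    # Measurement-updated terms for every measurement z in Z_k.
    for z in measurements:
        ws, ms, Ps = [], [], []
        for w, m, P in zip(weights, means, covs):
            S = H @ P @ H.T + R
            K = P @ H.T @ np.linalg.inv(S)
            innov = z - H @ m
            lik = np.exp(-0.5 * innov @ np.linalg.inv(S) @ innov) / \
                  (2 * np.pi * np.sqrt(np.linalg.det(S)))
            ws.append(P_D * w * lik)
            ms.append(m + K @ innov)
            Ps.append((np.eye(4) - K @ H) @ P)
        norm = kappa + sum(ws)                  # clutter intensity + sum over components
        new_w += [wi / norm for wi in ws]
        new_m += ms
        new_P += Ps
    return new_w, new_m, new_P                  # expected object count N_k = sum(new_w)
```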
4. Experiments
4.1. Dataset and Evaluation Metrics
The proposed algorithm undergoes comprehensive evaluations on the VisDrone
MOT [22] and UAVDT [23] datasets, which encompass diverse drone-captured scenes
and facilitate a thorough assessment of the proposed methods’ practical effectiveness. Ex-
tensive evaluations compare the algorithm with other leading multi-object trackers across
various scenarios and conditions. Established MOT evaluation metrics are utilized to
assess performance comprehensively, with the aim of gauging overall effectiveness and
pinpointing potential weaknesses in each model. The metrics include:
(1) FP (↓): Number of false positives in the entire video.
(2) FN (↓): Number of false negatives in the entire video.
(3) IDSW (↓): Number of identity switches in the entire video.
(4) FM (↓): Number of ground truth trajectories interrupted during the tracking process.
(5) IDF1 (↑): Ratio of correctly identified detections to the average number of computed detections and ground truth detections.
(6) MOTA (↑): Combines FP, FN, and IDSW, computed as follows:

MOTA = 1 - (FN + FP + IDSW) / GT    (25)

where GT is the total number of ground truth objects.
(7) MOTP (↑): Measures the alignment between ground truth and predicted results over all matches, calculated as:

MOTP = 1 - (Σ_{t,i} d_{t,i}) / (Σ_t c_t)    (26)

where d_{t,i} is the localization error of the i-th match in frame t and c_t is the number of matches in frame t.
These metrics contribute to a comprehensive assessment of MOT algorithm perfor-
mance in various aspects, providing in-depth insights into system effectiveness.
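A toy computation of Equations (25) and (26) from sequence-level counts is shown below; the input numbers are invented solely to demonstrate the formulas.

```python
import numpy as np

def mota(fn, fp, idsw, num_gt):
    """Eq. (25): 1 - (FN + FP + IDSW) / GT, accumulated over the whole sequence."""
    return 1.0 - (fn + fp + idsw) / num_gt

def motp(dist_per_match, matches_per_frame):
    """Eq. (26): 1 - (sum of match distances) / (total number of matches)."""
    return 1.0 - np.sum(dist_per_match) / np.sum(matches_per_frame)

# Invented counts, for illustration only.
print(mota(fn=120, fp=45, idsw=8, num_gt=1000))                 # 0.827
print(motp(np.array([0.21, 0.18, 0.25]), np.array([1, 1, 1])))
```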
4.2. Training Preprocessing
Existing MOT methods integrating object detection and appearance embedding often
use a single-stage training approach, where detection and appearance branches are trained
simultaneously. While this reduces training time, it can harm detection performance due
to differing learning objectives. In densely populated scenes, fully occluded objects may
still have annotated bounding boxes in the training dataset, which can introduce errors
when learning appearance embeddings and can reduce tracking accuracy. To address
this, our proposed model filters highly occluded objects from the training samples before
commencing model training. To implement this, we initially define a metric variable
Boverlap ∈ [0, 1] to gauge the overlap between two ground truth Bboxes; the metric is
defined as follows:
B_overlap = area(BBox_GT^i ∩ BBox_GT^j) / area(BBox_GT^i ∪ BBox_GT^j)    (27)

where BBox_GT^i and BBox_GT^j represent the i-th and j-th ground truth BBoxes of the input
training samples, respectively. A higher value of the variable indicates greater overlap
between the two ground-truth BBoxes. In object detection, a value Boverlap ≥ 0.75 signifies
substantial overlap between two BBoxes. Therefore, in this study, we set the threshold at
Boverlap ≥ 0.75, considering smaller BBoxes as indicative of occluded objects and excluding
them from the training dataset. We ultimately train the model using the filtered dataset.
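The preprocessing rule above can be expressed as a short filter over the ground-truth annotations, sketched below; the 0.75 threshold follows the text, while the corner box format and the rule of dropping the smaller of two heavily overlapping boxes reflect the description rather than released code.

```python
def overlap_ratio(a, b):
    """B_overlap of Eq. (27) for boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def filter_occluded(gt_boxes, thr=0.75):
    """Drop the smaller box of any ground-truth pair with B_overlap >= thr."""
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    keep = set(range(len(gt_boxes)))
    for i in range(len(gt_boxes)):
        for j in range(i + 1, len(gt_boxes)):
            if overlap_ratio(gt_boxes[i], gt_boxes[j]) >= thr:
                keep.discard(i if area(gt_boxes[i]) < area(gt_boxes[j]) else j)
    return [gt_boxes[i] for i in sorted(keep)]

boxes = [(10, 10, 60, 80), (12, 12, 58, 78), (200, 50, 260, 140)]
print(filter_occluded(boxes))   # the nearly enclosed smaller box is removed
```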
4.3. Experimental Settings
The detector is initialized with pre-existing weights obtained from training on the
COCO dataset. We train the detector using SGD with the following parameters: 150 epochs,
a batch size of 16, a learning rate of 0.02, momentum set to 0.9, and decay set to 0.0001.
We train the detector on both the VisDrone and UAVDT datasets and perform validation
using the same set of verification images. We execute the testing on hardware (NVIDIA
RTX 4090 with 24 GB of memory) and calculate the average of the top-100 most reliable
detection results.
4.4. Comparative Experiments
4.4.1. Detection Comparison
To compare the performance of our detector, we select a total of seven excellent
detectors: DETR [17], Deformable DETR [24], YOLO-S [25], Swin-JDE [18], VitDet [26],
RTD-Net [27], and DN-DETR [28]. They are trained and evaluated on the VisDrone and
UAVDT datasets using the experimental settings described in their respective papers. DETR
completely discards traditional object detection components such as anchor boxes and non-
maximum suppression and utilizes a complete attention mechanism for end-to-end object
detection. Deformable DETR is an improved version of DETR that introduces deformable
attention to enhance the model’s adaptability to changes in object shape and scale. YOLO-S
employs a small feature extractor, skip connections, cascaded skip connections, and a
reshaping pass-through layer to facilitate cross-network feature reuse, combining low-
level positional information with more meaningful high-level information. The Swin-JDE
algorithm adopts a Swin transformer based on windowed self-attention as the backbone
network to enhance feature extraction capabilities. ViTDet utilizes ViT as the backbone for
a Mask R-CNN object detection model, enhancing competitiveness by optimizing the RPN
section. RTD-Net replaces positional linear projection with convolutional projection and
uses an efficient convolutional multi-head self-attention algorithm based on convolutional
transformer blocks to improve the recognition of occluded objects by extracting contextual
information. DN-DETR introduces a novel denoising training approach to address the
instability of bipartite graph matching in the DETR decoder during training, doubling the
convergence speed and significantly improving the detection results.
The comparative results in Table 1 demonstrate the substantial advantages of our
detection performance. AP is the average accuracy, and AP@0.5 and AP@0.75 indicate
intersection-to-union ratios greater than 50% and 75%, respectively. APs, APm, and APl
are the average accuracies for small objects (with an area less than 32 × 32 pixels), medium
objects (with an area between 32 × 32 and 96 × 96 pixels), and large objects (with an area
greater than 96 × 96 pixels), respectively. The visual comparison results in Figures 9 and 10
show that our results exhibit excellent performance under various lighting conditions and
crowded environments.
Figure 9. Comparison of detection results on the VisDrone dataset. Panels: (a) DETR, (b) Deformable DETR, (c) YOLOS, (d) Swin-JDE, (e) VitDet, (f) RTD-Net, (g) DN-DETR, (h) Holistic Trans-Det.
Figure 10. Comparison of detection results on the UAVDT dataset. Panels: (a) DETR, (b) Deformable DETR, (c) YOLOS, (d) Swin-JDE, (e) VitDet, (f) RTD-Net, (g) DN-DETR, (h) Holistic Trans-Det.
Table 1. The detection results of the detectors on the datasets.
Dataset Detector AP AP@0.5 AP@0.75 APs APm APl
VisDrone
DETR [17] 34.8 63.4 32.2 12.8 38.5 55.6
Deformable DETR [24] 36.9 60.4 35.2 9.9 38.1 52.7
YOLOS [25] 36.6 63.1 38.7 15.4 39.9 54.9
Swin-JDE [18] 38.2 60.5 34.8 11.1 41.4 57.6
VitDet [26] 38.9 64.7 38.7 19.6 40.5 57.8
RTD-Net [27] 38.1 64.6 40.2 17.6 42.8 57.6
DN-DETR [28] 39.4 63.4 36.5 16.8 42.5 59.2
Holistic Trans-Det 39.6 67.9 40.8 18.6 40.3 59.4
UAVDT
DETR [17] 48.8 69.3 49.3 28.0 47.5 57.1
Deformable DETR [24] 47.2 69.2 50.3 29.0 53.2 59.4
YOLOS [25] 49.3 71.1 51.4 32.3 50.4 58.9
Swin-JDE [18] 49.6 69.9 52.8 33.9 54.8 59.7
VitDet [26] 54.6 68.9 59.5 37.5 57.9 61.0
RTD-Net [27] 52.2 71.4 55.6 36.3 57.2 60.9
DN-DETR [28] 56.7 68.6 60.2 38.7 59.8 62.9
Holistic Trans-Det 57.5 69.0 60.5 38.8 61.5 67.9
4.4.2. Tracking Comparison
We compared DeepSORT [29], ByteTrack [30], BoT-SORT [31], UAVMOT [32], DC-
MOT [33], TFAM [34], MTTJDT [35], and SimpleTrack [36] as well as transformer-based meth-
ods including TransTrack [37], TrackFormer [38], TransCenter [39], MOTR [20], MeMOT [21],
GTR [40], TR-MOT [41], GCEVT [42], STN-Track [43], and STDFormer [19]. These compar-
isons were conducted on the VisDrone MOT and UAVDT datasets.
To ensure consistent comparisons despite variations in object distributions across
datasets, we employed the holistic trans-detector to produce uniform detection results
for all tracking comparison methods. This approach mitigates evaluation bias stemming
from uneven category distributions, fostering fairer and more reliable tracking method
comparisons. To maintain detection accuracy across categories during evaluation, distinct
thresholds were applied: 0.3 for cars, 0.1 for trucks, and 0.4 for pedestrians, with a lower
threshold of 0.05 for buses, which present greater visual variability.
Tables 2 and 3 comprehensively compare GAO-Tracker with other popular trackers
on the VisDrone MOT and UAVDT datasets. The evaluation includes critical metrics such
as MOTA, MOTP, IDF1, and IDSW and comparisons with other methods. GAO-Tracker
demonstrates excellent performance by effectively utilizing position and appearance in-
formation. DeepSORT associates categories independently using positional information.
ByteTrack utilizes low-scoring detection for similarity tracking and background noise
filtering. BoT-SORT incorporates camera motion compensation for improved matching.
UAVMOT enhances object feature association with an ID feature update module. Simple-
Track merges object embedding cosine and GIOU distances to create a new association
matrix. Transformer-based methods like TransTrack employ a query–key mechanism for
existing object tracking and new object detection. TrackFormer considers position, occlu-
sion, and object recognition features simultaneously. TransCenter predicts the association’s
heatmap of object centers globally. MOTR models the entire trajectory of an object using a
tracking query. MeMOT uses information from previous frames for tracking clues. GTR
extends the window length for matching and utilizes interaction information fully. TR-
MOT achieves reliable associations using visual temporal features. STDFormer utilizes the
transformer’s remote modeling capability for intent and decision information extraction.
However, these methods apply a single matching rule for all detection classes, leading to
inaccurate tracking of various object classes and poorer performance.
Table 2. Comparison between GAO-Tracker and the latest multiple trackers tested on the Vis-
Drone dataset.
Tracker MOTA↑ MOTP↑ IDF1 (%)↑ IDSW↓ MT (%)↑ ML (%)↑ FP↓ FN↓
Motion-based
DeepSORT [29] 19.4 69.8 33.1 6387 38.8 52.2 15,181 44,830
ByteTrack [30] 25.1 72.6 40.8 4590 42.8 50.3 10,722 24,376
BoT-SORT [31] 23.0 71.6 41.4 7014 51.9 73.6 10,701 47,922
UAVMOT [32] 25.0 72.3 40.5 6644 52.6 49.6 10,134 55,630
DCMOT [33] 33.5 76.1 45.5 1139 - - 12,594 64,856
TFAM [34] 30.9 74.4 42.7 3998 - - 27,732 126,811
MTTJDT [35] 31.2 73.2 43.6 2415 - - 25,976 183,381
Transformer-based
TransTrack [37] 27.3 62.1 28.3 2523 33.5 59.7 15,028 51,396
TrackFormer [38] 24 77.3 38 4724 39 46.3 11,731 32,807
TransCenter [39] 29.9 66.6 46.8 3446 33.4 61.8 15,104 20,894
MOTR [20] 13.1 72.4 47.1 2997 52.9 72 12,216 42,186
MeMOT [21] 29.4 73 48.7 3755 46.7 47.9 9963 30,062
GTR [40] 28.1 76.8 54.5 2000 61.3 57.6 8165 10,553
TR-MOT [41] 29.9 64.3 46 1005 42.8 59.9 7593 17,352
GCEVT [42] 34.5 73.8 50.6 841 520 612 - -
STN-Track [43] 38.6 - 73.7 668 31.4 51.2 7385 76,006
STDFormer [19] 35.9 74.5 59.9 1441 52.7 60.3 8527 20,558
GAO-Tracker 38.8 76.3 54.3 972 55.9 52.4 6883 10,204
Table 3. Comparison between GAO-Tracker and the latest multiple trackers tested on the UAVDT
dataset.
Tracker MOTA↑ MOTP↑ IDF1 (%)↑ IDSW↓ MT (%)↑ ML (%)↑ FP↓ FN↓
Motion-based
DeepSORT [29] 35.9 71.5 58.3 698 43.4 25.7 50,513 59,733
ByteTrack [30] 39.1 74.3 44.7 2341 43.8 28.1 14,468 87,485
BoT-SORT [31] 37.2 72.1 53.1 1692 40.8 27.3 42,286 64,494
UAVMOT [32] 43.0 73.5 61.5 641 45.3 22.7 27,832 65,467
SimpleTrack [36] 45.3 73.9 57.1 1404 43.6 22.5 21,153 53,448
TFAM [34] 47.0 72.9 67.8 506 - - 68,282 111,959
Transformer-based
TransTrack [37] 33.2 72.4 67.6 1122 38.9 23.8 50,746 54,938
TrackFormer [38] 53.4 74.2 46.3 2247 43.7 23.3 13,719 91,061
TransCenter [39] 48.9 73.9 51.3 2287 32.6 35.1 27,995 93,013
MOTR [20] 35.6 72.5 56.1 1759 39.8 29.3 39,733 56,368
MeMOT [21] 45.6 74.6 62.8 2118 34.9 26.5 38,933 59,156
GTR [40] 46.5 75.3 61.1 1482 42.7 18.6 21,676 52,617
TR-MOT [41] 57.7 74.1 55.7 2461 33.9 21.3 32,217 50,838
GCEVT [42] 47.6 73.4 68.6 1801 618 363 - -
STN-Track [43] 60.6 - 73.1 1420 57.0 17.0 12,825 61,760
STDFormer [19] 60.6 74.8 61.7 1642 44.6 20.3 20,258 41,895
GAO-Tracker 61.7 75.2 67.9 1216 45.3 24.6 24,915 59,640
Combining the data from Tables 2 and 3, we observe that transformer-based methods outperform motion-based methods. This trend reflects the effectiveness of transformer-based methods for multi-object tracking in drone aerial videos: they better capture long-distance dependencies between objects in complex environments and better handle challenges such as object occlusion and scale changes.
Figures 11 and 12 show time-order frames with bounding boxes and different-colored
identities. In the initial images (left), bounding boxes may appear inconsistent due to occlu-
sion. However, in the final images (right), GAO-Tracker maintains consistent bounding
boxes, reducing the identity switching of pedestrians. The center images show intermediate
steps where identities might temporarily switch due to occlusions or overlaps. The final
images (right) demonstrate GAO-Tracker’s ability to preserve identities throughout the se-
quence, even in crowded scenarios. By utilizing object motion information, GAO-Tracker’s
trajectory association technology effectively solves the problems of missed detection and
incorrect detection caused by occlusion, especially in the case of short-term overlapping
objects. Compared with previous algorithms based on bounding box connections, GAO-
Tracker reduces pedestrian identity switching. The results indicate that GAO-Tracker
performs well in crowded scenarios of drone aerial videos and ensures consistent bounding
boxes and identities throughout the entire sequence.
Figure 11. Tracking results of GAO-Tracker on the VisDrone dataset.
Figure 12. Tracking results of GAO-Tracker on the UAVDT dataset.
4.5. Ablation Experiments
To demonstrate the effectiveness of the designed method, we conducted multiple sets
of ablation experiments on training preprocessing strategies, the GAO module, the sequence
of various matching strategies, and VGM-PHD on the VisDrone and UAVDT datasets.
4.5.1. Effect of Backbone
To validate the effectiveness of our holistic trans as the backbone network, we com-
pared it with ResNet50, DLA-34, ViT, and Swin-L and conducted ablation experiments.
Table 4 presents the performance evaluation results of the proposed GAO-Tracker combined
with different backbone networks. This experiment used the proposed data association
method as the post-processing module and evaluated the UAVDT and VisDrone test
datasets. Based on the results in Table 4, we have the following findings: In the evaluation
results of UAVDT, using DLA-34 as the backbone network yielded the best performance,
with MOTA, MOTP, and IDF1 scores reaching 61.9%, 75.1%, and 66.4%, respectively. Ad-
ditionally, using the holistic trans backbone network resulted in the lowest IDSW count.
In the evaluation results of VisDrone, compared to ResNet50, DLA-34, ViT, and Swin-L,
using the holistic trans backbone network achieved 38.8% MOTA, 76.3% MOTP, and 54.3%
IDF1 and a significant reduction in FP. Since VisDrone contains many congested scenes, the
experimental results indicate that using the holistic trans backbone network can improve
MOT performance in crowded scenarios. The tracking performance using the DLA-34 back-
bone network was the best on UAVDT but was significantly worse on VisDrone. In contrast,
using the holistic trans backbone network resulted in inferior tracking performance on
UAVDT but the best performance on VisDrone. The MOTA increase and the FP decrease
using the holistic trans backbone network indicate that our model significantly enhances
the detection capability of correct objects.
Table 4. Performance evaluation of the proposed GAO-Tracker model combined with different
backbone networks.
Dataset Detector Backbone MOTA↑ MOTP↑ IDF1 (%)↑ IDSW↓ MT (%)↑ ML (%)↑ FP↓ FN↓
VisDrone
ResNet-50 19.6 59.9 36.7 4287 35.3 31.3 9078 18,764
DLA-34 34.9 68.5 50.3 2198 46.3 43.5 8818 13,070
ViT 35.2 69.7 51.0 2019 48.9 45.9 8009 12,897
Swin-L 35.5 70.2 52.3 1509 51.9 47.6 6832 12,223
Holistic Trans 38.8 76.3 54.3 972 55.9 52.4 6883 10,204
UAVDT
ResNet-50 56.2 70.3 62.1 2252 40.4 22.6 32,743 72,629
DLA-34 61.9 75.1 66.4 1798 42.4 23.4 28,705 65,616
ViT 60.1 74.0 65.9 1504 42.8 23.7 26,937 62,348
Swin-L 59.6 74.4 66.0 1264 43.9 23.8 25,822 61,324
Holistic Trans 61.7 75.2 67.9 1216 45.3 24.6 24,915 59,640
Based on the observations above, it can be concluded that the backbone network signif-
icantly impacts the tracking performance of multi-object trackers depending on the density
of tracking objects in the scene. Therefore, improving the feature extraction capability
of the backbone network model is a crucial factor affecting the tracking performance of
multi-object trackers.
4.5.2. Impact of Pre-Processing and Detection Results Classification
During training of the multi-object tracking model, we removed highly overlapped objects
from the training set so that the network learns efficient and accurate appearance
embeddings for trajectory matching. We also explored the impact of classifying detection
boxes into high- and low-scoring sets. As shown in Table 5, we verified the effectiveness
of each component by enabling or disabling the training-set optimization and the
detection-score grading. “Pre” denotes training after removing highly overlapped objects,
while “Grade” denotes that the model distinguishes high- and low-scoring detection boxes
before passing them to the GAO trajectory association pattern.
Table 5. Comparison between detection and classification with or without preprocessing.
Dataset Method MOTA↑ MOTP↑ IDF1 (%)↑ IDSW↓ MT (%)↑ ML (%)↑ FP↓ FN↓
VisDrone
Baseline 36.2 70.9 52.5 1344 53.1 49.3 9117 11,987
B+Pre 37.6 71.2 52.8 1320 54.3 50.1 9135 11,499
B+Grade 37.3 74.2 52.7 1138 54.7 51.2 9627 11,060
B+Pre+Grade 38.8 76.3 54.3 972 55.9 52.4 6883 10,204
UAVDT
Baseline 57.8 72.0 64.0 1841 42.4 23.3 29,057 67,373
B+Pre 59.3 74.4 65.6 1398 43.8 23.8 25,836 62,429
B+Grade 60.4 74.7 66.1 1221 44.5 23.9 25,418 60,828
B+Pre+Grade 61.7 75.2 67.9 1216 45.3 24.6 24,915 59,640
The results indicate that removing ground-truth BBox annotations for occluded objects
reduces errors in learning appearance embeddings, thereby improving the accuracy of
tracked-object identification. Differentiating between low- and high-scoring detection
boxes effectively reduces trajectory fragmentation and IDSW, enhancing tracking
performance. With both preprocessing and detection-result classification, MOTA, MOTP,
and IDF1 improved by 2.6%, 5.4%, and 1.8% on VisDrone and by 3.9%, 3.2%, and 3.9% on
UAVDT, respectively.
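To make the “Pre” and “Grade” steps concrete, the sketch below shows one way they could be implemented. The IoU-overlap threshold used to discard occluded ground-truth boxes is an assumed value, the 0.5/0.2 confidence thresholds mirror those used by the GAO matching pattern in Section 3.2, and all function names are illustrative rather than taken from our implementation.

```python
import numpy as np

def iou_matrix(boxes_a: np.ndarray, boxes_b: np.ndarray) -> np.ndarray:
    """Pairwise IoU between two sets of [x1, y1, x2, y2] boxes."""
    x1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def drop_heavily_overlapped(gt_boxes: np.ndarray, overlap_thr: float = 0.7) -> np.ndarray:
    """'Pre': discard ground-truth boxes whose mutual IoU exceeds a threshold so
    occluded objects do not corrupt the appearance-embedding targets.
    The 0.7 threshold is an assumed value, not one reported in this paper."""
    iou = iou_matrix(gt_boxes, gt_boxes)
    np.fill_diagonal(iou, 0.0)
    return gt_boxes[iou.max(axis=1) < overlap_thr]

def grade_detections(boxes: np.ndarray, scores: np.ndarray,
                     high_thr: float = 0.5, low_thr: float = 0.2):
    """'Grade': split detections into high- and low-confidence sets using the
    0.5 / 0.2 thresholds of the GAO matching pattern."""
    high = boxes[scores > high_thr]
    low = boxes[(scores >= low_thr) & (scores <= high_thr)]
    return high, low
```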
4.5.3. Impact of Matching Strategies
We validated the individual contributions of each component by combining different
association strategies, as shown in Table 6. The baseline uses IOU matching for all
associations; we then progressively substitute Appear-IOU, Gau-IOU, and OSPA-IOU for the
corresponding matching stages. The results indicate that all three proposed association strategies effectively
enhance the accuracy of tracking associations. The baseline model shows significantly
higher FP and more IDSW, indicating a higher number of false detections introduced by
the model, resulting in poor trajectory matching quality and increased identity switching.
After replacing the high-score detection box matching strategy with Appear-IOU, MOTA and
IDF1 improved noticeably; FP rose slightly, but FN was substantially reduced. After
replacing the low-score detection box matching strategy with Gau-IOU, MOTA and MOTP
improved significantly while IDSW dropped substantially, demonstrating the benefit of
matching small, low-score detection boxes in Gaussian space. Finally, substituting the
OSPA-IOU distance for object-to-trajectory matching, the remaining high-score detection
boxes are treated as a set and matched against the set of unmatched trajectories, which
improves all metrics. Together, these results indicate that each of the proposed
strategies contributes to better overall tracking performance.
Table 6. Comparison of different association strategies.
Dataset Method MOTA↑ MOTP↑ IDF1 (%)↑ IDSW↓ MT (%)↑ ML (%)↑ FP↓ FN↓
VisDrone
Baseline 36.2 70.9 52.5 1552 53.1 49.3 9117 11,987
B+Appear-IOU 37.1 74.4 53.2 1334 53.2 49.2 9209 11,027
B+Appear-IOU+Gau-IOU 38.3 75.6 53.9 1052 53.8 49.9 7343 10,946
B+Appear-IOU+Gau-IOU+OSPA-IOU 38.8 76.3 54.3 972 55.9 52.4 6883 10,204
UAVDT
Baseline 57.8 72.0 64.0 1841 42.4 23.3 29,057 67,373
B+Appear-IOU 58.2 73.8 64.9 1536 43.0 23.7 29,133 63,781
B+Appear-IOU+Gau-IOU 60.9 74.9 66.3 1297 45.0 24.0 25,011 60,369
B+Appear-IOU+Gau-IOU+OSPA-IOU 61.7 75.2 67.9 1216 45.3 24.6 24,915 59,640
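The cascade evaluated in Table 6 can be summarized as three successive Hungarian assignments over different cost matrices. The sketch below captures that control flow under stated assumptions: the appear_iou_cost, gau_iou_cost, and ospa_iou_cost callables stand in for the Appear-IOU, Gau-IOU, and OSPA-IOU distances defined in Section 3.2, and the gating threshold of 0.8 is illustrative rather than a value reported in the paper.

```python
from typing import Callable, Sequence

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_stage(tracks: Sequence, dets: Sequence,
                cost_fn: Callable[[object, object], float], gate: float):
    """One association stage: build a cost matrix, solve it with the Hungarian
    algorithm, and reject pairs whose cost exceeds the gating threshold."""
    if len(tracks) == 0 or len(dets) == 0:
        return [], list(range(len(tracks))), list(range(len(dets)))
    cost = np.array([[cost_fn(t, d) for d in dets] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= gate]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_tracks = [i for i in range(len(tracks)) if i not in matched_t]
    unmatched_dets = [j for j in range(len(dets)) if j not in matched_d]
    return matches, unmatched_tracks, unmatched_dets

def gao_cascade(active_tracks, inactive_tracks, high_dets, low_dets,
                appear_iou_cost, gau_iou_cost, ospa_iou_cost, gate=0.8):
    """Three-stage association in the spirit of the GAO pattern; the cost
    callables stand in for the Appear-IOU, Gau-IOU, and OSPA-IOU distances."""
    # Stage 1: active trajectories vs. high-confidence detections.
    m1, um_trk, um_high = match_stage(active_tracks, high_dets, appear_iou_cost, gate)
    # Stage 2: leftover trajectories vs. low-confidence detections in Gaussian space.
    leftover_tracks = [active_tracks[i] for i in um_trk]
    m2, _, _ = match_stage(leftover_tracks, low_dets, gau_iou_cost, gate)
    # Stage 3: inactive trajectories vs. leftover high-confidence detections.
    leftover_high = [high_dets[j] for j in um_high]
    m3, _, new_candidates = match_stage(inactive_tracks, leftover_high, ospa_iou_cost, gate)
    return m1, m2, m3, new_candidates  # unmatched high-score boxes seed new tracks
```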
4.5.4. Impact of VGM-PHD
We designed ablation experiments to validate the effectiveness of the VGM-PHD
method. We compared this method against no trajectory prediction and the use of a Kalman
filter. The results, presented in Table 7, indicate that VGM-PHD achieves higher
prediction accuracy and robustness than both no trajectory prediction and the Kalman
filter across multiple scenarios. In complex environments in particular, the proposed
method overcomes the limitations of the traditional approaches and predicts the future
positions of moving objects more accurately. Moreover, the decrease in IDSW and the
increase in IDF1 indicate more stable trajectory tracking; consequently, overall tracking
performance is enhanced.
Table 7. Comparison with and without trajectory prediction.
Dataset Method MOTA↑ MOTP↑ IDF1 (%)↑ IDSW↓ MT (%)↑ ML (%)↑ FP↓ FN↓
VisDrone
No trajectory prediction 29.9 64.4 49.3 2497 42.8 42.8 8719 15,226
Kalman Filter 35.3 69.9 50.6 1727 51.4 47.5 8998 12,302
VGM-PHD 38.8 76.3 54.3 972 55.9 52.4 6883 10,204
UAVDT
No trajectory prediction 43.1 61.2 46.4 4437 32.3 17.4 49,018 99,620
Kalman Filter 52.8 68.0 56.1 3069 37.4 21.0 38,389 82,471
VGM-PHD 61.7 75.2 67.9 1216 45.3 24.6 24,915 59,640
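For context on the comparison in Table 7, the sketch below shows a generic prediction step under a constant-velocity motion model: with a single Gaussian it reduces to the Kalman filter baseline, while propagating a weighted mixture of Gaussians with a survival probability corresponds to the prediction stage of a GM-PHD-style filter. This is an illustrative simplification rather than our VGM-PHD implementation, and the process-noise and survival-probability values are assumptions.

```python
import numpy as np

def constant_velocity_model(dt: float = 1.0):
    """State = [cx, cy, vx, vy]; F is the transition matrix and Q an assumed
    process-noise covariance for a constant-velocity motion model."""
    F = np.array([[1.0, 0.0, dt, 0.0],
                  [0.0, 1.0, 0.0, dt],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    Q = 0.01 * np.eye(4)
    return F, Q

def predict_components(means, covs, weights, p_survival: float = 0.99):
    """GM-PHD-style prediction: every Gaussian component is propagated through
    the motion model and its weight is discounted by the survival probability.
    With a single component (and the weight ignored), this reduces to the
    prediction step of the Kalman filter baseline in Table 7."""
    F, Q = constant_velocity_model()
    new_means = [F @ m for m in means]
    new_covs = [F @ P @ F.T + Q for P in covs]
    new_weights = [p_survival * w for w in weights]
    return new_means, new_covs, new_weights
```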
5. Discussion
In this paper, we have integrated the strengths of joint detection and visual multi-
object tracking algorithms with transformer-based visual multi-object tracking algorithms
to address the unique challenges posed by drone aerial videos. Our proposed GAO-Tracker,
which models object motion information, has demonstrated significant improvements in
tracking performance, particularly in complex real-world scenarios.
5.1. Performance Analysis
GAO-Tracker’s performance on the VisDrone and UAVDT datasets has shown remark-
able results, surpassing existing state-of-the-art methods in terms of both accuracy and
robustness. The integration of the transformer model’s global context capturing capabilities
with the joint detection and tracking methods’ handling of occlusions and scale variations
has proven effective. The results indicate that our approach can maintain high tracking
accuracy even in challenging environments that are characterized by rapid object motion,
complex backgrounds, and varying object scales.
5.2. Strengths
(1) Enhanced accuracy: By leveraging the transformer model’s self-attention mech-
anism, GAO-Tracker effectively captures long-range dependencies and global contexts,
which are crucial for accurately tracking multiple objects in aerial videos.
(2) Robustness to occlusions and scale variations: The joint detection and tracking
methods integrated into GAO-Tracker enable it to handle occlusions and significant scale
variations efficiently, ensuring continuous and reliable tracking.
(3) Practical solutions: GAO-Tracker provides practical solutions to real-world multi-
object tracking problems, making it highly applicable in various domains such as surveil-
lance, rescue operations, and urban planning.
5.3. Limitations
(1) Computational complexity: Despite its accuracy and robustness, the computational
expense associated with the transformer model’s fine-grained self-attention mechanism
remains a challenge. This could potentially limit the real-time applicability of GAO-Tracker
in resource-constrained environments.
(2) Scalability: While GAO-Tracker performs well on benchmark datasets, its scala-
bility to handle extremely large-scale datasets or highly crowded scenes requires further
exploration and optimization.
5.4. Future Directions
To further enhance the performance and applicability of GAO-Tracker, several future
research directions are proposed:
(1) Algorithm optimization: Efforts will focus on optimizing the algorithm to reduce
computational complexity and improve real-time performance. This includes exploring
more efficient implementations of the transformer model and refining the integration with
detection and tracking components.
(2) Broader application areas: Extending the research findings to benefit more diverse
fields is a key future direction. Improvements in drone-based multi-object tracking can be
adapted for use in autonomous driving, security systems, wildlife monitoring, and other
domains requiring accurate and reliable tracking of multiple objects.
(3) Handling complex scenarios: Further research is needed to enhance GAO-Tracker’s
performance in highly dynamic and crowded environments. This includes developing
methods to better handle dense object interactions and rapidly changing scenes.
(4) Long-term tracking: Enhancing the system’s ability to maintain long-term tracking
stability and accuracy, particularly in scenarios with frequent object disappearances and
reappearances, is another important area for future work.
6. Conclusions
This paper aims to integrate the strengths of joint detection and visual multi-object
tracking algorithms with transformer-based visual multi-object tracking algorithms to
improve the performance of multi-object tracking in drone aerial videos. Additionally,
we propose a more comprehensive, robust, and efficient integrated multi-object tracking
algorithm by modeling object motion information.
By leveraging the advanced capabilities of transformer models to capture global
contexts and the strengths of joint detection and tracking methods in handling occlusions
and scale variations, our approach addresses the unique challenges posed by drone aerial
videos, such as rapid object motion, complex environmental conditions, and data noise. This
integration allows for more accurate and reliable tracking of multiple objects, enhancing the
overall performance and robustness of tracking systems in various real-world scenarios.
A series of novel results have been achieved in the drone aerial multi-object tracking
field, with GAO-Tracker demonstrating excellent results on the VisDrone and UAVDT
datasets. These datasets, which are widely used benchmarks in the field, have shown that
our method significantly outperforms existing state-of-the-art methods in terms of both
accuracy and robustness. This indicates GAO-Tracker’s strong potential for practical appli-
cations in surveillance, rescue operations, agriculture, and urban planning, among others.
The practical solutions provided by GAO-Tracker to multi-object tracking problems
in real-world scenarios offer new ideas and methods for the development of drone visual
tracking. Our approach not only contributes to the current body of knowledge but also
paves the way for future research in this area. In the future, efforts will focus on improving
and optimizing algorithms to enhance multi-object tracking performance further. This
includes refining the integration of detection and tracking components, enhancing the effi-
ciency of the transformer model, and exploring new ways to handle challenging scenarios
such as crowded environments and dynamic backgrounds.
Additionally, endeavors will be made to extend research findings to broader appli-
cation areas to benefit more diverse fields. For instance, improvements in drone-based
multi-object tracking can be adapted for use in autonomous driving, security systems,
wildlife monitoring, and other areas where real-time, accurate tracking of multiple objects
is critical. By expanding the applicability of our research, we aim to contribute to the
advancement of technology across various domains, ultimately enhancing the capabilities
and reliability of multi-object tracking systems.
Author Contributions: Conceptualization, Y.Y.; methodology, Y.Y., Y.W., and Y.L.; software, Y.P.
and Y.L.; formal analysis, Y.W.; investigation, Y.W.; resources, Y.P.; data curation, L.Z., Y.P., and
Y.L.; writing—original draft, Y.Y.; writing—review and editing, Y.W. and L.Z.; visualization, Y.L.;
supervision, L.Z.; project administration, Y.P. All authors have read and agreed to the published
version of the manuscript.
Funding: This work was supported by Funding for Outstanding Doctoral Dissertation in NUAA
under grant BCXJ24-10, the Postgraduate Research and Practice Innovation Program of Jiangsu
Province under grant KYCX24_0583, the National Natural Science Foundation of China under grant
61573183, and the Natural Science Foundation of Shaanxi Province of China under grant 2024JC-
YBQN-0695.
Data Availability Statement: The data presented in this study are available on request from the
corresponding author.
Conflicts of Interest: The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
MOT Multiple object tracking
GAO Gaussian, appearance, and optimal subpattern assignment
IOU Intersection over union
OSPA Optimal subpattern assignment
VGM-PHD Visual Gaussian mixture probability hypothesis density
MOTA Multiple object tracking accuracy
MOTP Multiple object tracking precision
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.

More Related Content

PDF
Survey Multiple Object Tracking Survey Paper
PDF
Real-time object detection and video monitoring in Drone System
PDF
A Literature Review on Vehicle Detection and Tracking in Aerial Image Sequenc...
PDF
RuiLi_CVVT2016
PDF
Person Detection in Maritime Search And Rescue Operations
PDF
Person Detection in Maritime Search And Rescue Operations
PPTX
sensor fusion presentation iit kanpur ashish
PDF
MULTIPLE OBJECTS AND ROAD DETECTION IN UNMANNED AERIAL VEHICLE
Survey Multiple Object Tracking Survey Paper
Real-time object detection and video monitoring in Drone System
A Literature Review on Vehicle Detection and Tracking in Aerial Image Sequenc...
RuiLi_CVVT2016
Person Detection in Maritime Search And Rescue Operations
Person Detection in Maritime Search And Rescue Operations
sensor fusion presentation iit kanpur ashish
MULTIPLE OBJECTS AND ROAD DETECTION IN UNMANNED AERIAL VEHICLE

Similar to Multiple Object Tracking in Drone Aerial Videos by a Holistic Transformer and Multiple Feature Trajectory Matching Pattern (20)

PDF
IRJET- Object Detection in Real Time using AI and Deep Learning
PDF
Object Detection in UAVs
PDF
Vot presentation
PPTX
[20240603_LabSeminar_Huy]TransMOT: Spatial-Temporal Graph Transformer for Mul...
PPTX
Video Multi-Object Tracking using Deep Learning
PDF
Object Detection and Tracking AI Robot
PDF
ReconTraj4Drones: A Framework for the Reconstruction and Semantic Modeling of...
PDF
IRJET- A Survey on Object Detection using Deep Learning Techniques
PDF
3d object detection and recognition : a review
PDF
[DL輪読会]Tracking Objects as Points
PDF
Object tracking final
PDF
Object tracking presentation
PDF
O180305103105
PDF
ObjectDetectionUsingMachineLearningandNeuralNetworks.pdf
PDF
Crowd Counting from UAVs (ECCV2020)
PDF
Deep sort and sort paper introduce presentation
PDF
IRJET- Comparative Analysis of Video Processing Object Detection
PDF
IRJET- Design the Surveillance Algorithm and Motion Detection of Objects for ...
PDF
Design of an effective multiple objects tracking framework for dynamic video ...
PPTX
Multiple Object Tracking
IRJET- Object Detection in Real Time using AI and Deep Learning
Object Detection in UAVs
Vot presentation
[20240603_LabSeminar_Huy]TransMOT: Spatial-Temporal Graph Transformer for Mul...
Video Multi-Object Tracking using Deep Learning
Object Detection and Tracking AI Robot
ReconTraj4Drones: A Framework for the Reconstruction and Semantic Modeling of...
IRJET- A Survey on Object Detection using Deep Learning Techniques
3d object detection and recognition : a review
[DL輪読会]Tracking Objects as Points
Object tracking final
Object tracking presentation
O180305103105
ObjectDetectionUsingMachineLearningandNeuralNetworks.pdf
Crowd Counting from UAVs (ECCV2020)
Deep sort and sort paper introduce presentation
IRJET- Comparative Analysis of Video Processing Object Detection
IRJET- Design the Surveillance Algorithm and Motion Detection of Objects for ...
Design of an effective multiple objects tracking framework for dynamic video ...
Multiple Object Tracking
Ad

Recently uploaded (20)

PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
DOCX
573137875-Attendance-Management-System-original
PPTX
Construction Project Organization Group 2.pptx
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PDF
composite construction of structures.pdf
PPTX
Sustainable Sites - Green Building Construction
PPTX
additive manufacturing of ss316l using mig welding
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
PDF
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PPTX
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PDF
Digital Logic Computer Design lecture notes
PPT
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
PDF
Well-logging-methods_new................
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PPTX
Internet of Things (IOT) - A guide to understanding
PPTX
UNIT 4 Total Quality Management .pptx
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
573137875-Attendance-Management-System-original
Construction Project Organization Group 2.pptx
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
composite construction of structures.pdf
Sustainable Sites - Green Building Construction
additive manufacturing of ss316l using mig welding
Model Code of Practice - Construction Work - 21102022 .pdf
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
CYBER-CRIMES AND SECURITY A guide to understanding
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
Digital Logic Computer Design lecture notes
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
Well-logging-methods_new................
Operating System & Kernel Study Guide-1 - converted.pdf
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
Internet of Things (IOT) - A guide to understanding
UNIT 4 Total Quality Management .pptx
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
Ad

Multiple Object Tracking in Drone Aerial Videos by a Holistic Transformer and Multiple Feature Trajectory Matching Pattern

  • 1. Citation: Yuan, Y.; Wu, Y.; Zhao, L.; Pang, Y.; Liu, Y. Multiple Object Tracking in Drone Aerial Videos by a Holistic Transformer and Multiple Feature Trajectory Matching Pattern. Drones 2024, 8, 349. https://doi.org/ 10.3390/drones8080349 Academic Editor: Xiwang Dong Received: 23 June 2024 Revised: 22 July 2024 Accepted: 26 July 2024 Published: 28 July 2024 Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/). drones Article Multiple Object Tracking in Drone Aerial Videos by a Holistic Transformer and Multiple Feature Trajectory Matching Pattern Yubin Yuan , Yiquan Wu *, Langyue Zhao, Yaxuan Pang and Yuqi Liu College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China; harley_yuan@nuaa.edu.cn (Y.Y.); zlangyue@nuaa.edu.cn (L.Z.); hins_pang@nuaa.edu.cn(Y.P.); tolyuqi@nuaa.edu.cn (Y.L.) * Correspondence: imagestrong@nuaa.edu.cn; Tel.: +86-137-7666-7415 Abstract: Drone aerial videos have immense potential in surveillance, rescue, agriculture, and urban planning. However, accurately tracking multiple objects in drone aerial videos faces challenges like occlusion, scale variations, and rapid motion. Current joint detection and tracking methods often compromise accuracy. We propose a drone multiple object tracking algorithm based on a holistic transformer and multiple feature trajectory matching pattern to overcome these challenges. The holistic transformer captures local and global interaction information, providing precise detection and appearance features for tracking. The tracker includes three components: preprocessing, trajectory prediction, and matching. Preprocessing categorizes detection boxes based on scores, with each category adopting specific matching rules. Trajectory prediction employs the visual Gaussian mixture probability hypothesis density method to integrate visual detection results to forecast object motion accurately. The multiple feature pattern introduces Gaussian, Appearance, and Optimal subpattern assignment distances for different detection box types (GAO trajectory matching pattern) in the data association process, enhancing tracking robustness. We perform comparative validations on the vision-meets-drone (VisDrone) and the unmanned aerial vehicle benchmarks; the object detection and tracking (UAVDT) datasets affirm the algorithm’s effectiveness: it obtained 38.8% and 61.7% MOTA, respectively. Its potential for seamless integration into practical engineering applications offers enhanced situational awareness and operational efficiency in drone-based missions. Keywords: multiple object tracking; transformer; detection confidence; multiple feature matching 1. Introduction In recent years, with the rapid development of drone technology, drone aerial videos have become an effective means of acquiring high-resolution, wide-coverage areas and hold significant potential in various applications such as surveillance, rescue operations, agriculture, and urban planning [1]. Drone aerial videos capture a wide range of object categories, including human activities, vehicles, buildings and infrastructure, and natural environments, among others, providing rich data that endow drones with the capability to monitor and track various objects in different application scenarios. 
In this context, multiple object tracking (MOT) has become particularly important for processing drone aerial videos, allowing systems to track and monitor multiple objects, thus enabling a more comprehensive range of applications such as object tracking, behavior analysis, and environmental monitoring. However, multi-object tracking in drone aerial videos faces numerous challenges, including object occlusion, object variations at different scales, rapid object motion, complex environmental conditions, and data noise. Traditional multi-object tracking methods have limitations in addressing these issues, thus requiring more advanced techniques to enhance tracking performance [2]. The majority of multi-object tracking methods for drone aerial videos are based on detection. These methods initially identify objects in each frame using object detection Drones 2024, 8, 349. https://doi.org/10.3390/drones8080349 https://www.mdpi.com/journal/drones
  • 2. Drones 2024, 8, 349 2 of 27 algorithms, then employ data association, motion estimation, and filter updating to resolve occlusion and scale variations. Long-term tracking may involve object re-identification to handle object loss. Maintaining object trajectory information and conducting analyses improves the system’s robustness in complex environments. Additionally, to further improve efficiency, some researchers synchronize detection and tracking, integrating both technologies to address the challenges posed by the wide variety and complex appearances of objects in drone aerial videos. Transformer models, known for their self-attention mechanism and parallel comput- ing capabilities, have revolutionized natural language processing and computer vision [3]. Their versatility extends from vision transformers to full transformer models, and they enable breakthroughs in tasks like image classification, object detection, and semantic segmentation; they even branch into action recognition, object tracking, and scene flow esti- mation. In drone aerial video analyses, transformers offer fresh perspectives for multi-object detection and tracking. Unlike convolutional neural networks, transformers emphasize global context interactions alongside local contexts, enhancing understanding of spatial relationships. However, the computational expense of fine-grained self-attention in high- resolution images poses challenges. Recent studies explore solutions like coarse-grained global or fine-grained local self-attention to alleviate the computational burden, albeit at the cost of simultaneously modeling short- and long-distance visual dependencies [4]. Given these challenges and the transformative potential of transformer models, we are motivated to explore and develop advanced multi-object tracking methods that leverage the strengths of transformers. Therefore, we propose a multi-object tracking method named GAO-Tracker, which is based on a holistic transformer and multiple feature trajectory matching pattern, to address various challenges in drone aerial videos. Our goal is to overcome the limitations of traditional approaches and enhance the performance and robustness of MOT in drone aerial videos, enabling more accurate and reliable tracking in diverse and complex environments. The remaining sections of this paper are organized as follows: Section 2, Related Work, reviews and discusses the latest advancements in the field of multiple object track- ing (MOT). We analyze current mainstream and cutting-edge technologies, including object-feature-based methods, joint detection and tracking methods, and transformer-based methods, providing a solid theoretical foundation and practical background for this re- search. Section 3, Methodology, details our proposed GAO-Tracker method for multi-object tracking. We delve into the core concepts, including the use of a holistic transformer and multiple feature trajectory matching pattern to address various challenges in drone aerial videos. We describe the model structure, algorithm workflow, and implementation details. Section 4, Experiments, presents extensive experiments and performance evalu- ations of GAO-Tracker. We test the method on several public datasets and compare it with state-of-the-art methods. The results demonstrate GAO-Tracker’s superior perfor- mance and robustness in complex scenarios. Section 5, Discussion, provides an in-depth analysis of the experimental results. 
We discuss GAO-Tracker’s performance in different scenarios, analyze its strengths and limitations, and suggest potential improvements and future research directions. Section 6, Conclusion, summarizes the main contributions and findings of this paper. We reiterate GAO-Tracker’s innovations in enhancing multi-object tracking performance in drone aerial videos and discuss its prospects and potential for practical applications. 2. Related Work This section aims to comprehensively review and discuss the latest research advance- ments in the field of multiple object tracking. By deeply analyzing current mainstream and cutting-edge technologies, we establish a solid theoretical foundation and practical back- ground for this study. First, we focus on the basic framework and challenges of multiple object tracking. Then, we detail several core methods: object-feature-based multi-object tracking methods, which achieve continuous tracking by extracting and utilizing the ap-
  • 3. Drones 2024, 8, 349 3 of 27 pearance, motion, and other feature information of objects; joint detection and tracking multi-object methods, which tightly integrate object detection and tracking tasks to en- hance the overall performance and efficiency of the system; and finally, transformer-based multi-object tracking methods, given the transformer model’s outstanding performance in sequence data processing. We explore how these methods utilize attention mechanisms to achieve precise and robust object tracking in complex scenarios. Through this review and analysis, we not only present the latest achievements in the MOT field but also highlight the current research gaps and shortcomings, leading to the research motivation and main contributions of this paper. Our goal is to provide new insights and solutions for the development of multi-object tracking technology. 2.1. Multiple Object Tracking Multi-object tracking is a highly regarded technology, and its wide range of applica- tions has attracted widespread interest among scholars. In the early stages of research, researchers primarily focused on applying optimization algorithms to derive object trajec- tories [5]. The IOUTracker, which relies solely on the bounding box intersection over union (IOU), was the simplest early multi-object tracking method [6]. Researchers gradually introduced motion models and Kalman filters to predict the positions of objects in the next frame [7]. Although these improvements made multi-object tracking algorithms faster and significantly improved their performance, the algorithms performed poorly in complex occlusion and object loss situations. To address these challenges, researchers introduced re-identification (ReID) features as appearance models, using visual features of objects be- tween different frames to match objects and improve the accuracy of associations between trajectories and detection results [8]. In addition to ReID, some studies have utilized image segmentation techniques to identify and track objects, thereby better handling occlusion sit- uations [9]. Furthermore, some researchers have begun to use recurrent neural networks or attention mechanisms to model the spatiotemporal relationships between objects, thereby improving tracking accuracy and stability. However, these methods often employ a single matching approach, neglecting the different characteristics of different types of objects. Moreover, introducing these different technological approaches into tracking systems can result in suboptimal tracking results, limiting effectiveness. 2.2. Object-Feature-Based Multi-Object Tracking Methods Benefiting from the rapid development of object detectors, object feature modeling has become widely used in multi-object tracking algorithms from the perspective of drones. It achieves multi-object tracking by capturing unique features of objects such as color, texture, and optical flow. These extracted features must be distinctive in order to discriminate different objects in the feature space effectively. Once these features are extracted, similarity criteria can be utilized to find the most similar objects in the next frame, thus enabling multi- object tracking. SCTrack adopts a three-stage data association method that combines object appearance models, spatial distances, and explicit occlusion handling units. The system relies on the motion patterns of tracked objects and considers environmental constraints, thus exhibiting good performance in handling occluded objects [10]. 
To address the issue of the subjective setting of fusion ratios between appearance and motion, which often merge appearance similarity and motion consistency in the latest frame, the appearance similarity between objects and surrounding objects is computed, object motion is predicted using Social LSTM networks, and weighted appearance similarity and motion predictions are used to generate associations between the current object and the object in the previous frame [11]. However, due to the significant increase in computational costs, false detections, drone aerial backgrounds, and other issues associated with handling large numbers of object detections and association computations, these methods need to overcome various challenges in maintaining accuracy while mitigating computational costs, false detections, object associations, and so on.
  • 4. Drones 2024, 8, 349 4 of 27 2.3. Joint Detection and Tracking Multi-Object Methods To enhance the computational speed of the entire drone aerial multi-object tracking system, researchers have actively explored methods that combine object detection and feature extraction to achieve greater sharing in computation. JDE was the first attempt at this approach and innovatively integrated the feature extraction branch into the single-stage detector YOLOv3 [12]. Conversely, Fairmont balanced the handling between detection and recognition tasks by adopting the anchor-free detector CenterNet to reduce anchor ambiguities [13]. In addition to these joint detection and feature embedding methods, several other single-stage trackers have emerged. GLOA designed global–local perception blocks to extract scale variance feature information from input frames. Adding identity embedding branches to the prediction heads outputs more discriminative identity informa- tion [14]. CenterTrack [15] and Chained Tracker [16], on the other hand, use multi-frame methods to predict bounding boxes in consecutive frames, facilitating efficient short-term associations that eventually form long-term object trajectories. However, it is essential to note that these technologies often generate many identity switches due to the difficulty of capturing long-term dependencies. Additionally, these methods cannot simultaneously consider multiple features of objects and differences in features among different categories, resulting in the easy loss of tracking for some small objects. 2.4. Transformer-Based Multi-Object Tracking Methods In recent years, transformer-based models have achieved significant success in the field of computer vision, primarily excelling in the domain of object detection. This has given rise to several transformer-based methods making strides in drone multi-object tracking. Some methods based on DETR [17] and its derivative models, such as TransTrack [18], TrackFormer [19], and MOTR [20], represent the front of online tracking and training progress in the field of MOT. Swin-JDE leverages transformers and comprehensively considers three factors—detection confidence, appearance embedding distance, and IoU distance—to match each trajectory and the detection information. Furthermore, MOTR achieves end-to-end object tracking by iteratively updating tracking queries, eliminating the need for complex post-processing steps. MeMOT [21], similar to MOTR, utilizes attention mechanisms to predict by focusing on object states. Despite pioneering new tracking paradigms, these methods still fall short of advanced tracking algorithms. While standard self-attention can capture fine-grained short- and long-distance interactions, executing attention on high-resolution feature maps incurs high computational costs, leading to explosive growth in time and memory costs. This paper addresses this issue through a holistic self-attention module. Therefore, we proposes a multi-object tracking method named GAO-Tracker based on a holistic transformer and multiple feature trajectory matching pattern to address various challenges in drone aerial videos. The effectiveness of the proposed method is validated through a series of experiments and quantitative analyses, and we compare it with excellent methods of the same kind and provide new insights and methods for multi-object tracking in drone applications. 
The main contributions are as follows: (1) A framework named GAO-Tracker, which integrates object detection and tracking in a joint detection and tracking framework for drone aerial videos, is proposed. The framework employs a holistic transformer as the core model for object detection and includes a GAO trajectory matching algorithm based on object features in drone aerial videos to achieve efficient and precise multi-object tracking. (2) The holistic transformer, which combines fine-grained local interactions and coarse- grained global interactions, is proposed. The framework includes an object detector holistic trans-detector using a joint anchor-free detection head to achieve accurate object detection in drone aerial videos. (3) A multi-object trajectory prediction and matching module named the GAO-trajectory matching pattern is proposed; it comprehensively considers the appearance features, mo- tion characteristics, and size features of objects and trajectories. It includes three matching
  • 5. Drones 2024, 8, 349 5 of 27 modes: Gaussian-IOU, Appear-IOU, and OSPA-IOU, fully exploiting various object and trajectory information to achieve robust tracking of multiple objects in drone aerial videos. (4) Using the prior information of the object’s position from the previous frame and combining it with object visual features, a visual Gaussian mixture probability hypothesis density (VGM-PHD) trajectory predictor tailored to the features of drone aerial videos is designed to provide accurate trajectory information for trajectory matching. 3. Methodology The proposed multi-object tracking system for drone aerial videos consists of the holis- tic trans-detector module and the GAO-trajectory matching pattern trajectory association module. The holistic trans-detector model is an anchor-free object detector and feature extraction module that integrates holistic self-attention, combining fine-grained local and coarse-grained global interactions. In this new mechanism, each token finely attends to its nearest surrounding tokens and coarsely attends to its distant surrounding tokens, effectively capturing short-term and long-term visual dependencies. The GAO-trajectory matching pattern trajectory association module handles the data association process by si- multaneously considering detection confidence, appearance embedding distance, and IOU distance, thereby enhancing the tracking robustness of the MOT model. The framework is illustrated in Figure 1. Figure 1. GAO-Tracker framework. 3.1. Holistic Trans-Detector: Object Detection and Feature Extraction In order to adapt to high-resolution visual tasks, high-resolution feature maps can be obtained in the early stages. The entire model adopts a hierarchical design consisting of four stages, each reducing the resolution of the input feature map and expanding the recep- tive field layer by layer, like a CNN. The framework is shown in Figure 2. At the beginning of the input, patch embedding is done, which cuts the image into individual blocks and embeds them into the embedding. Each stage is composed of multiple holistic transformer layers. The specific structure of the holistic transformer layer is shown in Figure 3; it is mainly composed of LayerNorm, MLP (multi-layer perceptron), and holistic attention.
  • 6. Drones 2024, 8, 349 6 of 27 Figure 2. Holistic trans-detector. Figure 3. Holistic transformer. An image with a resolution of H × W × 3 is first divided into blocks of size 4 × 4, resulting in H 4 × W 4 × (4 × 4 × 3) patches. Then, these patches are projected into features of dimension d using a convolutional layer for which the kernel size and stride are both equal to 4. Given this spatial feature map, it is passed through four stages of concatenated holistic transformer layers. In each stage, the holistic transformer block consists of 2, 2, 18, and 2 holistic transformer layers, respectively. The selected configuration aims to capture complex features at different levels of abstraction gradually. In the initial stage, there are two layers, each aimed at capturing low-level features. In the middle stage, 18 layers focus on learning high-level and complex features. In the final stage, two layers refine these features to achieve precise tracking. After each stage, a patch embedding layer is added to reduce the spatial dimensions of the feature map by half while doubling the feature dimension. Finally, the feature maps from all four stages are sent to the detection head, which simultaneously outputs appearance feature vectors of the objects for multi-object trajectory matching. Traditional transformer models face high computational and memory costs with large- scale input data due to the global self-attention mechanism, which considers all tokens in the input sequence. A holistic transformer addresses this by partitioning the input feature map into sub-windows and conducting attention operations on each sub-window, reducing computation and memory usage. For a feature map of size M × N for x ∈ RM×N×d, we first divide it into partitions of size 4 × 4, with each partition serving as a feature perception core in order to perform attention perception within a localized context. Then, we locate the surrounding context for each window instead of individual tokens. Sub-window pooling is a core component of a holistic transformer and divides the input feature map into smaller sub-windows, thereby reducing the number of tokens each attention operation needs to focus on. This segmentation and pooling transforms global attention operations into local operations, making the model more scalable and efficient. The process is illustrated in Figure 4.
  • 7. Drones 2024, 8, 349 7 of 27 Figure 4. Holistic self-attention. We initially partition the feature map into 4 × 4 grids. While the central 4 × 4 grid serves as the query window, we extract tokens at three granularity levels of 1 × 1, 2 × 2, and 4 × 4, respectively, from surrounding regions to serve as its keys and values. This results in tokens with dimensions of 8 × 8, 6 × 6, and 5 × 5. Ultimately, these tokens from the three levels are concatenated to compute the keys and values for the 4 × 4 = 16 tokens (queries) within the window. Suppose the input feature map is denoted as x ∈ RM×N×d, where M × N represents the spatial dimensions, and d represents the feature dimensions. Sub-window pooling is performed in parallel on the feature map at three levels l ∈ {1, 2, 4}, dividing the input feature map x into grids of size l × l for spatial sub-window pooling, followed by a simple linear layer f l p to perform spatial sub-window pooling, as shown in Equation (1). xl = f l p(x̂) ∈ R M l × N l ×d (1) where x̂ = Restructure(x) ∈ R( M l × N l ×d)×(l×l) . The pooled feature maps at different levels l provide rich fine-grained and coarse-grained information. 3.1.1. Attention Computation After obtaining the pooled feature maps at all levels, three linear projection layers fq, fk, and fv are used to compute the query for the first layer and the key and value for all layers, as shown in Equations (2)–(4). Q = fq xl (2) Kl = fk xl (3)
  • 8. Drones 2024, 8, 349 8 of 27 Vl = fv xl (4) To perform holistic self-attention, extracting surrounding tokens for each query token in the feature map is necessary. For the queries within the i-th window Qi ∈ Rsp×sp×d , keys Ki ∈ Rs×d and values Vi ∈ Rs×d are extracted from the surrounding Kl and Vl of the window, where l represents the size of the keys and values, and s is the sum of all holistic regions from all levels, i.e., s = 8 × 8 + 6 × 6 + 5 × 5. Finally, the holistic self-attention for Qi is computed as shown in Equation (5). Attention(Qi, Ki, Vi) = So f tmax QiKT i √ d + B ! Vi (5) where B = {Bl} is a learnable relative position bias. For the first layer, it is parameterized as Bl ∈ R7×7, while for other holistic levels, considering their different granularities towards queries, all queries within the window are treated equally. Bl ∈ Rsl r×sl r is then used to represent the relative position deviation tokens between the query window and each pooled sl r × sl r. The relative position deviation takes into account the positional relationships between different sub-windows. This allows the model to understand the dependencies between different positions better, thus enabling more accurate attention computation. The intro- duction of relative position deviation enhances the flexibility and expressive power of the model, enabling it to adapt better to different types of input data. Since the attention operations for each sub-window are independent, modern hard- ware and parallel computing frameworks can be leveraged to accelerate the model’s training and inference processes. 3.1.2. Detection Head We designed an anchor-free prediction head based on the CenterNet architecture and divided it into detection and appearance branches. Through holistic transformer feature extraction, the output feature map is provided to both branches for object detection and appearance embedding. The detection branch consists of three heads, which are used to predict the heatmap, the offset of the object’s center point, and the object’s size, respectively. The heatmap head is utilized to predict the center position of the object, with an output dimension of h × w × Cls, where h and w represent the height and width of the input feature map, and Cls is the number of detection classes. Each class has its own heatmap output, with each Gaussian peak in the heatmap representing the center position of the detected object. Assuming there are N objects in the current training sample, let ci x, ci y represent the center position of the i-th object in i ∈ [1, N]. Then, the heatmap corresponding to the current training sample is calculated as shown in Equation (6). Mxy = N ∑ i=1 exp  − (x − ⌊ci x 4 ⌋)2 + (y − ⌊ ci y 4 ⌋)2 2σ2 c   f (6) Here, the operator ⌊a⌋ returns the nearest and smallest integer to a, while σc is the standard deviation parameter. M ∈ Rh×w×Cls represents the output of the heatmap head, and Mxy serves as the value of M at position (x, y). The box size and center offset heads are used to predict the BBox and the offset of the object’s center point, respectively. Let BBoxi = xi lt, yi lt, xi rb, yi rb represent the BBox of the i-th object, where xi lt, yi lt and xi rb, yi rb represent the top-left and bottom-right coordinates of the object, respectively. Simultaneously, the offset of the center point of the i-th object is defined as shown in Equation (7).
  • 9. Drones 2024, 8, 349 9 of 27 oi xy ≜ δi x, δi y = ci x 4 − ⌊ ci x 4 ⌋, ci y 4 − ⌊ ci y 4 ⌋ ! (7) This helps improve the accuracy of predicting the center position of the object. The term ô ∈ Rh×w×2 represents the output of the center offset head, and ôi xy represents the offset prediction of the i-th object at position (x, y) on ô. The appearance branch is responsible for generating embedding features that assist in identifying the object. Each head consists of a 3 × 3 convolutional layer with 256 channels, followed by a 1 × 1 convolutional layer to produce the final output. The embedding heads of the appearance branch calculate the appearance feature vectors of the object, which are used in the association matching operation for multi-object tracking tasks. Specifically, these appearance feature vectors can be used for association matching to calculate the similarity between the tracker and the detected object. A 128-dimensional vector at position (x, y) represents the appearance feature vector of the object at that location. 3.2. GAO Trajectory Matching Pattern Our GAO trajectory matching pattern considers detection confidence, appearance embedding distance, and IoU distance to associate all tracking trajectories with all detection Bboxes. Figure 5 illustrates the architecture of the module. When receiving detection results from the detector output, we add detection Bboxes with confidence scores higher than 0.5 to the high-score detection Bbox set, and those between 0.2 and 0.5 are added to the low-score detection Bbox set. Figure 5. GAO trajectory matching module. Initially, predicted trajectories are matched with high-score detection boxes using the Appear-IOU matching method. Unmatched trajectories then undergo secondary matching with low-score detection boxes via Gau-IOU matching, with any remaining unmatched low-score boxes removed. Subsequently, high-score detection boxes that were not initially matched are re-evaluated using (optimal subpattern assignment) OSPA-IOU matching with previously unmatched trajectories from the previous frame. High-score boxes unmatched after both attempts are considered new trajectories, while trajectories that have been continuously unmatched for 30 frames are removed from tracking, with flexibility to adjust based on the video frame rate. Successful matches update tracking through the update process with matched de- tection frames. Trajectory prediction involves modeling visual objects’ trajectories as a random finite set, utilizing the visual Gaussian mixture probability hypothesis density
Trajectory prediction models the trajectories of visual objects as a random finite set and uses the visual Gaussian mixture probability hypothesis density filter to generate prediction information for the tracker, which primes the model for the next frame's association matching. The data association process employs four distance metrics, leading to the design of three matching methods: Gau-IOU, Appear-IOU, and OSPA-IOU distance matching.

3.2.1. Appear-IOU Distance Matching

Appear-IOU trajectory matching considers both the appearance and the spatial location of objects and predicted trajectories: it uses as metrics the cosine distance between appearance embeddings and the IOU distance between boxes, computed for all pairs of predicted trajectories and high-score detections. The appearance vector of an object carries rich appearance information, which is combined with the IOU distance of the BBox to improve the matching accuracy between detection boxes and trajectories. The process is shown in Figure 6.

Figure 6. Appear-IOU trajectory matching.

Let $(BBox_d^i, E_d^i)$ denote the detection BBox of the i-th detected object in the current frame and its corresponding feature vector, and let $(BBox_t^j, E_t^j)$ denote the BBox predicted for the j-th trajectory from the previous frame and its corresponding feature vector. The first distance metric, $D_{ij}^{I}$, is computed from the IOU distance:

$D_{ij}^{I} = 1 - \frac{\mathrm{area}(BBox_d^i \cap BBox_t^j)}{\mathrm{area}(BBox_d^i \cup BBox_t^j)}$  (8)

where $\mathrm{area}(A)$ is the area of the set $A$, and $\cap$ and $\cup$ denote the intersection and union of two sets. The appearance distance metric $D_{ij}^{A}$ is the cosine distance between the two embedding feature vectors:

$D_{ij}^{A} = 1 - \frac{E_d^i \cdot E_t^j}{\lVert E_d^i \rVert\, \lVert E_t^j \rVert}$  (9)

where $\cdot$ denotes the dot product of two vectors and $\lVert \cdot \rVert$ the 2-norm. The IOU distances and appearance distances between all detections and trajectories are then combined in a weighted manner to obtain the Appear-IOU distance:

$D^{AI} = \alpha D_{ij}^{A} + (1 - \alpha) D_{ij}^{I}$  (10)

where $\alpha$ is the weight of the cosine distance and takes values between 0 and 1.
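A minimal sketch of Equations (8)-(10) follows, together with the Hungarian assignment used in the next step. It assumes boxes in (x1, y1, x2, y2) form and row-wise embedding matrices; the function names and the illustrative gating threshold are ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(det_boxes, trk_boxes):
    """Pairwise IoU between (N, 4) detection and (M, 4) trajectory boxes."""
    d = det_boxes[:, None, :]          # (N, 1, 4)
    t = trk_boxes[None, :, :]          # (1, M, 4)
    ix1 = np.maximum(d[..., 0], t[..., 0]); iy1 = np.maximum(d[..., 1], t[..., 1])
    ix2 = np.minimum(d[..., 2], t[..., 2]); iy2 = np.minimum(d[..., 3], t[..., 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_d = (d[..., 2] - d[..., 0]) * (d[..., 3] - d[..., 1])
    area_t = (t[..., 2] - t[..., 0]) * (t[..., 3] - t[..., 1])
    return inter / (area_d + area_t - inter + 1e-9)

def appear_iou_cost(det_boxes, det_embs, trk_boxes, trk_embs, alpha=0.5):
    """Weighted Appear-IOU cost of Equations (8)-(10); alpha weights the cosine distance."""
    d_iou = 1.0 - iou_matrix(det_boxes, trk_boxes)                       # Equation (8)
    det_n = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
    trk_n = trk_embs / np.linalg.norm(trk_embs, axis=1, keepdims=True)
    d_app = 1.0 - det_n @ trk_n.T                                        # Equation (9)
    return alpha * d_app + (1.0 - alpha) * d_iou                         # Equation (10)

def hungarian_match(cost, max_cost=0.8):
    """Solve the assignment; max_cost is an illustrative gate, not a value from the paper."""
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
```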
Finally, all Appear-IOU distances are assembled into a cost matrix, and the Hungarian algorithm is employed to obtain the best match. Unmatched trajectories undergo secondary matching with low-score detections through the Gau-IOU matching model, while unmatched high-score detection boxes undergo secondary matching with inactive trajectories through the OSPA-IOU matching model.

3.2.2. Gau-IOU Distance Matching

The Gau-IOU distance matching process is illustrated in Figure 7. Low-score detection boxes often correspond to small objects. To better capture their features, both the low-score detections and the trajectories to be matched are transformed into Gaussian space, and the Wasserstein distance (WD) between the resulting Gaussian distributions is combined with the IOU distance.

Figure 7. Gau-IOU trajectory matching.

We first transform the BBoxes of the object and the trajectory into Gaussian space. For a box represented by $(x, y, h, w)$, the parameters of the Gaussian distribution $N(x \mid \mu, \Sigma)$ are computed as:

$\mu = [x, y]^{T}$  (11)

$\Sigma = \begin{bmatrix} w^2/4 & 0 \\ 0 & h^2/4 \end{bmatrix}$  (12)

The key to matching detection boxes with trajectories is how to measure the similarity between the Gaussian distribution $N_d(x_d \mid \mu_d, \Sigma_d)$ of the detection box and $N_t(x_t \mid \mu_t, \Sigma_t)$ of the trajectory box. We use the Wasserstein distance between the two Gaussian distributions, which is defined as:

$D_W(N_d, N_t) = \lVert \mu_d - \mu_t \rVert_2^2 + \mathrm{Tr}(\Sigma_d) + \mathrm{Tr}(\Sigma_t) - 2\,\mathrm{Tr}\!\left(\left(\Sigma_d^{1/2} \Sigma_t \Sigma_d^{1/2}\right)^{1/2}\right)$  (13)

The Wasserstein distance consists of two components: the distance between the center points, determined by $(x, y)$, and a coupling term related to $(h, w)$. Because these parameters form a chain-like coupling relationship and influence one another, the Wasserstein distance is highly advantageous for high-precision matching. Next, the IOU distance and the WD between all detections and trajectories are weighted to obtain the Gau-IOU distance:

$D^{GI} = \beta D_W + (1 - \beta) D_{ij}^{I}$  (14)

where $\beta$ is the weight of the WD and takes values between 0 and 1. Finally, the Hungarian algorithm is employed to obtain the best matching between detections and trajectories based on all Gau-IOU distances. Unmatched trajectories are converted to inactive trajectories, and unmatched low-score detections are removed.
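The sketch below illustrates the Gaussian conversion of Equations (11)-(12), the Wasserstein distance of Equation (13), and the fusion of Equation (14). It assumes centre-format (x, y, w, h) boxes, and in practice the Wasserstein term is usually normalized to [0, 1] before being mixed with the IoU distance; all names are ours.

```python
import numpy as np
from scipy.linalg import sqrtm

def box_to_gaussian(box):
    """Map a centre-format (x, y, w, h) box to (mu, Sigma), cf. Equations (11)-(12)."""
    x, y, w, h = box
    mu = np.array([x, y], dtype=np.float64)
    sigma = np.diag([w ** 2 / 4.0, h ** 2 / 4.0])
    return mu, sigma

def wasserstein2(mu_d, sig_d, mu_t, sig_t):
    """Squared 2-Wasserstein distance between two Gaussians, cf. Equation (13)."""
    root_d = sqrtm(sig_d)
    cross = sqrtm(root_d @ sig_t @ root_d)
    return (np.sum((mu_d - mu_t) ** 2)
            + np.trace(sig_d) + np.trace(sig_t)
            - 2.0 * np.trace(np.real(cross)))

def gau_iou_cost(d_wd, d_iou, beta=0.5):
    """Fuse the (suitably normalized) Wasserstein and IoU distances, cf. Equation (14)."""
    return beta * d_wd + (1.0 - beta) * d_iou

# Example: distance between a low-score detection and a predicted trajectory box.
mu_d, sig_d = box_to_gaussian((100.0, 50.0, 20.0, 40.0))
mu_t, sig_t = box_to_gaussian((104.0, 53.0, 22.0, 38.0))
d_wd = wasserstein2(mu_d, sig_d, mu_t, sig_t)
```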
3.2.3. OSPA-IOU Distance Matching

The OSPA distance takes subpattern matching of object trajectories into account, enabling the model to better capture both the similarities and the differences between object trajectories and, in turn, providing a more accurate assessment of tracking performance. Building on IOU distance matching, we comprehensively consider the OSPA distance and propose the OSPA-IOU trajectory matching model. The process is illustrated in Figure 8.

Figure 8. OSPA-IOU trajectory matching.

Assume the object state set is $X = \{x_1, x_2, \ldots, x_m\}$ and the object trajectory set is $Y = \{y_1, y_2, \ldots, y_n\}$, where $m, n \in \mathbb{N}_0 = \{0, 1, 2, \ldots\}$ represent the estimated and true numbers of objects, respectively. The OSPA distance is expressed as:

$D_{p,c}(X, Y) = \left[\frac{1}{n}\left(\min_{\pi \in \Pi_n} \sum_{i=1}^{m} d_c\big(x_i, y_{\pi(i)}\big)^p + (n - m)\,c^p\right)\right]^{1/p}$  (15)

where $\Pi_n$ is the set of all permutations that select $m$ indices from $\{1, 2, \ldots, n\}$, and $d_c(x, y) = \min(d(x, y), c)$ is the base distance truncated at the cut-off $c$. If $p = 1$, the OSPA distance can be decomposed as:

$D_{p,c}(X, Y) = e_{p,c}^{loc}(X, Y) + e_{p,c}^{card}(X, Y)$  (16)

$e_{p,c}^{loc}(X, Y) = \left[\frac{1}{n}\min_{\pi \in \Pi_n} \sum_{i=1}^{m} d_c\big(x_i, y_{\pi(i)}\big)^p\right]^{1/p}$  (17)

$e_{p,c}^{card}(X, Y) = \left[\frac{1}{n}(n - m)\,c^p\right]^{1/p}$  (18)

where $e_{p,c}^{loc}(X, Y)$ and $e_{p,c}^{card}(X, Y)$ represent the positional difference and the cardinality difference between the sets of estimated and true object states, respectively. The positional difference reflects the spatial gap, while the cardinality difference captures performance aspects such as the false-track proportion, redundancy, and interruptions. The truncation parameter $c$ adjusts the balance between positional and cardinality differences, with smaller values prioritizing positional differences.

Treating each detection as a single-element set and each trajectory as a multi-element set, we compute the OSPA distance between them to optimize the matching between individual detections and trajectories. Subsequently, the IOU distance and the OSPA distance between all detections and trajectories are weighted to derive the OSPA-IOU distance:

$D^{OI} = \lambda D_{p,c} + (1 - \lambda) D_{ij}^{I}$  (19)

where $\lambda$ is the weight of the OSPA distance and takes values between 0 and 1. Finally, the Hungarian algorithm is applied to obtain the best matching between detections and trajectories based on all OSPA-IOU distances. Unmatched high-score detections are initialized as new trajectories, and inactive trajectories that remain unmatched for 30 frames are removed.
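For reference, here is a small sketch of the OSPA distance for p = 1 (Equations (16)-(18)), using Euclidean base distances between point states and the Hungarian algorithm for the optimal permutation. The function name and the handling of the m > n case by symmetry are our assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ospa_p1(X, Y, c=1.0):
    """OSPA distance for p = 1 between two point sets (cf. Equations (16)-(18)).

    X : (m, d) estimated states, Y : (n, d) trajectory states; c is the cut-off
    that also penalises the cardinality mismatch.
    """
    X = np.asarray(X, dtype=np.float64)
    Y = np.asarray(Y, dtype=np.float64)
    m, n = len(X), len(Y)
    if m == 0 and n == 0:
        return 0.0
    if m > n:                       # the metric is symmetric, so swap to keep m <= n
        X, Y, m, n = Y, X, n, m
    if m == 0:                      # one set empty: pure cardinality penalty
        return c
    d = np.minimum(np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1), c)
    rows, cols = linear_sum_assignment(d)          # optimal permutation
    loc = d[rows, cols].sum()                      # localisation term
    card = (n - m) * c                             # cardinality term
    return (loc + card) / n
```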
3.2.4. Visual Gaussian Mixture Probability Hypothesis Density

The visual Gaussian mixture probability hypothesis density (VGM-PHD) filtering algorithm uses the center positions of all trajectories as the measurement input of the random finite set, while preserving object ID and size data to reconstruct trajectories. Its assumptions are that the PHDs of both spawned and newly born objects are Gaussian mixtures, that object detection and survival probabilities are independent, and that the state transition density and observation likelihood functions are linear Gaussian models. Both the motion model and the observation model of the VGM-PHD filter are therefore linear, and noise and errors follow Gaussian distributions. Using the weights, means, and covariances of the PHD Gaussian components, the algorithm iteratively propagates the multi-object state. The specific implementation steps of the VGM-PHD filtering algorithm are as follows.

Assume the posterior PHD at time $k-1$ is given by the Gaussian sum

$T_{k-1}(x) = \sum_{i=1}^{J_{k-1}} \omega_{k-1}^{i}\, N\big(x;\, m_{k-1}^{i}, P_{k-1}^{i}\big)$  (20)

where $\omega_k^i$, $m_k^i$, and $P_k^i$ are the weight, mean, and covariance of the i-th Gaussian component at time $k$ for a single object state $x$, $J_k$ is the number of Gaussian components at time $k$, and $N(\cdot)$ denotes a Gaussian density. The predicted intensity function at time $k$ is given by:

$T_{k|k-1}(x) = T_{S,k|k-1}(x) + T_{\beta,k|k-1}(x) + \gamma_k(x)$  (21)

The three terms on the right-hand side represent the predicted PHDs of surviving objects, spawned objects, and newly born objects, respectively. The intensity function obtained from the GM-PHD update can be expressed as:

$T_k(x) = (1 - P_{D,k})\, T_{k|k-1}(x) + \sum_{z \in Z_k} T_{D,k}(x; z)$  (22)

where the first term is the PHD of missed objects and the second term is the updated PHD of detected objects. In the VGM-PHD filter, if the PHD at time $k-1$ is a Gaussian mixture, then both the prior produced by the prediction at time $k$ and the posterior obtained by the update can also be represented in Gaussian mixture form. The weights are obtained through PHD filtering, while the means and covariances are propagated recursively by Kalman filtering. During the prediction and update of the object PHD in VGM-PHD, the predicted number of objects $N_{k|k-1}$ and the updated number of objects $N_k$ are given by:

$N_{k|k-1} = \sum_{i=1}^{J_{k|k-1}} \omega_{k|k-1}^{i} = N_{k-1} P_{S,k} + \sum_{i=1}^{J_{\beta,k}} \omega_{\beta,k}^{i} + \sum_{j=1}^{J_{\gamma,k}} \omega_{\gamma,k}^{j}$  (23)

$N_k = \sum_{n=1}^{J_k} \omega_k^{n} = N_{k|k-1}(1 - P_{D,k}) + \sum_{z \in Z_k} \sum_{j=1}^{J_{k|k-1}} \omega_k^{j}(z)$  (24)

4. Experiments

4.1. Dataset and Evaluation Metrics

The proposed algorithm undergoes comprehensive evaluations on the VisDrone MOT [22] and UAVDT [23] datasets, which encompass diverse drone-captured scenes and facilitate a thorough assessment of the proposed method's practical effectiveness. Extensive evaluations compare the algorithm with other leading multi-object trackers across various scenarios and conditions. Established MOT evaluation metrics are used to assess performance comprehensively, with the aim of gauging overall effectiveness and pinpointing potential weaknesses in each model. The metrics include:

(1) FP (↓): number of false positives in the entire video.
(2) FN (↓): number of false negatives in the entire video.
(3) IDSW (↓): number of identity switches in the entire video.
(4) FM (↓): number of ground-truth trajectories interrupted during tracking.
(5) IDF1 (↑): ratio of correctly identified detections to the computed detections and ground truth.
(6) MOTA (↑): combines FP, FN, and IDSW:

$\mathrm{MOTA} = 1 - \frac{FN + FP + IDSW}{GT}$  (25)

where $GT$ is the total number of ground-truth objects.

(7) MOTP (↑): measures the mismatch between the ground truth and the predicted results:

$\mathrm{MOTP} = 1 - \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t}$  (26)

where $d_{t,i}$ is the localization error of the i-th matched pair in frame $t$ and $c_t$ is the number of matches in frame $t$.

These metrics contribute to a comprehensive assessment of MOT algorithm performance in various aspects, providing in-depth insight into system effectiveness.

4.2. Training Preprocessing

Existing MOT methods that integrate object detection and appearance embedding often use a single-stage training approach in which the detection and appearance branches are trained simultaneously. While this reduces training time, it can harm detection performance because the two branches have different learning objectives. In densely populated scenes, fully occluded objects may still have annotated bounding boxes in the training dataset, which introduces errors when learning appearance embeddings and reduces tracking accuracy. To address this, our model filters highly occluded objects out of the training samples before model training begins.

To implement this, we first define a metric $B_{overlap} \in [0, 1]$ to gauge the overlap between two ground-truth BBoxes:

$B_{overlap} = \frac{\mathrm{area}\big(BBox_{GT}^{i} \cap BBox_{GT}^{j}\big)}{\mathrm{area}\big(BBox_{GT}^{i} \cup BBox_{GT}^{j}\big)}$  (27)

where $BBox_{GT}^{i}$ and $BBox_{GT}^{j}$ are the i-th and j-th ground-truth BBoxes of an input training sample. A higher value indicates greater overlap between the two ground-truth BBoxes. In object detection, $B_{overlap} \ge 0.75$ signifies substantial overlap between two BBoxes. We therefore set the threshold at $B_{overlap} \ge 0.75$, treat the smaller of the two BBoxes as an occluded object, and exclude it from the training dataset. The model is ultimately trained on the filtered dataset.
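A minimal sketch of this occlusion filter under Equation (27): pairs of ground-truth boxes whose overlap reaches 0.75 have the smaller, presumably occluded, box dropped before training. Corner-format boxes and the function name are our assumptions.

```python
import numpy as np

def filter_occluded(gt_boxes, thr=0.75):
    """Drop the smaller box of any ground-truth pair whose overlap
    reaches the threshold (Equation (27) with B_overlap >= 0.75)."""
    boxes = np.asarray(gt_boxes, dtype=np.float64)   # (N, 4) as (x1, y1, x2, y2)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = np.ones(len(boxes), dtype=bool)
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            x1 = max(boxes[i, 0], boxes[j, 0]); y1 = max(boxes[i, 1], boxes[j, 1])
            x2 = min(boxes[i, 2], boxes[j, 2]); y2 = min(boxes[i, 3], boxes[j, 3])
            inter = max(x2 - x1, 0.0) * max(y2 - y1, 0.0)
            overlap = inter / (areas[i] + areas[j] - inter + 1e-9)
            if overlap >= thr:
                # treat the smaller of the two boxes as the occluded one
                keep[j if areas[j] < areas[i] else i] = False
    return boxes[keep]
```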
4.3. Experimental Settings

The detector is initialized with weights pre-trained on the COCO dataset. We train it using SGD for 150 epochs with a batch size of 16, a learning rate of 0.02, momentum of 0.9, and weight decay of 0.0001. The detector is trained on both the VisDrone and UAVDT datasets, and validation is performed on the same set of verification images. Testing is executed on an NVIDIA RTX 4090 GPU with 24 GB of memory, and we calculate the average of the top-100 most reliable detection results.

4.4. Comparative Experiments

4.4.1. Detection Comparison

To compare the performance of our detector, we select seven strong detectors: DETR [17], Deformable DETR [24], YOLOS [25], Swin-JDE [18], ViTDet [26], RTD-Net [27], and DN-DETR [28]. They are trained and evaluated on the VisDrone and UAVDT datasets using the experimental settings described in their respective papers. DETR completely discards traditional object detection components such as anchor boxes and non-maximum suppression and uses a pure attention mechanism for end-to-end object detection. Deformable DETR is an improved version of DETR that introduces deformable attention to enhance the model's adaptability to changes in object shape and scale. YOLOS employs a small feature extractor, skip connections, cascaded skip connections, and a reshaping pass-through layer to facilitate cross-network feature reuse, combining low-level positional information with more meaningful high-level information. Swin-JDE adopts a Swin transformer based on windowed self-attention as the backbone network to enhance feature extraction. ViTDet uses ViT as the backbone of a Mask R-CNN object detection model and improves competitiveness by optimizing the RPN. RTD-Net replaces positional linear projection with convolutional projection and uses an efficient convolutional multi-head self-attention mechanism based on convolutional transformer blocks to improve the recognition of occluded objects by extracting contextual information. DN-DETR introduces a denoising training approach that addresses the instability of bipartite graph matching in the DETR decoder during training, doubling the convergence speed and significantly improving detection results. The comparative results in Table 1 demonstrate the substantial advantages of our detector.
AP is the average precision; AP@0.5 and AP@0.75 are computed at intersection-over-union thresholds of 0.5 and 0.75, respectively. APs, APm, and APl are the average precisions for small objects (area smaller than 32 × 32 pixels), medium objects (area between 32 × 32 and 96 × 96 pixels), and large objects (area larger than 96 × 96 pixels), respectively.
The visual comparison results in Figures 9 and 10 show that our method performs well under various lighting conditions and in crowded environments.

Figure 9. Comparison of detection results on the VisDrone dataset: (a) DETR, (b) Deformable DETR, (c) YOLOS, (d) Swin-JDE, (e) ViTDet, (f) RTD-Net, (g) DN-DETR, (h) Holistic Trans-Det.
Figure 10. Comparison of detection results on the UAVDT dataset: (a) DETR, (b) Deformable DETR, (c) YOLOS, (d) Swin-JDE, (e) ViTDet, (f) RTD-Net, (g) DN-DETR, (h) Holistic Trans-Det.
Table 1. Detection results of the compared detectors on the two datasets.

Dataset  | Detector             | AP   | AP@0.5 | AP@0.75 | APs  | APm  | APl
---------|----------------------|------|--------|---------|------|------|-----
VisDrone | DETR [17]            | 34.8 | 63.4   | 32.2    | 12.8 | 38.5 | 55.6
VisDrone | Deformable DETR [24] | 36.9 | 60.4   | 35.2    | 9.9  | 38.1 | 52.7
VisDrone | YOLOS [25]           | 36.6 | 63.1   | 38.7    | 15.4 | 39.9 | 54.9
VisDrone | Swin-JDE [18]        | 38.2 | 60.5   | 34.8    | 11.1 | 41.4 | 57.6
VisDrone | ViTDet [26]          | 38.9 | 64.7   | 38.7    | 19.6 | 40.5 | 57.8
VisDrone | RTD-Net [27]         | 38.1 | 64.6   | 40.2    | 17.6 | 42.8 | 57.6
VisDrone | DN-DETR [28]         | 39.4 | 63.4   | 36.5    | 16.8 | 42.5 | 59.2
VisDrone | Holistic Trans-Det   | 39.6 | 67.9   | 40.8    | 18.6 | 40.3 | 59.4
UAVDT    | DETR [17]            | 48.8 | 69.3   | 49.3    | 28.0 | 47.5 | 57.1
UAVDT    | Deformable DETR [24] | 47.2 | 69.2   | 50.3    | 29.0 | 53.2 | 59.4
UAVDT    | YOLOS [25]           | 49.3 | 71.1   | 51.4    | 32.3 | 50.4 | 58.9
UAVDT    | Swin-JDE [18]        | 49.6 | 69.9   | 52.8    | 33.9 | 54.8 | 59.7
UAVDT    | ViTDet [26]          | 54.6 | 68.9   | 59.5    | 37.5 | 57.9 | 61.0
UAVDT    | RTD-Net [27]         | 52.2 | 71.4   | 55.6    | 36.3 | 57.2 | 60.9
UAVDT    | DN-DETR [28]         | 56.7 | 68.6   | 60.2    | 38.7 | 59.8 | 62.9
UAVDT    | Holistic Trans-Det   | 57.5 | 69.0   | 60.5    | 38.8 | 61.5 | 67.9

4.4.2. Tracking Comparison

We compared GAO-Tracker with motion-based methods, namely DeepSORT [29], ByteTrack [30], BoT-SORT [31], UAVMOT [32], DCMOT [33], TFAM [34], MTTJDT [35], and SimpleTrack [36], as well as with transformer-based methods, including TransTrack [37], TrackFormer [38], TransCenter [39], MOTR [20], MeMOT [21], GTR [40], TR-MOT [41], GCEVT [42], STN-Track [43], and STDFormer [19]. These comparisons were conducted on the VisDrone MOT and UAVDT datasets.

To ensure consistent comparisons despite variations in object distributions across datasets, we employed the holistic trans-detector to produce uniform detection results for all tracking comparison methods. This mitigates evaluation bias stemming from uneven category distributions, fostering fairer and more reliable comparisons of the tracking methods. To maintain detection accuracy across categories during evaluation, distinct score thresholds were applied: 0.3 for cars, 0.1 for trucks, and 0.4 for pedestrians, with a lower threshold of 0.05 for buses, which present greater visual variability.

Tables 2 and 3 comprehensively compare GAO-Tracker with other popular trackers on the VisDrone MOT and UAVDT datasets. The evaluation covers critical metrics such as MOTA, MOTP, IDF1, and IDSW. GAO-Tracker demonstrates excellent performance by effectively utilizing position and appearance information. DeepSORT associates categories independently using positional information. ByteTrack utilizes low-score detections for similarity tracking and background-noise filtering. BoT-SORT incorporates camera motion compensation for improved matching. UAVMOT enhances object feature association with an ID feature update module. SimpleTrack merges object embedding cosine and GIOU distances to create a new association matrix. Among the transformer-based methods, TransTrack employs a query-key mechanism for tracking existing objects and detecting new ones. TrackFormer considers position, occlusion, and object identity features simultaneously. TransCenter predicts the association heatmap of object centers globally. MOTR models the entire trajectory of an object with a tracking query. MeMOT uses information from previous frames as tracking cues. GTR extends the matching window length and fully utilizes interaction information. TR-MOT achieves reliable associations using visual temporal features. STDFormer uses the transformer's long-range modeling capability to extract intent and decision information. However, these methods apply a single matching rule to all detection classes, which leads to inaccurate tracking of some object classes and poorer overall performance.
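To make the evaluation protocol above concrete, a tiny sketch of the per-class score filtering is given here; the thresholds are those quoted in the text, while the dictionary and function names are our own.

```python
# Hypothetical per-class score filtering used when preparing detections for the
# tracking comparison; thresholds follow the values quoted above.
CLASS_THRESHOLDS = {"car": 0.3, "truck": 0.1, "pedestrian": 0.4, "bus": 0.05}

def filter_by_class(dets, default_thr=0.3):
    """Keep a detection only if its score clears the threshold of its class."""
    return [d for d in dets
            if d["score"] >= CLASS_THRESHOLDS.get(d["class"], default_thr)]
```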
Table 2. Comparison between GAO-Tracker and recent multiple object trackers on the VisDrone dataset.

Type              | Tracker          | MOTA↑ | MOTP↑ | IDF1 (%)↑ | IDSW↓ | MT (%)↑ | ML (%)↑ | FP↓    | FN↓
------------------|------------------|-------|-------|-----------|-------|---------|---------|--------|--------
Motion-based      | DeepSORT [29]    | 19.4  | 69.8  | 33.1      | 6387  | 38.8    | 52.2    | 15,181 | 44,830
Motion-based      | ByteTrack [30]   | 25.1  | 72.6  | 40.8      | 4590  | 42.8    | 50.3    | 10,722 | 24,376
Motion-based      | BoT-SORT [31]    | 23.0  | 71.6  | 41.4      | 7014  | 51.9    | 73.6    | 10,701 | 47,922
Motion-based      | UAVMOT [32]      | 25.0  | 72.3  | 40.5      | 6644  | 52.6    | 49.6    | 10,134 | 55,630
Motion-based      | DCMOT [33]       | 33.5  | 76.1  | 45.5      | 1139  | -       | -       | 12,594 | 64,856
Motion-based      | TFAM [34]        | 30.9  | 74.4  | 42.7      | 3998  | -       | -       | 27,732 | 126,811
Motion-based      | MTTJDT [35]      | 31.2  | 73.2  | 43.6      | 2415  | -       | -       | 25,976 | 183,381
Transformer-based | TransTrack [37]  | 27.3  | 62.1  | 28.3      | 2523  | 33.5    | 59.7    | 15,028 | 51,396
Transformer-based | TrackFormer [38] | 24.0  | 77.3  | 38.0      | 4724  | 39.0    | 46.3    | 11,731 | 32,807
Transformer-based | TransCenter [39] | 29.9  | 66.6  | 46.8      | 3446  | 33.4    | 61.8    | 15,104 | 20,894
Transformer-based | MOTR [20]        | 13.1  | 72.4  | 47.1      | 2997  | 52.9    | 72.0    | 12,216 | 42,186
Transformer-based | MeMOT [21]       | 29.4  | 73.0  | 48.7      | 3755  | 46.7    | 47.9    | 9963   | 30,062
Transformer-based | GTR [40]         | 28.1  | 76.8  | 54.5      | 2000  | 61.3    | 57.6    | 8165   | 10,553
Transformer-based | TR-MOT [41]      | 29.9  | 64.3  | 46.0      | 1005  | 42.8    | 59.9    | 7593   | 17,352
Transformer-based | GCEVT [42]       | 34.5  | 73.8  | 50.6      | 841   | 520     | 612     | -      | -
Transformer-based | STN-Track [43]   | 38.6  | -     | 73.7      | 668   | 31.4    | 51.2    | 7385   | 76,006
Transformer-based | STDFormer [19]   | 35.9  | 74.5  | 59.9      | 1441  | 52.7    | 60.3    | 8527   | 20,558
Ours              | GAO-Tracker      | 38.8  | 76.3  | 54.3      | 972   | 55.9    | 52.4    | 6883   | 10,204

Table 3. Comparison between GAO-Tracker and recent multiple object trackers on the UAVDT dataset.

Type              | Tracker          | MOTA↑ | MOTP↑ | IDF1 (%)↑ | IDSW↓ | MT (%)↑ | ML (%)↑ | FP↓    | FN↓
------------------|------------------|-------|-------|-----------|-------|---------|---------|--------|--------
Motion-based      | DeepSORT [29]    | 35.9  | 71.5  | 58.3      | 698   | 43.4    | 25.7    | 50,513 | 59,733
Motion-based      | ByteTrack [30]   | 39.1  | 74.3  | 44.7      | 2341  | 43.8    | 28.1    | 14,468 | 87,485
Motion-based      | BoT-SORT [31]    | 37.2  | 72.1  | 53.1      | 1692  | 40.8    | 27.3    | 42,286 | 64,494
Motion-based      | UAVMOT [32]      | 43.0  | 73.5  | 61.5      | 641   | 45.3    | 22.7    | 27,832 | 65,467
Motion-based      | SimpleTrack [36] | 45.3  | 73.9  | 57.1      | 1404  | 43.6    | 22.5    | 21,153 | 53,448
Motion-based      | TFAM [34]        | 47.0  | 72.9  | 67.8      | 506   | -       | -       | 68,282 | 111,959
Transformer-based | TransTrack [37]  | 33.2  | 72.4  | 67.6      | 1122  | 38.9    | 23.8    | 50,746 | 54,938
Transformer-based | TrackFormer [38] | 53.4  | 74.2  | 46.3      | 2247  | 43.7    | 23.3    | 13,719 | 91,061
Transformer-based | TransCenter [39] | 48.9  | 73.9  | 51.3      | 2287  | 32.6    | 35.1    | 27,995 | 93,013
Transformer-based | MOTR [20]        | 35.6  | 72.5  | 56.1      | 1759  | 39.8    | 29.3    | 39,733 | 56,368
Transformer-based | MeMOT [21]       | 45.6  | 74.6  | 62.8      | 2118  | 34.9    | 26.5    | 38,933 | 59,156
Transformer-based | GTR [40]         | 46.5  | 75.3  | 61.1      | 1482  | 42.7    | 18.6    | 21,676 | 52,617
Transformer-based | TR-MOT [41]      | 57.7  | 74.1  | 55.7      | 2461  | 33.9    | 21.3    | 32,217 | 50,838
Transformer-based | GCEVT [42]       | 47.6  | 73.4  | 68.6      | 1801  | 618     | 363     | -      | -
Transformer-based | STN-Track [43]   | 60.6  | -     | 73.1      | 1420  | 57.0    | 17.0    | 12,825 | 61,760
Transformer-based | STDFormer [19]   | 60.6  | 74.8  | 61.7      | 1642  | 44.6    | 20.3    | 20,258 | 41,895
Ours              | GAO-Tracker      | 61.7  | 75.2  | 67.9      | 1216  | 45.3    | 24.6    | 24,915 | 59,640

Combining the data in Tables 2 and 3, we observe that transformer-based methods outperform motion-based methods. This trend reflects the effectiveness and superiority of transformer-based methods for multi-object tracking in drone aerial videos: they better capture long-distance dependencies between objects in complex environments and better handle challenges such as object occlusion and scale changes.
Figures 11 and 12 show time-ordered frames with bounding boxes and identities in different colors. In the initial frames (left), bounding boxes may appear inconsistent due to occlusion. In the final frames (right), however, GAO-Tracker maintains consistent bounding boxes, reducing pedestrian identity switching. The center frames show intermediate steps in which identities might temporarily switch because of occlusions or overlaps. The final frames demonstrate GAO-Tracker's ability to preserve identities throughout the sequence, even in crowded scenes. By utilizing object motion information, GAO-Tracker's trajectory association effectively mitigates the missed and incorrect detections caused by occlusion, especially for objects that overlap briefly. Compared with previous algorithms based on bounding-box connections, GAO-Tracker reduces pedestrian identity switching. The results indicate that GAO-Tracker performs well in crowded drone aerial scenes and maintains consistent bounding boxes and identities throughout the entire sequence.

Figure 11. Tracking results of GAO-Tracker on the VisDrone dataset.

Figure 12. Tracking results of GAO-Tracker on the UAVDT dataset.

4.5. Ablation Experiments

To demonstrate the effectiveness of the designed method, we conducted multiple sets of ablation experiments on the VisDrone and UAVDT datasets, covering the training preprocessing strategy, the GAO module, the sequence of matching strategies, and VGM-PHD.
4.5.1. Effect of Backbone

To validate the effectiveness of our holistic transformer as the backbone network, we compared it with ResNet-50, DLA-34, ViT, and Swin-L in ablation experiments. Table 4 presents the performance of the proposed GAO-Tracker combined with different backbone networks. This experiment used the proposed data association method as the post-processing module and evaluated the UAVDT and VisDrone test sets. Based on the results in Table 4, we have the following findings. On UAVDT, using DLA-34 as the backbone yielded the best performance, with MOTA, MOTP, and IDF1 reaching 61.9%, 75.1%, and 66.4%, respectively, while the holistic transformer backbone produced the lowest IDSW count. On VisDrone, compared to ResNet-50, DLA-34, ViT, and Swin-L, the holistic transformer backbone achieved 38.8% MOTA, 76.3% MOTP, and 54.3% IDF1 with a significant reduction in FP. Since VisDrone contains many congested scenes, these results indicate that the holistic transformer backbone improves MOT performance in crowded scenarios. Tracking with the DLA-34 backbone was best on UAVDT but markedly worse on VisDrone; conversely, the holistic transformer backbone was slightly inferior on UAVDT but best on VisDrone. The increase in MOTA and the decrease in FP with the holistic transformer backbone indicate that our model significantly enhances the ability to detect correct objects.

Table 4. Performance evaluation of the proposed GAO-Tracker model combined with different backbone networks.

Dataset  | Backbone       | MOTA↑ | MOTP↑ | IDF1 (%)↑ | IDSW↓ | MT (%)↑ | ML (%)↑ | FP↓    | FN↓
---------|----------------|-------|-------|-----------|-------|---------|---------|--------|--------
VisDrone | ResNet-50      | 19.6  | 59.9  | 36.7      | 4287  | 35.3    | 31.3    | 9078   | 18,764
VisDrone | DLA-34         | 34.9  | 68.5  | 50.3      | 2198  | 46.3    | 43.5    | 8818   | 13,070
VisDrone | ViT            | 35.2  | 69.7  | 51.0      | 2019  | 48.9    | 45.9    | 8009   | 12,897
VisDrone | Swin-L         | 35.5  | 70.2  | 52.3      | 1509  | 51.9    | 47.6    | 6832   | 12,223
VisDrone | Holistic Trans | 38.8  | 76.3  | 54.3      | 972   | 55.9    | 52.4    | 6883   | 10,204
UAVDT    | ResNet-50      | 56.2  | 70.3  | 62.1      | 2252  | 40.4    | 22.6    | 32,743 | 72,629
UAVDT    | DLA-34         | 61.9  | 75.1  | 66.4      | 1798  | 42.4    | 23.4    | 28,705 | 65,616
UAVDT    | ViT            | 60.1  | 74.0  | 65.9      | 1504  | 42.8    | 23.7    | 26,937 | 62,348
UAVDT    | Swin-L         | 59.6  | 74.4  | 66.0      | 1264  | 43.9    | 23.8    | 25,822 | 61,324
UAVDT    | Holistic Trans | 61.7  | 75.2  | 67.9      | 1216  | 45.3    | 24.6    | 24,915 | 59,640

Based on the observations above, we conclude that the backbone network significantly impacts the tracking performance of multi-object trackers, depending on the density of tracked objects in the scene. Therefore, improving the feature extraction capability of the backbone network is a crucial factor affecting multi-object tracking performance.

4.5.2. Impact of Pre-Processing and Detection Result Classification

During the training of the multi-object tracking model, we removed highly overlapped objects to provide efficient and accurate appearance embedding information for association matching. We also explored the impact of classifying detection boxes into high- and low-score sets. As shown in Table 5, we verified the effectiveness of adding or omitting the training set optimization and the detection score branch. "Pre" indicates training with highly overlapped objects removed, while "Grade" indicates that the model distinguished between high- and low-score detection boxes before input into the GAO trajectory association pattern.
Table 5. Comparison of detection classification with and without preprocessing.

Dataset  | Method      | MOTA↑ | MOTP↑ | IDF1 (%)↑ | IDSW↓ | MT (%)↑ | ML (%)↑ | FP↓    | FN↓
---------|-------------|-------|-------|-----------|-------|---------|---------|--------|--------
VisDrone | Baseline    | 36.2  | 70.9  | 52.5      | 1344  | 53.1    | 49.3    | 9117   | 11,987
VisDrone | B+Pre       | 37.6  | 71.2  | 52.8      | 1320  | 54.3    | 50.1    | 9135   | 11,499
VisDrone | B+Grade     | 37.3  | 74.2  | 52.7      | 1138  | 54.7    | 51.2    | 9627   | 11,060
VisDrone | B+Pre+Grade | 38.8  | 76.3  | 54.3      | 972   | 55.9    | 52.4    | 6883   | 10,204
UAVDT    | Baseline    | 57.8  | 72.0  | 64.0      | 1841  | 42.4    | 23.3    | 29,057 | 67,373
UAVDT    | B+Pre       | 59.3  | 74.4  | 65.6      | 1398  | 43.8    | 23.8    | 25,836 | 62,429
UAVDT    | B+Grade     | 60.4  | 74.7  | 66.1      | 1221  | 44.5    | 23.9    | 25,418 | 60,828
UAVDT    | B+Pre+Grade | 61.7  | 75.2  | 67.9      | 1216  | 45.3    | 24.6    | 24,915 | 59,640

The results indicate that removing the ground-truth BBox annotations of occluded objects reduces errors in learning appearance embeddings, thereby improving the accuracy of tracked object identification. Differentiating between low- and high-score detection boxes effectively reduces trajectory fragmentation and IDSW, enhancing the effectiveness and performance of object tracking. With both preprocessing and detection result classification, MOTA, MOTP, and IDF1 improved by 2.6%, 5.4%, and 1.8% on VisDrone and by 3.9%, 3.2%, and 3.9% on UAVDT, respectively.

4.5.3. Impact of Matching Strategies

We validated the individual contribution of each component by combining different association strategies, as shown in Table 6. The baseline uses IOU matching for all associations, and we gradually replace it with Appear-IOU, Gau-IOU, and OSPA-IOU on top of the baseline. The results indicate that all three proposed association strategies effectively enhance association accuracy. The baseline model shows significantly higher FP and more IDSW, indicating that it introduces more false detections, which degrades trajectory matching quality and increases identity switching. After replacing the high-score detection box matching strategy with Appear-IOU, MOTA and IDF1 improved noticeably; FP increased slightly, while the stronger use of detections significantly reduced FN. After replacing the low-score detection box matching strategy with Gau-IOU, MOTA and MOTP improved significantly and IDSW decreased substantially, demonstrating the effectiveness of matching small, low-score detection boxes in Gaussian space. After substituting the OSPA-IOU distance-based object-to-trajectory matching, in which each high-score detection box is treated as a single-element set matched against the trajectory sets, all metrics improved. These results indicate that the proposed strategies jointly contribute to better overall tracking performance.

Table 6. Comparison of different association strategies.

Dataset  | Method                        | MOTA↑ | MOTP↑ | IDF1 (%)↑ | IDSW↓ | MT (%)↑ | ML (%)↑ | FP↓    | FN↓
---------|-------------------------------|-------|-------|-----------|-------|---------|---------|--------|--------
VisDrone | Baseline                      | 36.2  | 70.9  | 52.5      | 1552  | 53.1    | 49.3    | 9117   | 11,987
VisDrone | B+Appear-IOU                  | 37.1  | 74.4  | 53.2      | 1334  | 53.2    | 49.2    | 9209   | 11,027
VisDrone | B+Appear-IOU+Gau-IOU          | 38.3  | 75.6  | 53.9      | 1052  | 53.8    | 49.9    | 7343   | 10,946
VisDrone | B+Appear-IOU+Gau-IOU+OSPA-IOU | 38.8  | 76.3  | 54.3      | 972   | 55.9    | 52.4    | 6883   | 10,204
UAVDT    | Baseline                      | 57.8  | 72.0  | 64.0      | 1841  | 42.4    | 23.3    | 29,057 | 67,373
UAVDT    | B+Appear-IOU                  | 58.2  | 73.8  | 64.9      | 1536  | 43.0    | 23.7    | 29,133 | 63,781
UAVDT    | B+Appear-IOU+Gau-IOU          | 60.9  | 74.9  | 66.3      | 1297  | 45.0    | 24.0    | 25,011 | 60,369
UAVDT    | B+Appear-IOU+Gau-IOU+OSPA-IOU | 61.7  | 75.2  | 67.9      | 1216  | 45.3    | 24.6    | 24,915 | 59,640
4.5.4. Impact of VGM-PHD

We designed ablation experiments to validate the effectiveness of the VGM-PHD method, comparing it against no trajectory prediction and against a Kalman filter. The results are presented in Table 7. The findings indicate that VGM-PHD offers higher prediction accuracy and robustness than both alternatives across multiple scenarios. In complex environments in particular, it overcomes the limitations of the traditional approaches and improves the accuracy of predicting the future positions of moving objects. Moreover, the decrease in IDSW and the increase in IDF1 indicate improved stability of trajectory tracking. Consequently, overall tracking performance is enhanced.

Table 7. Comparison with and without trajectory prediction.

Dataset  | Method                   | MOTA↑ | MOTP↑ | IDF1 (%)↑ | IDSW↓ | MT (%)↑ | ML (%)↑ | FP↓    | FN↓
---------|--------------------------|-------|-------|-----------|-------|---------|---------|--------|--------
VisDrone | No trajectory prediction | 29.9  | 64.4  | 49.3      | 2497  | 42.8    | 42.8    | 8719   | 15,226
VisDrone | Kalman filter            | 35.3  | 69.9  | 50.6      | 1727  | 51.4    | 47.5    | 8998   | 12,302
VisDrone | VGM-PHD                  | 38.8  | 76.3  | 54.3      | 972   | 55.9    | 52.4    | 6883   | 10,204
UAVDT    | No trajectory prediction | 43.1  | 61.2  | 46.4      | 4437  | 32.3    | 17.4    | 49,018 | 99,620
UAVDT    | Kalman filter            | 52.8  | 68.0  | 56.1      | 3069  | 37.4    | 21.0    | 38,389 | 82,471
UAVDT    | VGM-PHD                  | 61.7  | 75.2  | 67.9      | 1216  | 45.3    | 24.6    | 24,915 | 59,640

5. Discussion

In this paper, we have integrated the strengths of joint detection and visual multi-object tracking algorithms with transformer-based visual multi-object tracking algorithms to address the unique challenges posed by drone aerial videos. Our proposed GAO-Tracker, which models object motion information, has demonstrated significant improvements in tracking performance, particularly in complex real-world scenarios.

5.1. Performance Analysis

GAO-Tracker achieves remarkable results on the VisDrone and UAVDT datasets, surpassing existing state-of-the-art methods in both accuracy and robustness. Integrating the transformer model's ability to capture global context with the joint detection and tracking methods' handling of occlusions and scale variations has proven effective. The results indicate that our approach can maintain high tracking accuracy even in challenging environments characterized by rapid object motion, complex backgrounds, and varying object scales.

5.2. Strengths

(1) Enhanced accuracy: By leveraging the transformer model's self-attention mechanism, GAO-Tracker effectively captures long-range dependencies and global contexts, which are crucial for accurately tracking multiple objects in aerial videos.

(2) Robustness to occlusions and scale variations: The joint detection and tracking methods integrated into GAO-Tracker enable it to handle occlusions and significant scale variations efficiently, ensuring continuous and reliable tracking.

(3) Practical solutions: GAO-Tracker provides practical solutions to real-world multi-object tracking problems, making it highly applicable in various domains such as surveillance, rescue operations, and urban planning.

5.3. Limitations

(1) Computational complexity: Despite its accuracy and robustness, the computational expense associated with the transformer model's fine-grained self-attention mechanism
remains a challenge. This could limit the real-time applicability of GAO-Tracker in resource-constrained environments.

(2) Scalability: While GAO-Tracker performs well on benchmark datasets, its scalability to extremely large-scale datasets or highly crowded scenes requires further exploration and optimization.

5.4. Future Directions

To further enhance the performance and applicability of GAO-Tracker, several future research directions are proposed:

(1) Algorithm optimization: Efforts will focus on optimizing the algorithm to reduce computational complexity and improve real-time performance. This includes exploring more efficient implementations of the transformer model and refining its integration with the detection and tracking components.

(2) Broader application areas: Extending the research findings to benefit more diverse fields is a key future direction. Improvements in drone-based multi-object tracking can be adapted for use in autonomous driving, security systems, wildlife monitoring, and other domains requiring accurate and reliable tracking of multiple objects.

(3) Handling complex scenarios: Further research is needed to enhance GAO-Tracker's performance in highly dynamic and crowded environments, including methods that better handle dense object interactions and rapidly changing scenes.

(4) Long-term tracking: Enhancing the system's ability to maintain long-term tracking stability and accuracy, particularly in scenarios with frequent object disappearances and reappearances, is another important area for future work.

6. Conclusions

This paper aims to integrate the strengths of joint detection and visual multi-object tracking algorithms with transformer-based visual multi-object tracking algorithms to improve the performance of multi-object tracking in drone aerial videos. Additionally, we propose a more comprehensive, robust, and efficient integrated multi-object tracking algorithm by modeling object motion information. By leveraging the advanced capabilities of transformer models to capture global contexts and the strengths of joint detection and tracking methods in handling occlusions and scale variations, our approach addresses the unique challenges posed by drone aerial videos, such as rapid object motion, complex environmental conditions, and data noise. This integration allows for more accurate and reliable tracking of multiple objects, enhancing the overall performance and robustness of tracking systems in various real-world scenarios.

A series of novel results has been achieved in drone aerial multi-object tracking, with GAO-Tracker demonstrating excellent results on the VisDrone and UAVDT datasets. These widely used benchmarks show that our method significantly outperforms existing state-of-the-art methods in both accuracy and robustness, indicating GAO-Tracker's strong potential for practical applications in surveillance, rescue operations, agriculture, and urban planning, among others. The practical solutions provided by GAO-Tracker to multi-object tracking problems in real-world scenarios offer new ideas and methods for the development of drone visual tracking. Our approach not only contributes to the current body of knowledge but also paves the way for future research in this area. In the future, efforts will focus on improving and optimizing algorithms to further enhance multi-object tracking performance.
This includes refining the integration of detection and tracking components, enhancing the efficiency of the transformer model, and exploring new ways to handle challenging scenarios such as crowded environments and dynamic backgrounds. Additionally, endeavors will be made to extend the research findings to broader application areas to benefit more diverse fields. For instance, improvements in drone-based multi-object tracking can be adapted for use in autonomous driving, security systems, wildlife monitoring, and other areas where real-time, accurate tracking of multiple objects
is critical. By expanding the applicability of our research, we aim to contribute to the advancement of technology across various domains, ultimately enhancing the capabilities and reliability of multi-object tracking systems.

Author Contributions: Conceptualization, Y.Y.; methodology, Y.Y., Y.W., and Y.L.; software, Y.P. and Y.L.; formal analysis, Y.W.; investigation, Y.W.; resources, Y.P.; data curation, L.Z., Y.P., and Y.L.; writing—original draft, Y.Y.; writing—review and editing, Y.W. and L.Z.; visualization, Y.L.; supervision, L.Z.; project administration, Y.P. All authors have read and agreed to the published version of the manuscript.

Funding: This work was supported by the Funding for Outstanding Doctoral Dissertation in NUAA under grant BCXJ24-10, the Postgraduate Research and Practice Innovation Program of Jiangsu Province under grant KYCX24_0583, the National Natural Science Foundation of China under grant 61573183, and the Natural Science Foundation of Shaanxi Province of China under grant 2024JC-YBQN-0695.

Data Availability Statement: The data presented in this study are available on request from the corresponding author.

Conflicts of Interest: The authors declare no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript:
MOT      Multiple object tracking
GAO      Gaussian, appearance, and optimal subpattern assignment
IOU      Intersection over union
OSPA     Optimal subpattern assignment
VGM-PHD  Visual Gaussian mixture probability hypothesis density
MOTA     Multiple object tracking accuracy
MOTP     Multiple object tracking precision

References
1. Wu, X.; Li, W.; Hong, D.; Tao, R.; Du, Q. Deep learning for unmanned aerial vehicle-based object detection and tracking: A survey. IEEE Geosci. Remote Sens. Mag. 2021, 10, 91–124.
2. Li, Y.; Zhang, H.; Yang, Y.; Liu, H.; Yuan, D. RISTrack: Learning Response Interference Suppression Correlation Filters for UAV Tracking. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5.
3. Dai, M.; Hu, J.; Zhuang, J.; Zheng, E. A transformer-based feature segmentation and region alignment method for UAV-view geo-localization. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 4376–4389.
4. Yi, S.; Liu, X.; Li, J.; Chen, L. UAVformer: A composite transformer network for urban scene segmentation of UAV images. Pattern Recogn. 2023, 133, 109019.
5. Yongqiang, X.; Zhongbo, L.; Jin, Q.; Zhang, K.; Zhang, B.; Feng, Q. Optimal video communication strategy for intelligent video analysis in unmanned aerial vehicle applications. Chin. J. Aeronaut. 2020, 33, 2921–2929.
6. Bochinski, E.; Eiselein, V.; Sikora, T. High-speed tracking-by-detection without using image information. In Proceedings of the 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–6.
7. Chen, G.; Wang, W.; He, Zh.; Wang, L.; Yuan, Y.; Zhang, D.; Zhang, J.; Zhu, P.; Van G.; Han, J.; Hoi, S.; Hu, Q.; Liu, M. VisDrone-MOT2021: The Vision Meets Drone Multiple Object Tracking Challenge Results. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 2839–2846.
8. Bisio, I.; Garibotto, C.; Haleem, H.; Lavagetto, F.; Sciarrone, A. Vehicular/Non-Vehicular Multi-Class Multi-Object Tracking in Drone-Based Aerial Scenes. IEEE Trans. Veh. Technol. 2023, 73, 4961–4977.
9. Lin, Y.; Wang, M.; Chen, W.; Gao, W.; Li, L.; Liu, Y. Multiple Object Tracking of Drone Videos by a Temporal-Association Network with Separated-Tasks Structure. Remote Sens. 2022, 14, 3862.
10. Al-Shakarji, N.; Bunyak, F.; Seetharaman, G.; Palaniappan, K. Multi-object tracking cascade with multi-step data association and occlusion handling. In Proceedings of the 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, 27–30 November 2018; pp. 1–6.
11. Yu, H.; Li, G.; Zhang, W.; Yao, H.; Huang, Q. Self-balance motion and appearance model for multi-object tracking in UAV. In Proceedings of the 1st ACM International Conference on Multimedia in Asia, Beijing, China, 15–18 December 2019; pp. 1–6.
12. Wang, Z.; Zheng, L.; Liu, Y.; Li, Y.; Wang, S. Towards real-time multi-object tracking. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 107–122.
13. Wu, H.; Nie, J.; He, Z.; Zhu, Z.; Gao, M. One-shot multiple object tracking in UAV videos using task-specific fine-grained features. Remote Sens. 2022, 14, 3853.
14. Shi, L.; Zhang, Q.; Pan, B.; Zhang, J.; Su, Y. Global-Local and Occlusion Awareness Network for Object Tracking in UAVs. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 8834–8844.
15. Zhou, X.; Koltun, V.; Krähenbühl, P. Tracking objects as points. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 474–490.
16. Peng, J.; Wang, C.; Wan, F.; Wu, Y.; Wang, Y.; Tai, Y.; Wang, C.; Li, J.; Huang, F.; Fu, Y. Chained-tracker: Chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 145–161.
17. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229.
18. Tsai, C.; Shen, G.; Nisar, H. Swin-JDE: Joint detection and embedding multi-object tracking in crowded scenes based on Swin-Transformer. Eng. Appl. Artif. Intel. 2023, 119, 105770.
19. Hu, M.; Zhu, X.; Wang, H.; Cao, S.; Liu, C.; Song, Q. STDFormer: Spatial-Temporal Motion Transformer for Multiple Object Tracking. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 6571–6594.
20. Zeng, F.; Dong, B.; Zhang, Y.; Wang, T.; Zhang, X.; Wei, Y. MOTR: End-to-end multiple-object tracking with transformer. In Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 659–675.
21. Cai, J.; Xu, M.; Li, W.; Xiong, Y.; Xia, W.; Tu, Z.; Soatto, S. MeMOT: Multi-object tracking with memory. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8090–8100.
22. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Hu, Q.; Ling, H. Vision meets drones: Past, present and future. arXiv 2020, arXiv:2001.06303.
23. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386.
24. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
25. Fang, Y.; Liao, B.; Wang, X.; Fang, J.; Qi, J.; Wu, R.; Niu, J.; Liu, W. You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection. Adv. Neural Inf. Process. Syst. 2021, 34, 26183–26197.
26. Li, Y.; Mao, H.; Girshick, R.; He, K. Exploring plain vision transformer backbones for object detection. In Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 280–296.
27. Ye, T.; Qin, W.; Zhao, Z.; Gao, X.; Deng, X.; Ouyang, Y. Real-Time Object Detection Network in UAV-Vision Based on CNN and Transformer. IEEE Trans. Instrum. Meas. 2023, 72, 2505713.
28. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.; Zhang, L. DN-DETR: Accelerate DETR training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13619–13627.
29. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649.
30. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-object tracking by associating every detection box. In Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 1–21.
31. Aharon, N.; Orfaig, R.; Bobrovsky, B. BoT-SORT: Robust associations multi-pedestrian tracking. arXiv 2022, arXiv:2206.14651.
32. Liu, S.; Li, X.; Lu, H.; He, Y. Multi-Object Tracking Meets Moving UAV. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8876–8885.
33. Deng, K.; Zhang, C.; Chen, Z.; Hu, W.; Li, B.; Lu, F. Jointing Recurrent Across-Channel and Spatial Attention for Multi-Object Tracking With Block-Erasing Data Augmentation. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4054–4069.
34. Xiao, C.; Cao, Q.; Zhong, Y.; Lan, L.; Zhang, X.; Cai, H.; Luo, Z. Enhancing Online UAV Multi-Object Tracking with Temporal Context and Spatial Topological Relationships. Drones 2023, 7, 389.
35. Keawboontan, T.; Thammawichai, M. Toward Real-Time UAV Multi-Target Tracking Using Joint Detection and Tracking. IEEE Access 2023, 11, 65238–65254.
36. Li, J.; Ding, Y.; Wei, H.; Zhang, Y.; Lin, W. SimpleTrack: Rethinking and improving the JDE approach for multi-object tracking. Sensors 2022, 22, 5863.
37. Sun, P.; Cao, J.; Jiang, Y.; Zhang, R.; Xie, E.; Yuan, Z.; Wang, C.; Luo, P. TransTrack: Multiple object tracking with transformer. arXiv 2020, arXiv:2012.15460.
38. Meinhardt, T.; Kirillov, A.; Leal-Taixe, L.; Feichtenhofer, C. TrackFormer: Multi-object tracking with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8844–8854.
39. Xu, Y.; Ban, Y.; Delorme, G.; Gan, C.; Rus, D.; Alameda-Pineda, X. TransCenter: Transformers with dense representations for multiple-object tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 7820–7835.
40. Zhou, X.; Yin, T.; Koltun, V.; Krähenbühl, P. Global Tracking Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8771–8780.
41. Chen, M.; Liao, Y.; Liu, S.; Wang, F.; Hwang, J. TR-MOT: Multi-Object Tracking by Reference. arXiv 2022, arXiv:2203.16621.
42. Wu, H.; He, Z.; Gao, M. GCEVT: Learning Global Context Embedding for Vehicle Tracking in Unmanned Aerial Vehicle Videos. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5.
43. Xu, X.; Feng, Z.; Cao, C.; Yu, C.; Li, M.; Wu, Z.; Ye, S.; Shang, Y. STN-Track: Multiobject Tracking of Unmanned Aerial Vehicles by Swin Transformer Neck and New Data Association Method. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 8734–8743.

Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.