layers, where (7 + 4(NU − 1) + 4NT + 7NO) is the dimension of the state
information si = (sui, oi) of each UAV in a scenario with NU
UAVs, NT targets, and NO obstacles within the detection range. For
UAV i, after four FCs, the target assignment network maps the state
information si = (sui, oi) to the probabilities (PiT1, PiT2, ..., PiTNT)
of UAV i flying to targets (T1, T2, ..., TNT). The probabilities are first
normalized by the Softmax function [Equation (15)],
$$
p_{ij} = \frac{p_{iT_j}}{\sum_{j=1}^{N_T} p_{iT_j}} \tag{15}
$$
and then the Cross-Entropy calculation is performed with the
assigned labels to update the assignment network [Equation (16)],
$$
H(L, P) = -\sum_{j=1}^{N_T} l_j \log p_{ij} \tag{16}
$$
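The two steps in Equations (15) and (16) can be illustrated with a minimal Python sketch. The function names, the example scores, and the one-hot label are hypothetical and chosen only for illustration. The normalization follows Equation (15) as printed, which divides by the sum of the scores and therefore assumes non-negative scores with a positive sum; a conventional softmax would exponentiate the scores first, and that variant is not shown here.

```python
import math


def normalize_scores(scores):
    """Equation (15) as printed: p_ij = p_iTj / sum_j p_iTj.

    Assumption (not stated in the text): scores are non-negative
    and their sum is positive.
    """
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [s / total for s in scores]


def cross_entropy(labels, probs, eps=1e-12):
    """Equation (16): H(L, P) = -sum_j l_j * log(p_ij).

    eps guards against log(0); it is an implementation detail
    added for this sketch.
    """
    return -sum(l * math.log(max(p, eps)) for l, p in zip(labels, probs))


# Hypothetical raw outputs of UAV i for three targets T1, T2, T3.
scores = [1.0, 3.0, 4.0]
probs = normalize_scores(scores)

# Hypothetical one-hot label: UAV i is assigned to T3.
labels = [0, 0, 1]
loss = cross_entropy(labels, probs)
```

With a one-hot label, the loss reduces to the negative log-probability of the labeled target, so it shrinks as the network puts more probability on that target.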
(2) Construction of the assignment label
The bottom section of Figure 6 shows that
the training labels of the assignment network are provided by the TD3
framework. The task objective is to achieve a complete assignment
and minimize the total flight path length, but in random and dynamic
environments it is not accurate to base decisions only on the
distance between UAVs and targets. As mentioned
in Section 2.2, a multi-UAV problem in DRL amounts to maximizing the
joint cumulative reward of all UAVs; that is, each UAV
chooses the action that maximizes the Q-value based on its
current state. Compared with selecting the target only according
to distance, this method determines the assigned target according
to the Q-value, which comprehensively takes into account UAVs, targets,
and obstacles, even if obstacles are moving, so the targets can

This paper proposed the twin-delayed deep deterministic policy
gradient algorithm with a target assignment network (TANet-TD3).
Unlike existing methods that first assign targets for the
whole task and then plan paths according to the
assignment results, TANet-TD3 solves multi-UAV target
assignment and path planning simultaneously in dynamic multi-obstacle
environments. The framework of TANet-TD3 is shown
in Figure 5. The objective of the task is to
minimize the total flight path length of all UAVs subject to the complete
target assignment constraint and the collision-free constraint. TANet-TD3
introduces a target assignment network into the TD3 framework
to solve the two problems simultaneously. In the
overall process, the target assignment network provides the optimal
complete assignment of targets for the UAVs at each step (the green
dashed box), and the TD3 algorithm then guides each UAV to plan
a feasible path for this step (the blue dashed box) according to
the assignment result (the yellow dashed box). Meanwhile, the
training labels of the assignment network are obtained from the
path planning process driven by the TD3 algorithm (the purple dashed box).
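The label-generation step (the purple dashed box) can be sketched as follows. This is one plausible reading rather than the paper's exact procedure: it assumes a critic that yields one Q-value per candidate target for a given UAV, and it marks the target with the largest Q-value as the label. The function name and the Q-values are hypothetical.

```python
def assignment_label(q_values):
    """Return a one-hot label marking the target with the highest Q-value.

    Assumption for this sketch: q_values[j] is the Q-value of UAV i
    when it heads for target T(j+1). Ties go to the lowest index.
    """
    if not q_values:
        raise ValueError("q_values must not be empty")
    best = max(range(len(q_values)), key=lambda j: q_values[j])
    return [1 if j == best else 0 for j in range(len(q_values))]


# Hypothetical Q-values of UAV i for targets T1, T2, T3.
q_i = [-4.2, -1.3, -2.8]
label = assignment_label(q_i)
```

A label built this way can serve as the vector L in the cross-entropy loss of Equation (16), which is how the path-planning signal would reach the assignment network under this reading.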
This method not only takes into account the distance between
UAVs and targets but also considers the dynamic obstacles in task
environments, so it can generate an optimal assignment and path