Abstract

To address the difficulty that traditional heuristic algorithms have in extracting an empirical model from large-sample terrain data in a timely manner, a multi-UAV collaborative path planning method based on attention reinforcement learning is proposed. The method jointly considers influencing factors such as survival probability, path length, load balancing, and endurance constraints, and serves as a support system for multi-UAV collaborative optimization. An attention neural network is used to generate the cooperative reconnaissance strategy of the UAVs, and the network is trained on a large amount of simulation data using the REINFORCE algorithm. Experimental results show that the proposed method is effective for multi-UAV path planning problems with high real-time requirements and that its solving time is shorter than that of traditional algorithms.

1. Introduction

Earthquakes usually cause serious damage to roads and buildings in the affected area. If relief staff can grasp the local disaster situation as early as possible and then take effective rescue measures, this plays a vital role in emergency rescue and road repair after an earthquake.

In recent years, the rapid development of UAV technology has made it possible to acquire and evaluate disaster information quickly [1, 2]. Carrying out on-site investigation with UAVs immediately after a disaster occurs has become a widely accepted emergency measure.

Owing to their simple operation and strong maneuverability, UAVs have also been widely used in fields such as the military, agriculture, and environmental protection [3-8] and have attracted attention around the world. In multi-UAV collaborative reconnaissance path planning, the distributed nature of the UAVs notably improves the working ability and robustness of the system. The multi-UAV collaborative path planning strategy, as the basis of collaborative exploration, determines how efficiently the aircraft use the currently available resources. In a real environment, restricted by various environmental factors, multi-UAV collaborative path planning can be regarded as an NP-hard combinatorial optimization problem with a number of constraints [9].

At present, the main classical models for the multi-UAV cooperative reconnaissance problem include the multiple traveling salesman problem (MTSP) model, the mixed-integer linear programming (MILP) model, and the vehicle routing problem (VRP) model [9]. These models are solved by heuristic algorithms such as the genetic algorithm [10], the simulated annealing algorithm, and evolutionary algorithms [11]. However, the traditional models cannot completely describe the constraints of multi-UAV cooperative reconnaissance missions. Under complex environmental conditions and with constantly changing terrain data, these heuristic algorithms must be rerun to reoptimize the solution, so they adapt poorly and cannot give a solution quickly [12]. For models with variable environmental information, Yuan [13] proposed a genetic algorithm based on the bipartite successive revision method for problems whose distance matrix changes over time, and Shi [14] investigated an algorithm for the VRP with variable demand. In fact, these approaches simply introduce additional variables into the original specific problem and only tune and improve the traditional heuristic algorithm. They do not capture the underlying regularities of path planning and cannot avoid the repeated reoptimization iterations of traditional heuristic algorithms, so their real-time performance is poor.

With the rapid development of deep learning, neural networks are increasingly being used to solve combinatorial optimization problems. Vinyals et al. proposed the pointer network (PN) [15], an end-to-end model trained with supervised learning, to solve TSP instances. However, the effectiveness of supervised learning depends on the training sample data, which limits the ability of the PN to find better solutions. Nazari et al. trained an end-to-end network with reinforcement learning, which greatly improved effectiveness over the PN on the TSP [16].

In view of the shortcomings of traditional heuristic algorithms, this paper proposes a collaborative reinforcement learning algorithm based on the attention model. A neural network represents the collaborative reconnaissance strategy of the UAVs, and a large amount of simulation data is used to pretrain the model with the REINFORCE algorithm [17] so that it learns the underlying relationship between the data (site model) and the optimized path. Experimental results show that, for UAV collaborative reconnaissance problems with the same location distribution, number of nodes, and constraints, the algorithm based on the attention model gives an optimized solution more quickly. No reoptimization iterations are required during computation, and its solution time and adaptability are significantly better than those of traditional heuristic algorithms.

2. Multi-UAV Collaborative Reconnaissance Model

For a multi-UAV collaborative reconnaissance mission, the threat levels and locations of the target regions are assumed to have been roughly assessed before the investigation, and multiple UAVs need to conduct detailed reconnaissance of multiple target regions in the affected area.

It is assumed that a given number of UAVs are required to reconnoiter a given number of target regions in detail. The endurance range of the UAVs and the threat level of the targets are considered together to set up the model of the reconnaissance mission.

In this paper, the path planning problem of the multi-UAV collaborative reconnaissance mission is described as follows:
Base: the location from which the UAVs depart and to which they return after completing the reconnaissance mission, specified by its coordinates.
UAV: the UAVs are represented by a set, and the maximum endurance range of the UAVs is given.
Detection target: the targets to be detected are represented by the set T, in which 0 represents the base and the other elements of T, called nodes, represent the local targets to be detected. Each target has known coordinates and an associated survival probability of the UAV when detecting it.
Path set: the detection path of each UAV, determined according to the basic principle of safe flight after considering the various terrain conditions.

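The decision variables of the multi-UAV collaborative reconnaissance mission are defined in equation (1).

The multi-UAV collaborative reconnaissance model is then established as the objective function in equation (2), subject to the constraints described below.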

Taking into account the UAVs' endurance range, survival probability, and execution capability, this paper adopts a relatively simple linear combination that uses different proportionality coefficients to combine the individual objective functions of the reconnaissance mission. Equation (2) is the model function for multi-UAV cooperative detection: it combines the three optimized objective functions of the UAV reconnaissance model, each weighted by its corresponding proportionality coefficient.

Formula (3) requires that each node be visited once by a UAV, and formula (4) requires that the UAV depart from each node once. Many factors restrict long-duration, high-altitude flight by multiple UAVs. For example, if the cruising distance is too long, the mission may fail outright because of the limited capacity of the on-board battery. Communication may also fail because of interference from external uncertainties, making it impossible to transmit valid data to the ground station. Therefore, to reduce flight risk, the total flight distance of the UAVs should be as short as possible. Equation (5) states the range constraint: the flight distance of each UAV cannot exceed its maximum range. The total flight mileage of all UAVs is given by formula (6), which is one of the optimization objectives of the reconnaissance mission.
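
The visit and range constraints described above can be checked for a candidate plan with a few lines of Python. This is only a minimal sketch under stated assumptions: each route is a list of node indices that implicitly starts and ends at the base (index 0), distances are Euclidean, and the names `route_length`, `is_feasible`, and `max_range` are hypothetical rather than taken from the paper.

```python
import math

def route_length(route, coords):
    """Length of base -> route nodes -> base, with the base stored at index 0."""
    stops = [0] + list(route) + [0]
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(stops, stops[1:]))

def is_feasible(routes, coords, max_range):
    """True if every target is visited exactly once and no route exceeds max_range."""
    visited = sorted(node for route in routes for node in route)
    targets = list(range(1, len(coords)))
    if visited != targets:
        return False
    return all(route_length(route, coords) <= max_range for route in routes)

# Toy instance (assumed data): base at the origin and three targets.
coords = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
routes = [[1, 2], [3]]
total_distance = sum(route_length(route, coords) for route in routes)
```

Here `total_distance` plays the role of the total flight mileage objective, while `is_feasible` rejects plans that skip a target, visit one twice, or exceed the assumed range limit.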

Since each node to be detected poses a different threat level to the UAV, the UAV has to pick the safest flight path while detecting multiple nodes. The UAV survival coverage function is defined in equation (7) in terms of the start node and end node of the UAV's reconnaissance path and the survival probability of the UAV when detecting node i. For example, suppose that a UAV passes through nodes 1, 2, and 3, whose survival probabilities are 0.9, 0.8, and 0.7, respectively. Then the survival coverage function of the UAV is 0.9 at node 1, 0.9 × 0.8 = 0.72 at node 2, and 0.9 × 0.8 × 0.7 = 0.504 at node 3. The survival coverage function of the entire reconnaissance path of a UAV is the cumulative sum of the survival coverage functions of its nodes, as shown in equation (8). Formula (9) indicates that the survival coverage function is to be maximized.
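
The worked example in this paragraph can be reproduced with a short sketch. The function names `survival_coverage` and `path_survival` are hypothetical; the computation simply follows the description in the text, namely a running product of the survival probabilities along the path, followed by a sum over the nodes.

```python
from itertools import accumulate
from operator import mul

def survival_coverage(probabilities):
    """Running product of survival probabilities at each node along one path."""
    return list(accumulate(probabilities, mul))

def path_survival(probabilities):
    """Sum of the per-node survival coverage values along the path."""
    return sum(survival_coverage(probabilities))

# Nodes 1, 2, and 3 from the example in the text.
coverage = survival_coverage([0.9, 0.8, 0.7])
total = path_survival([0.9, 0.8, 0.7])
```

Up to floating-point rounding, `coverage` holds 0.9, 0.72, and 0.504, and `total` is their sum, 2.124.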

In actual missions, to avoid overloading a single UAV, the load of each UAV is balanced by introducing the variance of the number of nodes assigned to the UAVs, as shown in formula (10).
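
The load-balancing term and the linear combination of the three objectives might be evaluated as in the sketch below. Several points are assumptions rather than statements from the paper: the load is measured as the number of nodes per UAV, the population variance is used, and the weights (0.3, −0.5, 0.2) reported in the simulation section are taken to apply to total distance, total survival coverage, and load variance, in that order. The function names are hypothetical.

```python
from statistics import pvariance

def load_variance(routes):
    """Population variance of the number of nodes assigned to each UAV."""
    return pvariance([len(route) for route in routes])

def combined_objective(total_distance, total_survival, load_var,
                       weights=(0.3, -0.5, 0.2)):
    """Weighted linear combination of the three objectives (assumed ordering).

    The negative weight on survival means that minimising the combined value
    rewards plans with a higher survival coverage.
    """
    w_dist, w_surv, w_load = weights
    return w_dist * total_distance + w_surv * total_survival + w_load * load_var

balanced = load_variance([[1, 2], [3, 4], [5, 6]])
unbalanced = load_variance([[1, 2], [3, 4], [5]])
value = combined_objective(10.0, 4.0, balanced)
```

A perfectly balanced assignment has zero variance, so only the distance and survival terms contribute to `value` in this example.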

3. Solving Algorithm for Model

3.1. Attention Model

The attention model [18] is a framework proposed by Google for sequence modeling problems and has achieved leading results in neural machine translation. While most traditional neural machine translation systems use an RNN or CNN as the basis of the encoder-decoder framework [9], the attention model is based solely on the transformer structure and discards these conventional components. The attention model can be computed in a highly parallel manner, so it significantly improves training speed as well as translation performance.

The combinatorial optimization of the multi-UAV collaborative reconnaissance model can also be treated as a sequence problem. In this paper, an end-to-end model based on attention is used to train a stochastic policy that gives the path plan under the constraints of the current terrain, where s is an instance of the current planning problem, as expressed in equation (11).

In the end-to-end model, the encoder encodes the input node coordinates into embeddings, and the decoder takes the encoder output as its input. The decoding process is constrained by a mask mechanism so that it yields a feasible path plan.

3.2. Encoder

In multi-UAV collaborative detection, the information of the nodes has no sequential order. The encoder of the attention model uses the transformer structure [18], and the transformer encoder by default takes the order of the input sequence into account. To keep the order of the input nodes from interfering with path planning, the position embedding layer is omitted from the encoder structure adopted in this paper, as shown in Figure 1.

The coordinates of the nodes are mapped into fixed-dimensional embeddings by a learned linear mapping layer. To distinguish the base from the target nodes, separate parameters are used for each, as given in equation (12).
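
The idea of embedding the base and the target nodes with separate linear maps can be illustrated in plain Python. This is a sketch, not the paper's implementation: the embedding size of 8, the random initialisation, and the helper names `make_linear`, `linear`, and `embed_nodes` are all assumptions chosen to keep the example small.

```python
import random

def make_linear(in_dim, out_dim, rng):
    """Random weight matrix and bias for a linear layer (illustrative initialisation)."""
    bound = 1.0 / in_dim ** 0.5
    weight = [[rng.uniform(-bound, bound) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [rng.uniform(-bound, bound) for _ in range(out_dim)]
    return weight, bias

def linear(x, params):
    """Apply y = W x + b to a single input vector."""
    weight, bias = params
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def embed_nodes(coords, base_params, node_params):
    """Embed index 0 (the base) with base_params and all other nodes with node_params."""
    return [linear(xy, base_params if i == 0 else node_params)
            for i, xy in enumerate(coords)]

rng = random.Random(0)
embed_dim = 8  # assumed size, for illustration only
base_params = make_linear(2, embed_dim, rng)
node_params = make_linear(2, embed_dim, rng)
embeddings = embed_nodes([(0.5, 0.5), (0.1, 0.9), (0.8, 0.2)],
                         base_params, node_params)
```

Because the two parameter sets differ, a base and a target located at the same coordinates still receive different embeddings, which is what lets the network tell them apart.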

The embeddings of the input nodes are updated through N attention layers, each of which consists of two sublayers: a multihead attention (MHA) layer and a nodewise fully connected feedforward (FF) layer [19]. Each sublayer has a skip connection, which alleviates the problem that learning efficiency deteriorates as the network deepens, followed by batch normalization (BN) [20]; the calculation formulas are shown in equations (13) and (14). These equations define the node embedding generated by each layer, and the layer superscript on BN and MHA indicates that parameters are not shared between layers. The MHA layer uses M = 8 heads of equal dimension. The FF layer uses a hidden layer with a dimension of 512 and the ReLU activation function. Experiments show that BN gives better results than the normalization used in the original transformer encoder, so this paper uses BN to normalize the network output.

The graph embedding is obtained by averaging the node embeddings of the network output layer; the output-layer node embeddings and the graph embedding are then passed to the decoder.
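
Averaging the output-layer node embeddings is a one-line operation. The sketch below uses plain lists and a hypothetical function name, `graph_embedding`, purely to make the step concrete.

```python
def graph_embedding(node_embeddings):
    """Element-wise mean of the node embeddings from the encoder output layer."""
    count = len(node_embeddings)
    return [sum(column) / count for column in zip(*node_embeddings)]

# Two toy node embeddings of dimension 2.
pooled = graph_embedding([[1.0, 2.0], [3.0, 4.0]])
```

The result has the same dimension as a single node embedding, so it can be combined with node embeddings in the decoder.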

3.3. Decoder

The decoder generates the path plan from the node embeddings of the encoder's output layer, the graph embedding [21], and the context embedding. Decoding proceeds step by step, as shown in Figure 2. At each time step t, the decoder builds the context embedding from the encoder embeddings and the node selected at time t − 1. The context embedding is defined in equation (15), which involves the remaining cruising range of the UAV at the current moment and the distance flown in the mission up to time t. The horizontal concatenation operator in the definition joins three vectors into one vector.
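
A possible reading of the context embedding is sketched below. It assumes that the three concatenated vectors are the graph embedding, the embedding of the node selected at the previous step, and the remaining cruising range as a one-element vector, and that the remaining range is the maximum range minus the distance flown so far. These choices and the function names are assumptions for illustration, not the paper's exact definition.

```python
def remaining_range(max_range, distance_flown):
    """Range still available to the UAV (assumed: maximum range minus distance flown)."""
    return max_range - distance_flown

def context_embedding(graph_emb, last_node_emb, range_left):
    """Horizontally concatenate three vectors into one context vector."""
    return list(graph_emb) + list(last_node_emb) + [range_left]

left = remaining_range(3.0, 1.25)
context = context_embedding([0.1, 0.2], [0.3, 0.4], left)
```

With embeddings of dimension d, the context vector has length 2d + 1 under these assumptions.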

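The decoder then applies an MHA layer (distinct from the MHA of the encoder), in which the query is computed from the context embedding and the keys and values are computed from the node embeddings, as given in equation (16). For computational efficiency, the skip connection, batch normalization, and feedforward sublayer are not used.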

The compatibility (similarity) between the context and each node is calculated by the above formula. At the same time, nodes that do not meet the constraints are masked by setting their compatibility to negative infinity. To derive the policy in equation (11), a single attention layer is used as the last layer of the network, as given in equation (17).

Finally, the probability of visiting each node is obtained from equation (18), in which the nodes already visited and the nodes that do not meet the constraints are masked when the compatibility and probability are calculated.
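
The masking step can be illustrated with a small sketch of scaled dot-product compatibilities followed by a softmax. The scaling by the square root of the vector dimension is the standard form of scaled dot-product attention and is assumed here; the function names `compatibilities` and `masked_softmax` are hypothetical.

```python
import math

def compatibilities(query, keys, mask):
    """Scaled dot-product scores; masked nodes receive negative infinity."""
    scale = math.sqrt(len(query))
    return [
        -math.inf if masked
        else sum(q * k for q, k in zip(query, key)) / scale
        for key, masked in zip(keys, mask)
    ]

def masked_softmax(scores):
    """Softmax that assigns zero probability to scores of negative infinity."""
    finite = [s for s in scores if s != -math.inf]
    if not finite:
        raise ValueError("all nodes are masked")
    top = max(finite)  # subtract the maximum for numerical stability
    exps = [0.0 if s == -math.inf else math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
mask = [False, False, True]  # the third node is visited or infeasible
probabilities = masked_softmax(compatibilities(query, keys, mask))
```

The masked node would otherwise have the highest score, but after masking it receives probability zero and the remaining probability mass is shared by the feasible nodes.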

During decoding, the next node is sampled according to the probabilities given by equation (18).
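
Sampling the next node from the computed probabilities can be written with the standard library, as in the sketch below. The function name `sample_node` and the example probabilities are hypothetical; a greedy variant that always takes the most probable node is shown only for comparison and is not described in the text.

```python
import random

def sample_node(probabilities, rng):
    """Draw one node index in proportion to its visiting probability."""
    nodes = list(range(len(probabilities)))
    return rng.choices(nodes, weights=probabilities, k=1)[0]

def greedy_node(probabilities):
    """Deterministic alternative: pick the most probable node."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])

rng = random.Random(0)
masked_probabilities = [0.7, 0.3, 0.0]  # the third node is masked
draws = [sample_node(masked_probabilities, rng) for _ in range(200)]
```

Because the masked node has zero weight, it never appears in `draws`, so sampled plans respect the mask.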

3.4. Pretraining Model

The REINFORCE algorithm adopted in this paper is a policy gradient algorithm, which means that the policy is parameterized. The policy function is expressed in equation (19).

The policy function gives the probability of taking each action in a given state under a given parameter setting; thus, it is essentially a probability distribution over actions. To find the optimal policy, the gradient in formula (20) is obtained by differentiating the objective function; it involves the score function and the action-value function.

After the model is constructed, the loss function in equation (21) is defined so that the model can be trained. The objective function of reinforcement learning is modified accordingly so that training optimizes the path plan.

In the REINFORCE algorithm, a baseline reduces the variance of the gradient estimate and speeds up training of the model. In this paper, a moving average model is used as the baseline for the REINFORCE algorithm.
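
The role of the baseline can be sketched as follows. The sketch assumes an exponential moving average of the observed cost with a decay rate of 0.8 (an illustrative value) and shows the REINFORCE surrogate whose gradient with respect to the policy parameters gives the baseline-corrected policy gradient. In a real implementation the log-probabilities would come from an automatic differentiation framework; here they are plain numbers, and all names are hypothetical.

```python
def update_baseline(baseline, cost, decay):
    """Exponential moving average of the observed costs (assumed baseline form)."""
    return decay * baseline + (1.0 - decay) * cost

def reinforce_surrogate(costs, log_probs, baseline):
    """Batch mean of (cost - baseline) * log-probability of the sampled plan."""
    terms = [(cost - baseline) * log_prob
             for cost, log_prob in zip(costs, log_probs)]
    return sum(terms) / len(terms)

baseline = update_baseline(10.0, 5.0, decay=0.8)
surrogate = reinforce_surrogate([2.0, 4.0], [-1.0, -2.0], baseline=3.0)
```

Plans with a cost below the baseline get a negative weight, so minimising the surrogate raises their log-probability, while plans with a cost above the baseline are made less likely.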

As shown in Figure 3, after the model and the optimization goal are defined, the model is trained on a large amount of simulation data so that it learns the relationship between the data and the optimized path. Once the performance indicators are satisfactory, the model is extracted with its parameters fixed, and real terrain data can then be input to solve the UAV reconnaissance problem in real time. Traditional heuristic algorithms, such as genetic algorithms, ant colony algorithms, and particle swarm algorithms, must go through a process of optimization, iteration, and evolution whenever they process new data [12, 13, 22], which is equivalent to retraining on a single set of data with a long iteration time. Terrain data change rapidly, however, and such a process is not suitable for applications that require high real-time performance. The trained neural network model learns rules and features from the simulated data and gives the optimized path directly from the input data. It therefore skips the reiteration process and uses learned experience to solve similar problems, so its adaptability is significantly stronger than that of traditional heuristic models. The model can also incorporate an online training mechanism (shown by the dashed box) that uses real terrain data to continue training, which continuously improves performance and allows the model to evolve.

4. Simulation Test

The value M of the moving average model and its attenuation rate are initialized before training. The learning rate of the Adam optimizer [23] is initialized, the number of encoder attention layers is N = 3, the number of nodes is 20, and the network parameters are initialized from a uniform distribution. In this paper, five UAVs are simulated scouting 20 nodes in a fixed area, and the survival probability and total range of the UAVs are optimized under the condition that the UAVs' endurance range is satisfied. The simulation scene also meets the following conditions:
(1) Each UAV reaches only one target
(2) Each target is covered by a UAV
(3) During the flight, the UAV should keep a certain safe distance from obstacles
(4) During the flight, the UAV should keep a certain safe distance from other UAVs

In the UAV collaborative reinforcement learning algorithm based on the attention model, the node and base coordinates are drawn from the uniform distribution on [0, 1], and the survival probability of the UAV when detecting target i is randomly generated. The coefficients of the three objective functions in the loss function are [0.3, −0.5, 0.2]. Table 1 lists the node coordinates, base coordinates, and survival data of the detection area.
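
A simulated instance of the kind described here could be generated as in the sketch below. The coordinates follow the uniform distribution on [0, 1] stated in the text; the interval for the survival probabilities, [0.5, 1.0], is a placeholder assumption, as are the function name `make_instance` and the seed.

```python
import random

def make_instance(num_nodes, rng, survival_low=0.5, survival_high=1.0):
    """Random base and node coordinates plus one survival probability per node.

    Index 0 of the coordinate list is the base. The survival interval is an
    assumed placeholder, not a value taken from the paper.
    """
    coords = [(rng.random(), rng.random()) for _ in range(num_nodes + 1)]
    survival = [rng.uniform(survival_low, survival_high)
                for _ in range(num_nodes)]
    return coords, survival

instance_coords, instance_survival = make_instance(20, random.Random(1))
```

Calling `make_instance` repeatedly with different seeds yields the kind of large simulated data set on which the model is pretrained.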

The experiments are implemented in the deep learning framework PyTorch. The computer used in the experiments has an AMD Ryzen 5 1600X CPU, a GTX 1080 GPU, and 16 GB of memory. After 100 rounds of network training, the test data sets are used to compare the model proposed in this paper with traditional approaches, and the results are shown in Table 2. In the table, AM is the model of this paper, CVPR is the traditional model, and TSC-GA is the genetic algorithm based on the two-side successive revision method proposed in [3]; the timing data for TSC-GA are taken from the original publication. The attention prefix in the table indicates that an attention neural network is used to build the model, and performance is compared on CPU and GPU for different numbers of nodes. The AM model used in this paper was tested 10 times with 1000 sets of data, and the traditional CVPR100 model was tested 100 times with 100 sets of test data, for a total of 10000 sets in each case. The adaptive large neighborhood search algorithm proposed in [2] for the SDVRP with variable demand points has an optimization iteration time on the order of 500 s and is not listed in Table 2.

It can be seen from Table 2 that, whether the algorithm based on the attention model runs on the CPU or is accelerated by CUDA, the time for a single group of simulated reconnaissance operations is much shorter than that of the traditional model based on an evolutionary algorithm (TSC-GA). Various situations in actual terrain can therefore be handled quickly and in real time, which is also much better than the adaptive large neighborhood search algorithm. The model based on a traditional evolutionary algorithm must be iteratively reoptimized [13] when it encounters different tasks with the same number of nodes, whereas the trained collaborative model based on the attention mechanism handles similar problems quickly without retraining. Moreover, the model based on the attention mechanism can process instances in parallel batches, handling 1000 (100) sets of data at a time, so its throughput is far superior to that of traditional heuristic algorithms.

In terms of path length and survival probability on 1000 sets of test data, Table 3 shows that the AM model keeps the path distance within an acceptable range while greatly improving the survival probability compared with the traditional model.

The path planning of the AM model is shown in Figure 4, in which the legend lists the survival probability and path length of each UAV's planned path. Comparison with the objective function in equation (2) shows that the trained model has captured the characteristics of the data and optimized the cruising range, load, and survival probability of the UAVs. The figure shows that the model distributes the paths among the UAVs in a balanced way and maintains a high survival probability and that none of the UAV flight paths exceeds the endurance range.

The proposed path planning method based on the attention model can be applied in different settings, mainly military and civil. In the military domain, UAVs can be used as reconnaissance aircraft and target aircraft. In the civil domain, UAV technology can be widely used in aerial photography, plant protection, news reporting, microselfie, express transportation, agriculture, wildlife observation, disaster rescue, film and television shooting, infectious disease monitoring, disaster relief, power inspection, romantic manufacturing, and other fields.

5. Conclusions

The data analysis and experimental results show that the path planning algorithm for UAV cooperative reconnaissance based on the attention model has a clear advantage over traditional heuristic algorithms in real-time performance, can cope with complex terrain changes, and can be applied to earthquake relief. In addition, the algorithm can capture the underlying relationship between terrain data and the optimized path and produce reasonable path plans for multi-UAV collaborative reconnaissance missions.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Authors’ Contributions

T.W. developed the methodology; T.W. and S.Z. helped with software; T.W., B.Z., and M.Z. validated the study; B.Z. carried out formal analysis; T.W. investigated the study; T.W. helped with the resources; B.Z. curated the data; T.W. wrote and prepared the original draft; T.W. and B.Z. reviewed and edited the study; M.Z. visualized the study; T.W. supervised the study; T.W. administrated the project; T.W. helped with funding acquisition. All authors have read and agreed to the published version of the manuscript.

Acknowledgments

This work was supported in part by the Science and Technology Key Project of Henan Province, under Grant no. 212102210520, and the Science and Technology Key Project in the High-Tech Fields of Henan Province, under Grant no. 152102210123.