Abstract
The deep deterministic policy gradient (DDPG) algorithm is a reinforcement learning method that has been widely used in UAV path planning. However, the critic network of DDPG is updated frequently during training, which leads to an inevitable overestimation problem and increases the computational complexity of training. Therefore, this paper presents a multicritic-delayed DDPG method for solving UAV path planning. It uses multicritic networks and a delayed learning method to reduce the overestimation problem of DDPG and adds noise to improve robustness in real environments. Moreover, a UAV mission platform is built to train and evaluate the effectiveness and robustness of the proposed method. Simulation results show that the proposed algorithm has a faster convergence speed, a better convergence effect, and higher stability, indicating that the UAV can learn more knowledge from the complex environment.
1. Introduction
In recent years, unmanned aerial vehicles (UAVs) have been widely applied; owing to their high maneuverability and rapid deployability, they have been used in search and rescue [1], multi-UAV cooperation [2], formation flight [3], remote surveillance [4], and other fields [5–7]. UAVs face a variety of complex challenges and complicated tasks. Among them, path planning is the first problem a UAV must solve. How to make the UAV fly safely to its destination in an unknown working environment has become a hot topic for researchers. Faced with complex and uncertain environments, many algorithms have been proposed for solving UAV navigation problems. The most common methods address the motion control problem in an unknown environment, such as A-Star [8], artificial potential fields [9], the rapidly exploring random tree (RRT) algorithm [10], and so on [11–15]. However, due to model mismatch, insufficient measurement means, the high cost of accurate modeling, and model migration, it is difficult to obtain an accurate dynamic model of the aircraft in practical engineering. These model-based strategies can hardly be applied in practice in a complex and uncertain environment.
In order to overcome the limitations of model-based strategies in uncertain environments, some researchers have introduced learning-based methods. The first is supervised learning, which uses large amounts of data to simulate real situations [16, 17]. However, supervised learning requires sufficient data and cannot simulate the many changes of the real world [18]. The other is reinforcement learning (RL), which learns the mapping from environment to behavior through interaction, seeking the most accurate or optimal action decision by maximizing the state-value function and the action-value function [19]. Reinforcement learning has been applied to the UAV path planning problem [20]. It transforms online UAV path planning into a decision problem: the next sequence of UAV actions is decided according to the current environment and the UAV's own state as determined by sensors or external information. In an unknown complex environment, the UAV has little prior knowledge of the environment and is therefore required to adapt strongly to such uncertainty. RL provides a better approach to this kind of problem by using historical data to approximate the nonlinear functional relationship between states and overall performance [21–24].
Reinforcement learning has been extensively studied in recent years. For example, DeepMind innovatively proposed deep reinforcement learning (DRL) through the combination of deep learning (DL) and RL. DRL transforms high-dimensional input into a lower-dimensional state and has achieved promising results. Mnih et al. [25] proposed the deep Q-network (DQN) algorithm, which utilizes the powerful function-fitting ability of deep neural networks to avoid the huge storage space of the Q-table. DQN enhances the stability of the training process by using experience replay memory and a target network. Double DQN [26] and dueling DQN [27] were proposed gradually along with DRL research to overcome the defect of overestimation. The DQN algorithm has achieved great success in discrete spaces. However, in a high-dimensional continuous space, the discretized action space of DQN grows exponentially as the action segmentation becomes finer, resulting in training difficulties. The actor-critic (A-C) algorithm [28] adopts a method similar to the policy gradient: the actor network outputs the probability of each action, while the critic network is responsible for evaluating the output action. The A-C algorithm uses the critic network's approximate value function to guide the agent's updates and provide low-variance learning knowledge [29, 30]. Lillicrap et al. [31] proposed the deep deterministic policy gradient (DDPG) algorithm, which improves the stability of A-C evaluation by using the target network and experience replay of DQN. DDPG can be applied to applications with continuous action spaces and has achieved great success [32]. However, the performance of DDPG in practical applications is not very stable.
Reinforcement learning is independent of environmental models and prior knowledge, so it can effectively solve the UAV path planning problem in unknown environments. Reinforcement learning for UAV path planning has received extensive attention from scholars. UAV navigation was modeled as a reinforcement learning problem, and autonomous flight in unknown environments was validated in [33]. Junell et al. [34] used a reinforcement learning method for the flight test of a quadrotor aircraft in an unknown environment. The continuous action space of DDPG makes it widely used in path planning, but its convergence is often unstable in complex environments. Model-free reinforcement learning algorithms are based on temporal-difference or Monte Carlo methods [35] and suffer from the problem of overestimation. In a large state space, applying the policy gradient method brings high variance in the estimation results, which makes policy learning more sensitive and can even lead to training failure.
Numerous researchers have improved the accuracy of the value estimate by improving the neural network. For example, double DQN [26] avoids overestimating the Q value by using two value networks. The twin-delayed DDPG (TD3) [36, 37] algorithm alleviates the overestimation problem by introducing three key techniques. For real UAV flight, paper [38] solved the slow convergence caused by sparse rewards by introducing a reward function based on an artificial potential field. Paper [39] started from the DDPG experience base combined with a simulated annealing algorithm and accelerated the learning process of DRL through a multiexperience pool (MEP). Papers [38, 40] use an LSTM to approximate the critic network by combining the current observation sequence with the historical observation sequence, so that the UAV can break away from U-shaped obstacles in large-scale path planning. Paper [41] presented three improvements, namely environmental noise, delayed learning, and hybrid exploration techniques, to enhance the robustness of DDPG. Nevertheless, robustness is still a great challenge for UAV path planning. The TD3 algorithm solves the critic's overestimation problem by the method of clipped double Q-learning for actor-critic; however, relying only on the lower estimate often leads to slow convergence.
In order to solve the problem that the actor network relies heavily on the critic network, which makes DDPG performance very sensitive to critic learning, this paper proposes a multicritic-delayed DDPG method for solving UAV path planning. It uses the average estimate of a multicritic network to reduce DDPG's dependence on any single critic and a delayed learning method to reduce the overestimation problem of DDPG and the error accumulation of the target network. Considering the sensitivity of the UAV to parameters in the real environment, Gaussian noise is added to the action and the state to increase the robustness of the UAV. The main contributions of this paper are as follows:
(1) We propose a multicritic-delayed DDPG method, which includes two improvement techniques. The first is to add state noise and regularize it, which increases the robustness of the trained network. The second is to use multiple critics to average the error and mitigate the error accumulation caused by overestimation.
(2) We apply the proposed multicritic-delayed deep deterministic policy gradient method to solve UAV path planning, and a nonsparse reward model is designed.
(3) A UAV mission platform is built to train and evaluate the effectiveness and robustness of the proposed method. Simulation results show that the proposed algorithm is effective, with strong robustness and adaptive capability, for planning the path of a UAV flying to its destination in a complex environment.
The remainder of this paper is organized as follows: In Section 2, we briefly review reinforcement learning methods, i.e., DDPG, TD3, and MCDDPG, for solving UAV path planning. Section 3 gives a detailed description of the proposed multicritic-delayed deep deterministic policy gradient method. Section 4 provides the simulation results and analyzes the empirical results. Finally, we draw conclusions and discuss future work in Section 5.
2. Reinforcement Learning for Solving UAV Path Planning
2.1. UAV Motion Model
The motion model of the UAV is the basis of the path planning problem. The UAV is usually controlled through six degrees of freedom, representing the three coordinates of the UAV position and the three attitude angles, namely the yaw, roll, and pitch angles. A six-degree-of-freedom kinematic model is used to describe the internal state of the UAV.
For the sake of brevity and without loss of generality, we adopt a three-degree-of-freedom kinematic model instead of the six-degree-of-freedom one. Assume that the UAV is fixed at a constant horizontal altitude, so that its activity is confined to the plane. Ignoring the momentum of the UAV during flight and assuming a constant speed within each control step, a compact state vector is used to describe the position and motion of the UAV, and the control inputs are the change in velocity and the change in yaw angle.
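The paper's kinematic equations are not reproduced in this version. As an illustration only, the following minimal Python sketch implements one common constant-altitude, three-degree-of-freedom update consistent with the description above; the function name, the time step dt, and the state layout are assumptions made for this sketch.

```python
import numpy as np

def step_kinematics(x, y, psi, v, dv, dpsi, dt=0.1):
    """Advance a constant-altitude (3-DOF) UAV state by one control step.

    (x, y) : horizontal position   psi : yaw angle [rad]   v : forward speed
    dv, dpsi : commanded changes in speed and yaw angle
    This is one common form of the planar kinematic model; the paper's exact
    equation is not reproduced here, and dt is an assumed time step.
    """
    v_new = v + dv                            # speed update
    psi_new = psi + dpsi                      # yaw update
    x_new = x + v_new * np.cos(psi_new) * dt  # planar position update
    y_new = y + v_new * np.sin(psi_new) * dt
    return x_new, y_new, psi_new, v_new
```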
The state obtained by the UAV from the environment is composed of three parts, i.e., the internal state, the interactions with the environment, and the location of the target. The internal state contains six components describing the internal information of the UAV flight: the absolute position of the UAV, the velocity components along the corresponding coordinates, the flight speed, and the yaw angle. The surrounding environment state is determined by radar, range finders, and other tools during the real-time interaction between the UAV and the environment. In this paper, we use the range finders mounted on the UAV to receive the environment state, as shown in Figure 1(b). Besides, the target position of the UAV is given in the plane, as shown in Figure 1(a). By combining these three observations, i.e., the internal state, the range-finder readings, and the target position, we obtain the final description of the state.

Figure 1: UAV observation model: (a) the target position relative to the UAV; (b) the range finders mounted on the UAV.
The control of a UAV is complicated in the actual situation and requires multiple commands to realize its motion. In this paper, we select the UAV's speed and roll as the motion control commands. The control vector of the UAV has two components: the ratio of the current speed to the maximum speed, and a steering signal that turns the UAV to the desired roll angle.
2.2. Reinforcement Learning
In reinforcement learning, the agent changes its state through interaction with the environment so as to obtain returns and achieve the optimal strategy. The model is usually expressed as a Markov decision process (MDP) described by a five-tuple: the set of environmental states, the set of all possible actions, the transition probability of moving from the current state to the next state after taking an action, the immediate reward received after the agent acts, and a discount factor. Reinforcement learning is designed to maximize future rewards, which accumulate over a sequence of transitions. Based on the reward, reinforcement learning introduces two value functions. The first is the state-value function under a policy, where the policy maps system states to a probability distribution over the actions.
The second is the action-value function, which additionally conditions on the first action taken. In both functions, the discount factor represents the difference in importance between future rewards and present rewards.
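For reference, the standard textbook forms of these two value functions, with discount factor \(\gamma\) and policy \(\pi\), are (the paper's own equation numbering is not reproduced here):

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s\right],
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s,\ a_t = a\right].
```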
The value functions are used to measure how good a certain state or state-action pair is, that is, whether it is worthwhile for the agent to be in a certain state or to execute a certain action in that state. Figure 2 illustrates the control of the agent under the reinforcement learning model.
Figure 2: Control of the agent under the reinforcement learning model.
2.3. Nonsparse Reward Model
The reward function of traditional RL uses a simple sparse reward model, that is, an agent gets a reward only when it reaches the destination. This paper utilizes a nonsparse reward method to provide guidance for model learning. Obviously, nonsparse rewards provide more navigation domain knowledge than sparse rewards while preserving the policy invariance of the reward.
The nonsparse reward consists of four terms, each weighted by a contribution rate. The first term depends on the change in distance between the current position and the destination, i.e., the difference between the previous and the current relative distance between the UAV and the target; when the UAV gets closer to the target, this term gives a reward related to speed, guiding the UAV to its destination quickly. The second term is a constant penalty applied before the UAV reaches its destination, so the UAV is encouraged to complete its mission in a minimum number of steps and is punished after each transition. The third term is a reward for flying without obstacles, encouraging the UAV to fly toward accessible places and explore more space. Together, these terms encourage the UAV to shorten its range while still exploring: the UAV should fly toward its target as quickly as possible, with penalties for deviations, but it should be encouraged to move toward free space if there are obstacles in the direction of flight. The fourth term prevents the UAV from getting too close to an obstacle; it depends on the minimum distance between the UAV and the obstacles and on a constant that controls the scale of that distance. An exponential function is used so that the UAV is punished heavily when it gets very close to an obstacle while still being allowed to fly near obstacles, ensuring that the UAV actively avoids and stays away from obstacles.
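As a concrete illustration of this four-term structure, the sketch below combines a speed-scaled progress term, a constant step penalty, a free-space term, and an exponential obstacle-proximity penalty. The weights, constants, and exact functional forms are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def nonsparse_reward(d_prev, d_curr, d_min_obstacle, speed,
                     weights=(1.0, 1.0, 1.0, 1.0),
                     step_penalty=0.05, free_dist=5.0, c=1.0):
    """Illustrative nonsparse reward mirroring the four described ingredients.

    The weights, constants, and functional forms are placeholders; the
    paper's actual coefficients are not reproduced here.
    """
    # (1) progress toward the target, scaled by speed when the distance shrinks
    progress = d_prev - d_curr
    r1 = speed * progress if progress > 0 else progress
    # (2) constant penalty per step, encouraging short missions
    r2 = -step_penalty
    # (3) reward for heading into free space (far from obstacles)
    r3 = min(d_min_obstacle, free_dist) / free_dist
    # (4) exponential penalty that grows sharply as the UAV nears an obstacle
    r4 = -np.exp(-d_min_obstacle / c)
    w1, w2, w3, w4 = weights
    return w1 * r1 + w2 * r2 + w3 * r3 + w4 * r4
```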
2.4. Deep Deterministic Policy Gradient
Deep deterministic policy gradient (DDPG) adopts the actor-critic framework of reinforcement learning: the critic judges the value of the action produced by the actor, and the actor adjusts its action output based on the value given by the critic. Two deep neural networks, a critic network and an actor network, are used to approximate the action-value function and the deterministic policy, respectively, each with its own set of parameters. The critic network learns the state-action value by minimizing the temporal-difference (TD) error.
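In standard DDPG, this objective (presumably the formula (7) referred to below) takes the following form, where \(Q'\) and \(\mu'\) are the target critic and target actor networks introduced in the next paragraph:

```latex
L(\theta^{Q}) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1})}\!\left[\big(y_t - Q(s_t, a_t \mid \theta^{Q})\big)^{2}\right],
\qquad
y_t = r_t + \gamma\, Q'\!\big(s_{t+1}, \mu'(s_{t+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big).
```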
In addition, the target network technique of deep Q-learning is used to remove the coupling during the update of formula (7). Figure 3 illustrates the DDPG motion control framework, which contains both a target critic network and a target actor network.
Figure 3: The DDPG motion control framework.
Obviously, the samples obtained by the agent in RL are highly correlated. Researchers use a replay buffer to address this problem: the correlation of samples is broken by storing experiences and then drawing random samples from the experience replay when the network is trained. DDPG is derived from the deterministic policy gradient theorem for MDPs. According to this theorem, for an MDP with a continuous action space, the deterministic policy gradient exists; when the variance of a stochastic policy approaches zero, the policy becomes a deterministic action, and the gradient is taken over the state distribution induced by the selected policy. The actor network guides the choice of actions by maximizing the performance objective (8). DDPG trains the networks with the stochastic gradient descent (SGD) algorithm on minibatches and then updates the target networks with a soft update at a small updating rate.
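For reference, the standard deterministic policy gradient and soft target update used by DDPG, consistent with the description above, are (with \(\tau\) the updating rate and \(\rho^{\mu}\) the state distribution under the policy \(\mu\)):

```latex
\nabla_{\theta^{\mu}} J \approx \mathbb{E}_{s \sim \rho^{\mu}}\!\left[\nabla_{a} Q(s, a \mid \theta^{Q})\big|_{a = \mu(s \mid \theta^{\mu})}\, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\right],
\qquad
\theta' \leftarrow \tau \theta + (1 - \tau)\, \theta'.
```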
2.5. Twin-Delayed Deep Deterministic Policy Gradient
Twin-delayed deep deterministic policy gradient (TD3) adopts an improved clipped variant of double Q-learning to reduce the overestimation problem of the network. Following the idea of double Q-learning, TD3 uses two critics that share the same pool of experience. Algorithm 1 contains the pseudocode of TD3. The minimum of the two target critic values is used to compute the target for updating the critic networks, where the two critics and their corresponding target networks each have their own parameters. The actor network is updated according to only the first critic, while both critics are updated by minimizing the loss function against this shared target, as in Equation (10). When a deterministic policy is updated, it tends to exploit narrow peaks in the value estimate. TD3 therefore smooths the value estimate over a small area around the action by adding noise to the target action, where the noise follows a Gaussian distribution with zero mean and a fixed variance and is clipped to control its size. This noise can be seen as a regularization method that makes value function updates smoother.
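A standard way to write the TD3 target described above, combining the clipped double Q-learning minimum with the clipped Gaussian smoothing noise, is:

```latex
y = r + \gamma \min_{i = 1, 2} Q'_{i}\big(s', \mu'(s' \mid \theta^{\mu'}) + \epsilon \mid \theta^{Q'_{i}}\big),
\qquad
\epsilon \sim \operatorname{clip}\big(\mathcal{N}(0, \sigma),\, -c,\, c\big).
```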
A deterministic policy is prone to errors caused by function approximation and increases the variance of the target; the regularization of Equation (11) smooths the target policy. On the other hand, TD3 uses a stable target network approach to reduce error accumulation: the policy and the target networks are updated only once for every fixed number of target critic updates. Delaying the target network updates in this way interrupts the accumulation of errors and keeps the TD error small while the target networks are updated slowly.
Algorithm 1: Pseudocode of the TD3 algorithm.
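To make the delayed-update structure concrete, the following is a minimal PyTorch-style sketch of one TD3 update step. It assumes actor/critic modules that map states (and actions) to tensors, lists `critics`/`critics_t` holding the two critics and their targets, and common hyperparameter names (`sigma`, `c`, `d`, `tau`); these names and defaults are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def td3_update(actor, actor_t, critics, critics_t, opt_actor, opt_critics,
               batch, step, gamma=0.99, tau=0.005, sigma=0.2, c=0.5, d=2):
    """One TD3-style update on a minibatch (s, a, r, s2, done).

    `critics`/`critics_t` are lists holding the two critic networks and their
    target copies; each critic maps (state, action) to a Q value. The names
    and defaults follow common TD3 implementations, not the paper's notation.
    """
    s, a, r, s2, done = batch
    with torch.no_grad():
        # target policy smoothing: clipped Gaussian noise on the target action
        noise = (torch.randn_like(a) * sigma).clamp(-c, c)
        a2 = (actor_t(s2) + noise).clamp(-1.0, 1.0)
        # clipped double Q-learning: take the minimum of the two target critics
        q_target = torch.min(critics_t[0](s2, a2), critics_t[1](s2, a2))
        y = r + gamma * (1.0 - done) * q_target
    # both critics regress toward the same shared target
    for critic, opt in zip(critics, opt_critics):
        loss = F.mse_loss(critic(s, a), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # delayed policy and target-network updates, once every d critic updates
    if step % d == 0:
        actor_loss = -critics[0](s, actor(s)).mean()
        opt_actor.zero_grad()
        actor_loss.backward()
        opt_actor.step()
        for net, net_t in zip([actor] + critics, [actor_t] + critics_t):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)
```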
2.6. MCDDPG
Although DDPG is widely used in UAV path planning because it handles continuous action spaces well, many researchers have improved DDPG for UAV applications because of its poor stability and slow convergence. Because the actor's learning ability depends on the critic's judgment, paper [42] proposed a multicritic deep deterministic policy gradient (MCDDPG) to overcome the sensitivity of training the critic network. MCDDPG is applied to UAV path planning in [20]. Algorithm 2 contains the pseudocode of MCDDPG. Specifically, it approximates the action-value function with the average of the values of several critics, each with its own parameters, instead of a single critic. The average of all critics diminishes the impact of the overestimation caused by any individual critic. The TD error of Equation (7) is rewritten accordingly, so that the average TD error uses the target computed from the average of the target critic networks. Using the same TD error to update every critic would cause the critics to lose diversity, while fully separate updates would make the critics differ greatly. Therefore, unlike the DDPG critic network, both the local error of each critic and the global error of the average must be considered when calculating the critic loss. According to Equation (13), the loss function of each critic is a weighted combination of these errors, where the weighting factors take values between 0 and 1, the first two sum to 1, and the loss degenerates to formula (7) when the global term is removed.
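A plausible way to write the quantities described above (the exact weighting scheme used in [42] is not reproduced here) is the average critic value, the average target, and a per-critic loss mixing the local and global errors:

```latex
\bar{Q}(s, a) = \frac{1}{N} \sum_{i=1}^{N} Q_{i}(s, a \mid \theta_{i}),
\qquad
\bar{y} = r + \gamma\, \frac{1}{N} \sum_{i=1}^{N} Q'_{i}\big(s', \mu'(s') \mid \theta'_{i}\big),
\qquad
L_{i} = \lambda_{1} \big(\bar{y} - Q_{i}(s, a)\big)^{2} + \lambda_{2} \big(\bar{y} - \bar{Q}(s, a)\big)^{2}.
```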
Algorithm 2: Pseudocode of the MCDDPG algorithm.
3. Multicritic-Delayed DDPG Method
The delayed updating of the critic network is very important in practical applications. The critic network of the traditional DDPG method is updated frequently during training, which may increase the number of training steps and cause overestimation. If overestimated actions occur in the learning process, the agent will get lost and training may fail. From the perspective of the UAV, the delayed update strategy is equivalent to global guidance of the UAV flight, while traditional DDPG guides the UAV flight through small corrections; existing studies show that the delayed update strategy is more in line with actual UAV flight guidance. Thus, this paper proposes a multicritic-delayed DDPG method, named MCD, for solving the UAV path planning problem. Algorithm 3 contains the pseudocode of MCD. It uses the delayed update strategy to improve the robustness of the algorithm. Although TD3's clipped double Q network can effectively alleviate the overestimation caused by the neural network, it also leads to underestimation. The proposed MCD therefore adopts the multicritic error update of formula (14) to approximate the value of the critic network.
Algorithm 3: Pseudocode of the proposed MCD algorithm.
MCD prevents underestimation by retaining the global mean error of the multicritic networks, while preserving the error between each individual critic and the average guarantees diversity. Another improvement is to add noise to the state. DDPG increases the agent's exploration of the environment by adding Ornstein-Uhlenbeck (OU) noise to the action. In fact, the UAV's observation of the real environment is often inaccurate; if the UAV depends too heavily on the trained model, deviations in the observed state will often cause the UAV to crash. We simulate the deviation of the environmental state input in the real situation by adding Gaussian noise to the observed state.
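One plausible form of this perturbation, consistent with the description, corrupts the observed state with zero-mean Gaussian noise before it is fed to the networks:

```latex
\tilde{s} = s + \epsilon_{s}, \qquad \epsilon_{s} \sim \mathcal{N}(0, \sigma_{s}^{2}).
```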
The noise follows a zero-mean Gaussian distribution with a fixed standard deviation. The state noise is introduced only after the model has been trained for a certain number of episodes, which builds robustness into the trained network; introducing the noise after the network has already been trained to a certain extent ensures that the initial network is not driven to failure by the noise.
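The sketch below shows how the ingredients described in this section, i.e. the delayed introduction of state noise, the multicritic average target, a local-plus-global critic loss, and delayed actor/target updates, could fit together in one PyTorch-style update step. It is only an illustrative sketch under those assumptions, not the authors' implementation; the loss weighting is a simplified two-factor scheme.

```python
import torch
import torch.nn.functional as F

def mcd_update(actor, actor_t, critics, critics_t, opt_actor, opt_critics,
               batch, step, episode, gamma=0.99, tau=0.005,
               sigma_s=0.2, noise_start=1000, d=2, lam1=0.5, lam2=0.5):
    """One illustrative MCD-style update on a minibatch (s, a, r, s2, done).

    `critics`/`critics_t` are lists of N critic networks and their targets.
    This sketch only mirrors the described structure; it is not the authors'
    implementation, and the loss weighting is simplified.
    """
    s, a, r, s2, done = batch
    if episode >= noise_start:          # state noise enabled only later in training
        s = s + torch.randn_like(s) * sigma_s
        s2 = s2 + torch.randn_like(s2) * sigma_s
    with torch.no_grad():
        a2 = actor_t(s2).clamp(-1.0, 1.0)
        # average of the target critics instead of a single (or minimum) estimate
        y = r + gamma * (1.0 - done) * torch.stack(
            [c_t(s2, a2) for c_t in critics_t]).mean(dim=0)
    q_vals = [c(s, a) for c in critics]
    q_avg = torch.stack(q_vals).mean(dim=0)
    # local error of each critic plus the global error of the average
    critic_loss = sum(lam1 * F.mse_loss(q_i, y) for q_i in q_vals) \
                  + len(q_vals) * lam2 * F.mse_loss(q_avg, y)
    for opt in opt_critics:
        opt.zero_grad()
    critic_loss.backward()
    for opt in opt_critics:
        opt.step()
    if step % d == 0:                   # delayed actor and target-network updates
        actor_loss = -torch.stack([c(s, actor(s)) for c in critics]).mean()
        opt_actor.zero_grad()
        actor_loss.backward()
        opt_actor.step()
        for net, net_t in zip([actor] + critics, [actor_t] + critics_t):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)
```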
4. Experiments
In this section, in order to verify the performance of the proposed MCD, we compare it with three algorithms, i.e., DDPG, TD3, and MCDDPG, on a synthetic test problem.
4.1. Experimental Platform Setting
We built random environments of different complexity. The terrain of each environment is a rectangular area, as shown in Figure 4. The simulation environment is randomly generated from 49 cylindrical obstacles of fixed diameter placed in the rectangular region (the cylindrical obstacles may overlap). The UAV flies at a fixed horizontal altitude and is equipped with nine range sensors with a limited detection range. The UAV flies from its initial position to its designated target, and its maximum speed and maximum yaw rate are limited.
Figure 4: The randomly generated simulation environment with cylindrical obstacles.
The critic networks adopt the same network structure. The observed states used as inputs are normalized to 19 dimensions, and the actor network outputs a 2-dimensional action to control the UAV. The weighting parameters are set identically in MCDDPG and MCD. The number of critic networks in Equation (12) is fixed: when it is set too high, overestimation is suppressed more strongly but the operation efficiency becomes too low, so the value recommended in paper [42] is adopted. UAV observations and actions are normalized. The Adam optimizer [43] is used to learn the network parameters, with fixed learning rates for the actor and critic networks, a fixed discount factor, a fixed soft update rate, and fixed minibatch size and experience replay capacity. In addition, Gaussian noise is used to increase action exploration, and the standard deviation of the state noise is 0.2. The state noise is enabled once the number of training episodes reaches a given threshold. In TD3, the smoothing noise added to the action is clipped. The contribution rates of the reward and the maximum number of iterations are also fixed.
4.2. Performance of Multicritic Delayed
As shown in Figure 5, the same number of training episodes was used for all models. MCD achieves a success rate that is significantly better than those of the compared algorithms. In the early stage of training, MCD does not perform as well as TD3 and MCDDPG, which is due to the conservative estimates of the multicritic networks. However, this ensures that the MCD estimate is not too high and can grow steadily, and MCD clearly exceeds the other algorithms in the later period. For more specific verification, we trained each model three times, intercepted a window of episodes, averaged them, and recorded the average reward per episode and the total reward per episode, as shown in Figure 6. From this figure, we can see that MCD fluctuates much less and obtains better rewards than the other three algorithms.


Figure 5: Success rate of the compared algorithms during training.
Figure 6: Training rewards: (a) average reward per episode; (b) total reward per episode.
In order to prove the generalization ability of the algorithms, we further calculated the success rate, collision rate, and loss rate of the agents. Table 1 lists the results obtained by the compared algorithms. The results show that the generalization of MCD is better than that of TD3: its success rate is 94.3%, which is higher than the 88.4% of TD3. This proves that using the average critic network can effectively improve the success rate and extract more environmental information. We also note that the loss rate during model exploitation is greatly improved compared with that during model training, and the trained model can avoid collisions between the UAV and obstacles, which is useful in practical situations.
4.3. Testing of Different Algorithms
In this part, we use the actor networks trained in the last section to observe the influence of the different algorithms on UAV flight. We loaded the actor networks trained with DDPG, TD3, MCDDPG, and MCD onto the UAV to guide its flight. Starting from the same position, the UAV flies through the field of 49 cylindrical obstacles of fixed radius and reaches the target position. Figure 7 plots the flight paths obtained by the algorithms, and Table 2 lists the lengths and the numbers of steps of the flight paths.

Figure 7: Flight paths obtained by (a) DDPG, (b) TD3, (c) MCDDPG, and (d) MCD.
From this figure, we can see that the UAV guided by DDPG moves quickly but passes close to the obstacles: it approaches the target with fewer steps and a faster speed, but it is also more likely to hit an obstacle. Because the target point is near an obstacle, DDPG tends to steer around it when planning the path, which can cause the UAV to get lost in the region far from the obstacles. The UAV guided by MCDDPG flies in a safer manner and can correct its course at a small cost when the navigation toward the target goes wrong. TD3's path planning is more conservative, sacrificing time for a longer path, and hesitancy during the first half of the flight results in a zigzag flight path. MCD absorbs the advantages of the above three algorithms: it reaches the destination with the shortest path, takes fewer turns than TD3, and is not far from the other two algorithms in flight time. Clearly, MCD is superior to the other algorithms because it enables the UAV to complete the task with minimal path cost and less time.
4.4. Testing of Complex Environment
To further verify the robustness of MCD in more complex environments, we set up a series of environmental threat tests with different numbers of obstacles. The environment shown in Figure 4 corresponds to a density of 0.5; a density of 1 represents 100 obstacles, and each decrease of 0.1 removes 10 obstacles. We repeated 2,000 episodes for each of the four algorithms in the same obstacle environment, redeploying the UAV and the target in each episode. The success rates over the 2,000 episodes are shown in Figure 8. Obviously, as the number of obstacles rises, the success rates of all four algorithms begin to decline. However, the MCD algorithm declines the most slowly and still maintains the highest success rate in the most complex environment, while MCDDPG, TD3, and DDPG drop to lower levels. Therefore, MCD is highly adaptable to complex environments.
Figure 8: Success rates of the four algorithms under different obstacle densities.
5. Conclusion
In this paper, we proposed a reinforcement learning method, named MCD, for solving the UAV path planning problem in a complex environment. It uses multicritic networks and a delayed learning method to reduce the overestimation problem of DDPG and adds noise to improve robustness in real environments. Moreover, a UAV mission platform is built to train and evaluate the effectiveness and robustness of the proposed method. Simulation results show that the proposed algorithm is superior to the traditional DDPG in path planning. However, some issues remain to be resolved, such as the MCD hyperparameter settings, improvements to the nonsparse reward, and the experience replay settings.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
We declare that there is no conflict of interest.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (62172110), in part by the Natural Science Foundation of Guangdong Province (2021A1515011839), and in part by the Programme of Science and Technology of Guangdong Province (2021A0505110004 and 2020A0505100056).