Research Article

UAV Path Planning Based on Multicritic-Delayed Deep Deterministic Policy Gradient

Algorithm 2: MCDDPG method.
1 Initialize the critic networks and the actor network with their respective parameters.
2 Initialize the target critic networks and the target actor network with their respective parameters.
3 Initialize the replay buffer, the maximum flight time, the algorithm parameters, and the updating rate.
4 for each episode do
5   Reset the environment and receive the initial observation state.
6   for each time step up to the maximum flight time do
7     Select an action according to the current policy and exploration noise.
8     Obtain the reward and observe the new state.
9     Store the transition in the replay buffer.
10    Sample a random minibatch of transitions from the replay buffer.
11    Update the critic networks by minimizing the loss function in Equation (14).
12    Update the actor network with the policy gradient in Equation (8).
13    Update the parameters of the target networks with the updating rate.
14  end
15 end
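The bookkeeping steps of the algorithm (replay buffer storage and sampling in steps 3, 9, and 10, and the target update in step 13) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `ReplayBuffer`, `soft_update`, `run_skeleton`, and `tau` are hypothetical, the target update is assumed to take the usual soft form (target = tau * online + (1 - tau) * target), and the scalar environment is a toy stand-in. The critic and actor updates of steps 11 and 12 depend on Equations (14) and (8) and are left as a placeholder.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest transition when full.
        self.storage = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.storage.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement (step 10).
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def soft_update(target_params, online_params, tau):
    """Assumed form of step 13: tau * online + (1 - tau) * target, per element."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]


def run_skeleton(num_episodes=2, max_flight_time=5, batch_size=4, tau=0.5, seed=0):
    """Loop structure only; all numeric values here are illustrative assumptions."""
    rng = random.Random(seed)
    buffer = ReplayBuffer(capacity=100)
    actor = [1.0, -1.0]         # placeholder online actor parameters
    actor_target = [0.0, 0.0]   # placeholder target actor parameters

    for _ in range(num_episodes):
        state = 0.0  # step 5: reset the (toy, scalar) environment
        for _ in range(max_flight_time):
            # Step 7: policy output plus Gaussian exploration noise.
            action = actor[0] * state + rng.gauss(0.0, 0.1)
            # Step 8: toy dynamics and reward, stand-ins for the UAV environment.
            next_state = state + action
            reward = -abs(next_state)
            buffer.store(state, action, reward, next_state)  # step 9
            if len(buffer) >= batch_size:
                batch = buffer.sample(batch_size, rng)  # step 10
                # Steps 11-12 (critic and actor updates) would use `batch` here.
            actor_target = soft_update(actor_target, actor, tau)  # step 13
            state = next_state
    return buffer, actor_target
```

With a small `tau`, the target parameters trail the online parameters slowly, which is the usual motivation for a soft update; the value 0.5 above is chosen only to make the effect visible after a few steps.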