Research Article

UAV Path Planning Based on Multicritic-Delayed Deep Deterministic Policy Gradient

Algorithm 3

MCD method.
1 Initialize the critic networks and the actor network, each with its own parameters.
2 Initialize the target critic networks and the target actor network with copies of the parameters of the corresponding online networks.
3 Initialize the replay buffer, the maximum flight time, the algorithm parameters, and the updating rate.
4 for each episode do
5 Reset the environment and receive the initial observation state.
6 for each time step, up to the maximum flight time, do
7 Select an action according to the current policy and exploration noise.
8 if the corresponding condition is met then
9 Apply Equation (15).
10 end
11 Obtain the reward, observe the new state, and store the transition in the replay buffer.
12 Sample a random minibatch of transitions from the replay buffer.
13 Update the critic networks by minimizing the loss function in Equation (14).
14 if the current step is a multiple of the policy update delay then
15 Update the actor network with the policy gradient in Equation (8).
16 Update the parameters of the target networks with the updating rate.
17 end
18 end
19 end
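
The following is a minimal PyTorch sketch of one training step with the same structure as Algorithm 3: several critics updated towards a shared target, a delayed actor update, and soft target-network updates. It does not reproduce the paper's Equations (8), (14), and (15); the minimum-over-target-critics target, the network sizes, and the names MCDAgent, n_critics, policy_delay, and tau are illustrative assumptions only.

```python
# Sketch of a multi-critic, delayed-update DDPG training step (assumed form).
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))

class MCDAgent:
    def __init__(self, state_dim, action_dim, max_action,
                 n_critics=3, gamma=0.99, tau=0.005, policy_delay=2):
        self.actor = Actor(state_dim, action_dim, max_action)
        self.actor_target = copy.deepcopy(self.actor)
        self.critics = [Critic(state_dim, action_dim) for _ in range(n_critics)]
        self.critic_targets = [copy.deepcopy(c) for c in self.critics]
        self.actor_opt = torch.optim.Adam(self.actor.parameters(), lr=1e-3)
        self.critic_opts = [torch.optim.Adam(c.parameters(), lr=1e-3)
                            for c in self.critics]
        self.gamma, self.tau, self.policy_delay = gamma, tau, policy_delay
        self.total_updates = 0

    def update(self, batch):
        # state: (B, state_dim); action: (B, action_dim);
        # reward, done: (B, 1); next_state: (B, state_dim).
        state, action, reward, next_state, done = batch
        self.total_updates += 1

        # Shared target value; the minimum over target critics is an assumed choice.
        with torch.no_grad():
            next_action = self.actor_target(next_state)
            target_qs = torch.cat(
                [ct(next_state, next_action) for ct in self.critic_targets], dim=1)
            target_q = reward + (1.0 - done) * self.gamma * \
                target_qs.min(dim=1, keepdim=True).values

        # Update every critic towards the shared target (role of Equation (14)).
        for critic, opt in zip(self.critics, self.critic_opts):
            loss = nn.functional.mse_loss(critic(state, action), target_q)
            opt.zero_grad()
            loss.backward()
            opt.step()

        # Delayed actor and target updates, corresponding to step 14 of Algorithm 3.
        if self.total_updates % self.policy_delay == 0:
            actor_loss = -self.critics[0](state, self.actor(state)).mean()
            self.actor_opt.zero_grad()
            actor_loss.backward()
            self.actor_opt.step()

            # Soft (Polyak) update of all target networks with rate tau.
            pairs = [(self.actor, self.actor_target)] + \
                list(zip(self.critics, self.critic_targets))
            for online, target in pairs:
                for p, tp in zip(online.parameters(), target.parameters()):
                    tp.data.mul_(1.0 - self.tau).add_(self.tau * p.data)
```

Here, policy_delay plays the role of the modulus test in step 14: the critics are updated at every step, while the actor and the target networks are updated only every policy_delay steps.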