Research Article
UAV Path Planning Based on Multicritic-Delayed Deep Deterministic Policy Gradient
1  Initialize the critic networks Q_{θ_1}, …, Q_{θ_N} and the actor network π_φ with parameters θ_1, …, θ_N and φ, respectively.
2  Initialize the target critic networks and the target actor network with parameters θ'_i ← θ_i and φ' ← φ, respectively.
3  Initialize the replay buffer B, the maximum flight time T, the discount factor γ, and the soft-updating rate τ.
4  for episode = 1, …, M do
5      Reset the environment and receive the initial observation state s_1.
6      for t = 1, …, T do
7          Select action a_t according to the current policy and exploration noise.
8          if the exploration condition is satisfied then
9              Adjust the action according to Equation (15).
10         end
11         Execute a_t, obtain the reward r_t, observe the new state s_{t+1}, and store the transition (s_t, a_t, r_t, s_{t+1}) in B.
12         Sample a random minibatch of transitions from B.
13         Update the critic networks by minimizing the loss function in Equation (14).
14         if t mod d = 0 then
15             Update the actor network with the policy gradient in Equation (8).
16             Update the parameters of the target networks with updating rate τ.
17         end
18     end
19  end
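The training loop above can be sketched in plain NumPy. This is a minimal toy illustration, not the paper's implementation: the linear actor/critic models, the toy environment dynamics, the min-over-critics target combination, and all hyperparameter values (learning rate, delay d, noise scale) are assumptions chosen only to make the structure of the algorithm concrete — multiple critics, a replay buffer, exploration noise, and delayed actor/target updates.

```python
import random
from collections import deque

import numpy as np

rng = np.random.default_rng(0)
random.seed(0)

STATE_DIM, ACTION_DIM, N_CRITICS = 3, 1, 3
GAMMA, TAU, DELAY, LR = 0.99, 0.005, 2, 1e-3  # discount, soft-update rate, actor delay d, step size

# Linear stand-ins for the actor and the N critic networks (toy models, not the paper's DNNs).
actor = 0.1 * rng.normal(size=(ACTION_DIM, STATE_DIM))
critics = [0.1 * rng.normal(size=STATE_DIM + ACTION_DIM) for _ in range(N_CRITICS)]
actor_t = actor.copy()                      # target actor
critics_t = [w.copy() for w in critics]     # target critics

buffer = deque(maxlen=10_000)               # replay buffer B

def policy(w, s):
    return np.tanh(w @ s)                   # deterministic policy with bounded actions

def q_value(w, s, a):
    return float(w @ np.concatenate([s, a]))

def step_env(s, a):
    # Toy dynamics (assumption): state drifts toward the origin; reward = -distance.
    s2 = 0.9 * s + 0.1 * np.concatenate([a, np.zeros(STATE_DIM - ACTION_DIM)])
    return s2, -float(np.linalg.norm(s2))

for episode in range(5):
    s = rng.normal(size=STATE_DIM)          # reset environment, initial state
    for t in range(50):
        # Select action from current policy plus exploration noise.
        a = policy(actor, s) + 0.1 * rng.normal(size=ACTION_DIM)
        s2, r = step_env(s, a)
        buffer.append((s, a, r, s2))        # store transition in B
        s = s2
        if len(buffer) < 32:
            continue
        batch = random.sample(list(buffer), 32)
        # Critic update: TD regression toward a multicritic target. Taking the
        # minimum over target critics curbs overestimation (combination rule assumed).
        for (bs, ba, br, bs2) in batch:
            a2 = policy(actor_t, bs2)
            y = br + GAMMA * min(q_value(w, bs2, a2) for w in critics_t)
            x = np.concatenate([bs, ba])
            for i, w in enumerate(critics):
                critics[i] = w + LR * (y - float(w @ x)) * x
        # Delayed updates: actor and targets move only every DELAY steps.
        if t % DELAY == 0:
            for (bs, _, _, _) in batch[:8]:
                ba = np.tanh(actor @ bs)
                dq_da = critics[0][STATE_DIM:]      # ∂Q/∂a for a linear critic
                da_dpre = 1.0 - ba ** 2             # tanh derivative
                actor = actor + LR * np.outer(dq_da * da_dpre, bs)
            # Soft (Polyak) update of the target networks with rate TAU.
            actor_t = (1 - TAU) * actor_t + TAU * actor
            critics_t = [(1 - TAU) * wt + TAU * w
                         for wt, w in zip(critics_t, critics)]
```

Note the two timescales: every step fills the buffer and trains all critics, while the actor and the target networks are refreshed only once per `DELAY` steps, which is the "delayed" part of the algorithm's name.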