Research Article
UAV Path Planning Based on Multicritic-Delayed Deep Deterministic Policy Gradient
Algorithm 1: Multicritic-Delayed Deep Deterministic Policy Gradient

1  Initialize the critic networks Q_{θ_1}, …, Q_{θ_N} and the actor network π_φ with parameters θ_1, …, θ_N, and φ, respectively.
2  Initialize the target critic networks Q_{θ'_1}, …, Q_{θ'_N} and the target actor network π_{φ'}, respectively.
3  Initialize the replay buffer B and the maximum flight time T.
4  for episode = 1 to M do
5      Reset the environment and receive the initial observation state s_1.
6      for t = 1 to T do
7          Select action a_t = π_φ(s_t) + ε and obtain the reward r_t and new state s_{t+1}.
8          Store the transition (s_t, a_t, r_t, s_{t+1}) in B.
9          Sample a random minibatch of transitions from B.
10         Compute the target value for each sampled transition.
11         Update the critic networks by minimizing the loss function in Equation (10).
12         if t mod d = 0 then
13             Update the actor network with the policy gradient in Equation (8).
14             Update the parameters of the target networks with updating rate τ.
15         end if
16     end for
17 end for