Research Article
UAV Path Planning Based on Multicritic-Delayed Deep Deterministic Policy Gradient
1 Initialize the critic networks and the actor network with parameters θ_1, …, θ_N and φ, respectively.
2 Initialize the target critic networks and the target actor network with parameters θ'_i ← θ_i and φ' ← φ, respectively.
3 Initialize the replay buffer B, the maximum flight time T, the training parameters, and the updating rate τ.
4 for each episode do
5     Reset the environment and receive the initial observation state s_1.
6     for t = 1 to T do
7         Select the action a_t according to the current policy and exploration noise.
8         Obtain the reward r_t and observe the new state s_{t+1}.
9         Store the transition (s_t, a_t, r_t, s_{t+1}) in B.
10        Sample a random minibatch of transitions from B.
11        Update the critic networks by minimizing the loss function in Equation (14).
12        Update the actor network with the policy gradient in Equation (8).
13        Update the parameters of the target networks with updating rate τ.
14    end
15 end
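To make the loop above concrete, the following is a minimal PyTorch sketch of one training iteration, not the paper's implementation. The network sizes, hyperparameter values (STATE_DIM, N_CRITICS, GAMMA, TAU, POLICY_DELAY), and the min-over-target-critics aggregation are illustrative assumptions; since Equations (8) and (14) are not reproduced here, the standard deterministic policy gradient and a TD mean-squared-error loss stand in for them. The listing applies steps 12-13 every step, so the POLICY_DELAY gate below (reflecting the "delayed" in the method's name) can be set to 1 to match it exactly.

```python
# Minimal sketch of one multicritic-delayed DDPG update. All sizes and
# hyperparameters below are illustrative assumptions, not the paper's.
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 6, 2          # assumed UAV observation/action sizes
N_CRITICS, GAMMA, TAU = 3, 0.99, 0.005
POLICY_DELAY, BATCH = 2, 64

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                         nn.Linear(128, out_dim))

# Steps 1-2: critics, actor, and their target copies.
actor, actor_t = mlp(STATE_DIM, ACTION_DIM), mlp(STATE_DIM, ACTION_DIM)
actor_t.load_state_dict(actor.state_dict())
critics = [mlp(STATE_DIM + ACTION_DIM, 1) for _ in range(N_CRITICS)]
critics_t = [mlp(STATE_DIM + ACTION_DIM, 1) for _ in range(N_CRITICS)]
for c, ct in zip(critics, critics_t):
    ct.load_state_dict(c.state_dict())

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opts = [torch.optim.Adam(c.parameters(), lr=1e-3) for c in critics]
buffer = deque(maxlen=100_000)        # Step 3: replay buffer B

def soft_update(net, target):
    # Step 13: theta' <- tau * theta + (1 - tau) * theta'
    for p, pt in zip(net.parameters(), target.parameters()):
        pt.data.mul_(1 - TAU).add_(TAU * p.data)

def train_step(step):
    # Step 10: sample a random minibatch of stored transitions.
    s, a, r, s2 = (torch.stack(x) for x in
                   zip(*random.sample(list(buffer), BATCH)))
    with torch.no_grad():
        a2 = actor_t(s2)
        # Assumed target aggregation: minimum over the N target critics.
        q_next = torch.min(torch.stack(
            [ct(torch.cat([s2, a2], dim=1)) for ct in critics_t]),
            dim=0).values
        y = r.unsqueeze(1) + GAMMA * q_next
    # Step 11: update each critic toward the shared TD target
    # (a stand-in for the loss in Equation (14)).
    for c, opt in zip(critics, critic_opts):
        loss = nn.functional.mse_loss(c(torch.cat([s, a], dim=1)), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Steps 12-13, gated by the assumed policy-update delay.
    if step % POLICY_DELAY == 0:
        # Deterministic policy gradient (stand-in for Equation (8)).
        actor_loss = -critics[0](torch.cat([s, actor(s)], dim=1)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
        soft_update(actor, actor_t)
        for c, ct in zip(critics, critics_t):
            soft_update(c, ct)

# Smoke test with random transitions in place of a real UAV environment.
for _ in range(BATCH):
    buffer.append((torch.randn(STATE_DIM), torch.randn(ACTION_DIM),
                   torch.randn(()), torch.randn(STATE_DIM)))
train_step(step=0)
```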