Research Article
UAV Path Planning Based on Multicritic-Delayed Deep Deterministic Policy Gradient
Algorithm 1: Multicritic-Delayed Deep Deterministic Policy Gradient

1  Initialize the critic networks Q_{θ_1}, …, Q_{θ_N} and the actor network π_φ with parameters θ_1, …, θ_N, and φ, respectively.
2  Initialize the target critic networks Q_{θ'_1}, …, Q_{θ'_N} and the target actor network π_{φ'}, respectively.
3  Initialize the replay buffer B and the maximum flight time T.
4  for episode = 1 to M do
5      Reset the environment and receive the initial observation state s_1.
6      for t = 1 to T do
7          Select action a_t = π_φ(s_t) + ε and obtain the reward r_t and new state s_{t+1}.
8          Store the transition (s_t, a_t, r_t, s_{t+1}) in B.
9          Sample a random minibatch of transitions from B.
10         Compute the target value for each sampled transition.
11         Update the critic networks by minimizing the loss function in Equation (10).
12         if t mod d = 0 then
13             Update the actor network with the policy gradient in Equation (8).
14             Update the parameters of the target networks with updating rate τ.
15         end if
16     end for
17 end for