Research Article
Two-Loop Acceleration Autopilot Design and Analysis Based on TD3 Strategy
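TD3: Twin Delayed Deep Deterministic Policy Gradient

Randomly initialize the critic-network parameters $\theta^{Q_1}$, $\theta^{Q_2}$ and the actor-network parameters $\theta^{\mu}$
Initialize the target network parameters: $\theta^{Q_1'} \leftarrow \theta^{Q_1}$, $\theta^{Q_2'} \leftarrow \theta^{Q_2}$, $\theta^{\mu'} \leftarrow \theta^{\mu}$
Initialize the replay buffer $R$
for episode = 1, M do
  Initialize an exploration noise $\epsilon$ for action exploration
  Receive the initial environmental state $s_1$
  for t = 1, T do
    Select an action according to the current policy and exploration noise:
      $a_t = \mu(s_t \mid \theta^{\mu}) + \epsilon_t$
    Execute the action $a_t$ and observe the reward $r_t$ and the next state $s_{t+1}$
    Store the transition $(s_t, a_t, r_t, s_{t+1})$ in $R$
    Sample a minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$ and compute the smoothed target action and the target value:
      $\tilde{a}_{i+1} = \mu'(s_{i+1} \mid \theta^{\mu'}) + \tilde{\epsilon}, \quad \tilde{\epsilon} \sim \mathrm{clip}(\mathcal{N}(0, \tilde{\sigma}), -c, c)$
      $y_i = r_i + \gamma \min_{j=1,2} Q_j'(s_{i+1}, \tilde{a}_{i+1} \mid \theta^{Q_j'})$
    Update the critic-network parameters:
      $\theta^{Q_j} \leftarrow \arg\min_{\theta^{Q_j}} \frac{1}{N} \sum_i \left( y_i - Q_j(s_i, a_i \mid \theta^{Q_j}) \right)^2, \quad j = 1, 2$
    if t mod d = 0 then
      Update the actor-network parameters through the deterministic policy gradient:
        $\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_i \nabla_a Q_1(s, a \mid \theta^{Q_1})\big|_{s=s_i,\, a=\mu(s_i)} \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_i}$
      Update the target networks:
        $\theta^{Q_j'} \leftarrow \tau \theta^{Q_j} + (1-\tau)\, \theta^{Q_j'}, \quad \theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1-\tau)\, \theta^{\mu'}$
    end if
  end for
end for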
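Three steps of the TD3 listing above set it apart from DDPG: the target value built from the smaller of two target critics evaluated at a noise-smoothed target action, the actor update performed only every d time steps, and the soft update of the target parameters. The following is a minimal Python sketch of those three steps only, not the article's implementation. All function names are hypothetical, the networks are stood in for by plain callables and flat parameter lists, and the default values of `gamma`, `sigma`, `c`, `a_max` and `tau` are illustrative assumptions rather than settings taken from the article.

```python
import random


def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def smoothed_target_action(mu_target, s_next, sigma=0.2, c=0.5, a_max=1.0):
    """Target action with clipped Gaussian smoothing noise.

    mu_target is a stand-in for the target actor; a_max is an assumed
    symmetric action bound.
    """
    noise = clip(random.gauss(0.0, sigma), -c, c)
    return clip(mu_target(s_next) + noise, -a_max, a_max)


def td3_target(r, s_next, mu_target, q1_target, q2_target, gamma=0.99, **smooth_kw):
    """Target value y = r + gamma * min(Q1', Q2') at the smoothed target action."""
    a_next = smoothed_target_action(mu_target, s_next, **smooth_kw)
    q_min = min(q1_target(s_next, a_next), q2_target(s_next, a_next))
    return r + gamma * q_min


def soft_update(target_params, online_params, tau=0.005):
    """In-place update theta' <- tau * theta + (1 - tau) * theta'."""
    for i, w in enumerate(online_params):
        target_params[i] = tau * w + (1.0 - tau) * target_params[i]


def delayed_update_steps(T, d):
    """1-based time steps at which the actor and the targets are updated."""
    return [t for t in range(1, T + 1) if t % d == 0]
```

With `d = 2`, for example, `delayed_update_steps(6, 2)` selects steps 2, 4 and 6, so the critics are updated twice as often as the actor. Taking the minimum over the two target critics is what limits the overestimation of the target value that a single critic tends to produce.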