Research Article

Two-Loop Acceleration Autopilot Design and Analysis Based on TD3 Strategy

Algorithm 1

TD3 algorithm.
TD3: Twin Delayed Deep Deterministic Policy Gradient
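Randomly initialize the critic-network parameters $\theta_1$, $\theta_2$ and the actor-network parameters $\phi$
Initialize the target network parameters $\theta_1' \leftarrow \theta_1$, $\theta_2' \leftarrow \theta_2$, $\phi' \leftarrow \phi$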
Initialize the replay buffer R
for episode = 1, M do
 Initialize an exploration noise process $\epsilon$ for action exploration
 Receive the initial environment state $s_1$
 for t = 1, T do
  Select an action according to the current policy and exploration noise:
          $a_t = \pi_{\phi}(s_t) + \epsilon_t$
  Execute the action $a_t$ and observe the reward $r_t$ and the next state $s_{t+1}$
  Store the transition $(s_t, a_t, r_t, s_{t+1})$ in R
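  Sample a mini-batch of N transitions $(s_i, a_i, r_i, s_{i+1})$ from R
  $\tilde{a}_i = \pi_{\phi'}(s_{i+1}) + \tilde{\epsilon}, \quad \tilde{\epsilon} \sim \mathrm{clip}(\mathcal{N}(0, \tilde{\sigma}), -c, c)$
  $y_i = r_i + \gamma \min_{j=1,2} Q_{\theta_j'}(s_{i+1}, \tilde{a}_i)$
  Update the critic-network parameters by minimizing the loss:
       $L(\theta_j) = \frac{1}{N} \sum_{i} \left( y_i - Q_{\theta_j}(s_i, a_i) \right)^2, \quad j = 1, 2$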
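  if t mod d = 0 then
   Update the actor-network parameters through deterministic policy gradients:
        $\nabla_{\phi} J(\phi) = \frac{1}{N} \sum_{i} \nabla_{a} Q_{\theta_1}(s_i, a)\big|_{a = \pi_{\phi}(s_i)} \, \nabla_{\phi} \pi_{\phi}(s_i)$
   Update the target networks:
        $\theta_j' \leftarrow \tau \theta_j + (1 - \tau)\, \theta_j', \quad j = 1, 2$
        $\phi' \leftarrow \tau \phi + (1 - \tau)\, \phi'$
  end if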
  end if
 end for
end for
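
The three steps that distinguish TD3 from DDPG in Algorithm 1 (the clipped double-Q target with smoothing noise, the delayed update test on t mod d, and the soft target-network update) can be sketched in plain Python as follows. This is a minimal illustration and not the authors' implementation: the function names `td3_target`, `soft_update`, and `is_delayed_step`, the scalar action, the symmetric action limit `a_max`, and all default hyperparameter values are assumptions made for the example.

```python
import random


def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def td3_target(r, s_next, actor_target, q1_target, q2_target,
               gamma=0.99, sigma=0.2, noise_clip=0.5, a_max=1.0, rng=random):
    """Clipped double-Q target with target policy smoothing.

    Assumes a scalar action bounded by [-a_max, a_max]; the callables stand
    in for the target actor and the two target critics (hypothetical names).
    """
    # Smoothing noise: zero-mean Gaussian, clipped to [-noise_clip, noise_clip].
    eps = clip(rng.gauss(0.0, sigma), -noise_clip, noise_clip)
    # Perturbed target action, kept inside the admissible action range.
    a_next = clip(actor_target(s_next) + eps, -a_max, a_max)
    # Use the smaller of the two target-critic estimates to curb overestimation.
    q_min = min(q1_target(s_next, a_next), q2_target(s_next, a_next))
    return r + gamma * q_min


def soft_update(target, online, tau=0.005):
    """Return tau * online + (1 - tau) * target, element-wise over flat lists."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


def is_delayed_step(t, d=2):
    """True on the steps where the actor and the target networks are updated."""
    return t % d == 0


# Toy usage with hand-written stand-ins for the networks (sigma=0 removes the noise).
y = td3_target(r=1.0, s_next=1.0,
               actor_target=lambda s: 0.5 * s,
               q1_target=lambda s, a: s + a,
               q2_target=lambda s, a: s + 2.0 * a,
               gamma=0.9, sigma=0.0)
new_target = soft_update(target=[0.0, 1.0], online=[1.0, 1.0], tau=0.5)
```

In the toy call the two critics return 1.5 and 2.0 for the same state-action pair, so the target is built from the smaller value. A real implementation would apply `soft_update` to every parameter tensor of the actor and both critics, and only on steps where `is_delayed_step(t, d)` holds.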