UAVs Maneuver Decision-Making Method Based on Transfer Reinforcement Learning
Table 2
Pseudocode of the 1vs1 countermeasure algorithm based on DDPG transfer learning.
Pseudocode of the 1vs1 countermeasure algorithm based on DDPG
(1) Randomly initialize the parameters of the actor and critic evaluated networks. Initialize the experience pool D with capacity M. Initialize the batch sample size batch_size, the attenuation factor, the soft update coefficient, the Gaussian noise variance noise, the maximum number of rounds Max_Episode, and the maximum number of steps per round Max_Step
(2) For episode = 1 to Max_Episode do
(3) Obtain the respective states of both sides according to the initial settings of the simulation environment
(4) For t = 1 to Max_Step do
(5) Feed the current state into the actor evaluated network to obtain the UAV’s action, which is passed through a function that enforces the upper and lower limits of the UAV’s restricted action
(6) If there is an enemy UAV, the enemy UAV makes the corresponding confrontation maneuver decision according to the description in Table 2, executes that action, and updates its own state
(7) Select the action according to the strategy: with a certain probability, the UAV being trained randomly selects an action within the action range; otherwise, it takes the action from step 5. Then obtain the corresponding reward value and move the environment to its state at the next moment
(8) Store the sample data of the interaction between the UAV and the environment in the experience pool D
(9) Randomly select batch_size of training sample data from experience pool D
(10) Calculate the loss function of the critic evaluated network and update the parameters of the critic evaluated network through backpropagation to minimize the loss function
(11) Calculate the loss function of the actor evaluated network and update the parameters of the actor evaluated network through backpropagation of the loss function
(12) Every C steps, update the parameters of the actor and critic target networks
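The bookkeeping parts of the steps above (the experience pool, action limiting, exploratory action selection, and the target-network update) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names ExperiencePool, clip_action, select_action, td_target, and soft_update are hypothetical, the sample layout (state, action, reward, next state) is assumed, and the critic target and soft update follow the standard DDPG forms, which the table does not spell out. The neural networks themselves are omitted.

```python
import random
from collections import deque


class ExperiencePool:
    """Fixed-capacity experience pool D (steps 1, 8, and 9)."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest sample once the pool is full.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        # Assumed sample layout: (state, action, reward, next_state).
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling without replacement.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def clip_action(action, low, high):
    """Restrict every action component to [low, high] (step 5)."""
    return [min(max(a, low), high) for a in action]


def select_action(actor_action, low, high, explore_prob, rng=random):
    """Step 7: random action with probability explore_prob, else the actor's action."""
    if rng.random() < explore_prob:
        return [rng.uniform(low, high) for _ in actor_action]
    return clip_action(actor_action, low, high)


def td_target(reward, next_q, gamma):
    """Standard DDPG critic target: reward + gamma * target-critic value (assumed form)."""
    return reward + gamma * next_q


def soft_update(target_params, eval_params, tau):
    """Standard soft update: target <- tau * evaluated + (1 - tau) * target (assumed form)."""
    return [tau * e + (1.0 - tau) * t for t, e in zip(target_params, eval_params)]
```

In a training loop, store would be called once per step, sample once the pool holds at least batch_size entries, and soft_update every C steps with the soft update coefficient from step 1.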