| Input: |
| Output: |
(1) | Randomly initialize the actor network $\mu(s\mid\theta^{\mu})$ and the critic network $Q(s,a\mid\theta^{Q})$ with parameters $\theta^{\mu}$ and $\theta^{Q}$ |
(2) | Initialize the target networks $\mu'$ and $Q'$, and copy the online network parameters to them: $\theta^{\mu'} \leftarrow \theta^{\mu}$, $\theta^{Q'} \leftarrow \theta^{Q}$ |
(3) | Initialize the experience replay buffer $D$, the exploration noise $\mathcal{N}$, and the discount factor $\gamma$ |
(4) | Outer loop over episodes $m = 1, 2, \ldots, M$ |
(5) | Initialize the environment and obtain the initial state $s_1$ |
(6) | Inner loop over time steps $t = 1, 2, \ldots, T$ |
(7) | Select the action $a_t = \mu(s_t\mid\theta^{\mu}) + \mathcal{N}_t$ |
(8) | Execute the action $a_t$, and observe the reward $r_t$ and the new state $s_{t+1}$ |
(9) | Store the transition $(s_t, a_t, r_t, s_{t+1})$ in the replay buffer $D$ |
(10) | Randomly sample a minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $D$ |
(11) | Calculate the target value: $y_i = r_i + \gamma\, Q'\big(s_{i+1}, \mu'(s_{i+1}\mid\theta^{\mu'})\mid\theta^{Q'}\big)$ |
(12) | Update the critic network by minimizing the mean squared error loss: $L = \frac{1}{N}\sum_{i}\big(y_i - Q(s_i, a_i\mid\theta^{Q})\big)^{2}$ |
(13) | Update the actor network with the sampled policy gradient: $\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_{i}\nabla_{a}Q(s,a\mid\theta^{Q})\big|_{s=s_i,\,a=\mu(s_i)}\,\nabla_{\theta^{\mu}}\mu(s\mid\theta^{\mu})\big|_{s=s_i}$ |
(14) | Regularly update the target network parameters (soft update): $\theta^{Q'} \leftarrow \tau\theta^{Q} + (1-\tau)\theta^{Q'}$, $\theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1-\tau)\theta^{\mu'}$ |
(15) | End the inner loop |
(16) | End the outer loop (a minimal implementation sketch follows below) |
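As a concrete illustration of steps (1)–(16), the sketch below implements the same actor-critic update in PyTorch. The network architectures, the stand-in environment dynamics and reward, the Gaussian exploration noise, and all hyperparameter values ($\gamma$, $\tau$, noise scale, learning rates, batch size, $M$, $T$) are assumptions made for the example, not values taken from this work.

```python
# Minimal DDPG-style sketch of steps (1)-(16).  Network sizes, the stand-in
# environment, and all hyperparameters below are illustrative assumptions.
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM = 3, 1                      # assumed dimensions
GAMMA, TAU, NOISE_STD = 0.99, 0.005, 0.1          # assumed gamma, tau, noise scale
BATCH_SIZE, BUFFER_SIZE = 64, 100_000

class Actor(nn.Module):                           # deterministic policy mu(s | theta^mu)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, ACTION_DIM), nn.Tanh())
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):                          # action-value function Q(s, a | theta^Q)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

# Steps (1)-(2): online networks and target copies
actor, critic = Actor(), Critic()
actor_t, critic_t = Actor(), Critic()
actor_t.load_state_dict(actor.state_dict())
critic_t.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

buffer = deque(maxlen=BUFFER_SIZE)                # step (3): replay buffer D

def select_action(state):
    # Step (7): a_t = mu(s_t) + exploration noise (Gaussian noise assumed here)
    with torch.no_grad():
        a = actor(state)
    return (a + NOISE_STD * torch.randn_like(a)).clamp(-1.0, 1.0)

def update():
    if len(buffer) < BATCH_SIZE:
        return
    # Step (10): sample a random minibatch of transitions
    s, a, r, s2 = (torch.stack(x) for x in zip(*random.sample(buffer, BATCH_SIZE)))

    # Step (11): y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
    with torch.no_grad():
        y = r + GAMMA * critic_t(s2, actor_t(s2))

    # Step (12): critic update, mean squared error against the target value
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Step (13): actor update, ascend Q(s, mu(s)) via the sampled policy gradient
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Step (14): soft update of both target networks
    for net, net_t in ((actor, actor_t), (critic, critic_t)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1 - TAU).add_(TAU * p.data)

# Steps (4)-(9): interaction loop with a stand-in environment (fake dynamics/reward)
for episode in range(10):                         # M = 10 episodes (illustrative)
    state = torch.zeros(STATE_DIM)                # step (5): initial state
    for t in range(50):                           # T = 50 steps (illustrative)
        action = select_action(state)
        next_state = state + 0.1 * torch.randn(STATE_DIM)      # placeholder transition
        reward = torch.tensor([-next_state.norm().item()])     # placeholder reward
        buffer.append((state, action, reward, next_state))     # step (9): store in D
        update()                                               # steps (10)-(14)
        state = next_state
```

In a real application, the placeholder transition and reward would be replaced by the actual environment, and the Gaussian exploration noise could be swapped for the Ornstein-Uhlenbeck process commonly used with DDPG.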