Diversity Evolutionary Policy Deep Reinforcement Learning
Algorithm 1: DEPRL.
Input: the soft-update coefficient τ, the batch sizes N and M for sampling from the experience pool, the maximum number of time steps Tmax, the discount factor γ, the experience pool capacity Δsize, and the population parameters K and K1
Output: actor network parameters corresponding to the optimal policy
(1) Initialize the critic network parameters θ1, θ2, θtarg,1, θtarg,2 and the actor network parameter distribution (μ0, Σ0)
(2) Ttotal = 0, Tactor = 0
(3) WHILE Ttotal < Tmax:
(4)     Sample K sets of parameters para from the current distribution (μ, Σ)
(5)     FOR k = 1 TO K/2:
(6)         Initialize the actor with the parameters para[k]
(7)         FOR t = 1 TO 2Tactor/K:
(8)             Sample N samples from Δ and minimize the objective function (3)
(9)             Update θtarg,1 and θtarg,2 through equations (5) and (6)
(10)    FOR k = 1 TO K1:
(11)        Initialize the actor with the parameters para[k]
(12)        FOR t = 1 TO Tactor:
(13)            Sample N samples from Δ and maximize the objective function (11)
(14)        Replace the original parameters para[k] with the new actor parameters
(15)    FOR k = K1 + 1 TO K/2:
(16)        Initialize the actor with the parameters para[k]
(17)        FOR t = 1 TO Tactor:
(18)            Sample N samples from Δ and maximize the objective function (12)
(19)        Replace the original parameters para[k] with the new actor parameters
(20)    Tactor = 0
(21)    FOR k = 1 TO K:
(22)        Initialize the actor with the parameters para[k]
(23)        Interact with the environment to compute the cumulative return G and the number of time steps used, Tepisode
(24)        Store the collected data in the experience pool Δ
(25)        Sample M samples from Δ and calculate the DMMD between them and the data generated by the current actor
(26)        Tactor = Tactor + Tepisode
(27)    Ttotal = Ttotal + Tactor
(28)    Select elite samples according to G and DMMD, and update the distribution according to equations (12) and (13)
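
To make the control flow of Algorithm 1 concrete, the Python sketch below reproduces only its evolutionary outer loop (roughly steps (1)-(4) and (20)-(28)) under explicit assumptions that are not taken from the paper: the parameter distribution is a diagonal Gaussian updated with a standard CEM-style elite mean/variance estimate standing in for equations (12) and (13); the DMMD score is taken to be a Gaussian-kernel maximum mean discrepancy between the states visited by a population member and a reference batch of M samples from the experience pool; rollout is a placeholder for the environment interaction of step (23); the 0.5 diversity weight in the elite selection is arbitrary; and the gradient-based critic and actor updates of steps (5)-(19), i.e., objective functions (3), (11), and (12), are omitted.

# Minimal sketch of the evolutionary outer loop of Algorithm 1 (assumptions noted above).
import numpy as np

rng = np.random.default_rng(0)

def gaussian_mmd(x, y, bandwidth=1.0):
    """Squared MMD between two sample sets with a Gaussian (RBF) kernel (assumed DMMD form)."""
    def kernel(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * bandwidth ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

def rollout(params, state_dim=4, horizon=50):
    """Placeholder for step (23): returns (cumulative return G, visited states, episode length)."""
    states = rng.normal(loc=params[:state_dim], scale=1.0, size=(horizon, state_dim))
    G = -np.linalg.norm(params) + rng.normal()            # stand-in cumulative return
    return G, states, horizon

def deprl_outer_loop(dim=8, K=10, n_elite=5, M=64, T_max=5_000):
    mu, sigma = np.zeros(dim), np.ones(dim)               # step (1): (mu_0, Sigma_0)
    pool = [rng.normal(size=4) for _ in range(M)]         # experience pool Δ (states only here)
    T_total = 0
    while T_total < T_max:                                 # step (3)
        para = mu + sigma * rng.normal(size=(K, dim))      # step (4): draw K parameter sets
        scores = []
        for k in range(K):                                 # steps (21)-(26)
            G, states, T_episode = rollout(para[k])
            pool.extend(states)                            # step (24): store episode data
            ref = np.stack([pool[i] for i in rng.integers(len(pool), size=M)])
            dmmd = gaussian_mmd(states, ref)               # step (25): assumed DMMD score
            scores.append((G, dmmd))
            T_total += T_episode
        # step (28): rank by return plus a diversity bonus (weighting is an assumption)
        fitness = np.array([g + 0.5 * d for g, d in scores])
        elite = para[np.argsort(fitness)[-n_elite:]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mu

if __name__ == "__main__":
    print("final distribution mean:", deprl_outer_loop())

In a full implementation, rollout, the return G, and the elite-selection weighting would be replaced by the paper's environment interaction, objective functions, and equations (12) and (13); here they serve only to keep the loop structure runnable.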