Research Article

Diversity Evolutionary Policy Deep Reinforcement Learning

Algorithm 1: DEPRL
Input: soft-update coefficient τ, experience pool sampling sizes N and M, maximum number of time steps Tmax, discount factor γ, experience pool capacity Δsize, population parameters K and K1
Output: actor network parameters corresponding to the optimal policy
(1) Initialize the critic network parameters θ1, θ2, θtarg,1, θtarg,2 and the actor network parameter distribution (μ0, Σ0)
(2) Ttotal = 0, Tactor = 0
(3) WHILE Ttotal < Tmax:
(4)   Extract K sets of parameters para from the current distribution (μ, Σ)
(5)   FOR k = 1 TO K/2:
(6)     Initialize the actor according to the parameters para[k]
(7)     FOR t = 1 TO 2Tactor/K:
(8)       Sample N samples from Δ to minimize the objective function (3)
(9)       Update θtarg,1 and θtarg,2 through equations (5) and (6)
(10)  FOR k = 1 TO K1:
(11)    Initialize the actor according to the parameters para[k]
(12)    FOR t = 1 TO Tactor:
(13)      Sample N samples from Δ to maximize the objective function (11)
(14)      Replace the original parameters para[k] with the new actor parameters
(15)  FOR k = K1 + 1 TO K/2:
(16)    Initialize the actor according to the parameters para[k]
(17)    FOR t = 1 TO Tactor:
(18)      Sample N samples from Δ to maximize the objective function (12)
(19)      Replace the original parameters para[k] with the new actor parameters
(20)  Tactor = 0
(21)  FOR k = 1 TO K:
(22)    Initialize the actor according to the parameters para[k]
(23)    Interact with the environment to calculate the cumulative payoff G and the total number of time steps used Tepisode
(24)    Store the collected data in the experience pool Δ
(25)    Sample M samples from Δ to calculate the DMMD between them and
(26)    Tactor = Tactor + Tepisode
(27)  Ttotal = Ttotal + Tactor
(28)  Select elite samples according to G and DMMD, and update the distribution according to equations (12) and (13)
(29) END WHILE
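Step (9) updates the target critic parameters θtarg,1 and θtarg,2 using the soft-update coefficient τ from the input list. Equations (5) and (6) are not reproduced in this excerpt; the sketch below assumes they take the standard Polyak-averaging form used by TD3-style critics, so the concrete value of τ and the names here are illustrative only.

import numpy as np

def soft_update(theta_targ: np.ndarray, theta: np.ndarray, tau: float) -> np.ndarray:
    # Polyak averaging: move the target parameters a fraction tau toward the
    # online parameters (assumed form of equations (5) and (6)).
    return tau * theta + (1.0 - tau) * theta_targ

# Applied to both target critics after each critic gradient step in step (9);
# tau = 0.005 is a common TD3 default, not a value taken from this article.
# theta_targ_1 = soft_update(theta_targ_1, theta_1, tau=0.005)
# theta_targ_2 = soft_update(theta_targ_2, theta_2, tau=0.005)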
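To make the control flow of Algorithm 1 easier to follow, the Python sketch below mirrors its evolutionary outer loop. It is a minimal illustration under stated assumptions, not the article's implementation: the helpers evaluate and gradient_update, the diagonal-Gaussian (CEM-style) distribution refit, and the 0.5 weight that mixes G and DMMD into an elite score are placeholders for the paper's equations (3), (5)-(6), and (11)-(13).

import numpy as np

def sample_population(mu, sigma, K, rng):
    # Step (4): draw K flat parameter vectors from a diagonal Gaussian N(mu, diag(sigma^2)).
    return [mu + sigma * rng.standard_normal(mu.shape) for _ in range(K)]

def update_distribution(para, scores, n_elite):
    # Step (28): refit (mu, sigma) to the top-n_elite parameter vectors.
    # The article's equations (12) and (13) define the actual update; this is a stand-in.
    elite_idx = np.argsort(scores)[-n_elite:]
    elite = np.stack([para[i] for i in elite_idx])
    return elite.mean(axis=0), elite.std(axis=0) + 1e-3  # variance floor keeps exploration alive

def deprl_outer_loop(evaluate, gradient_update, mu0, sigma0,
                     K=10, K1=3, n_elite=5, T_max=1_000_000, seed=0):
    # Skeleton of steps (3)-(29). `evaluate(theta)` must return (G, D_mmd, T_episode)
    # for one episode; `gradient_update(theta, objective, n_steps)` must return
    # improved actor parameters. Both stand in for the TD3-style critic training
    # of steps (5)-(9) and the actor objectives (11)/(12).
    rng = np.random.default_rng(seed)
    mu, sigma = mu0.copy(), sigma0.copy()
    T_total, T_actor = 0, 0
    while T_total < T_max:
        para = sample_population(mu, sigma, K, rng)
        # Steps (10)-(19): the first half of the population is refined by gradient steps,
        # the first K1 members using objective (11) and the remainder objective (12).
        for k in range(K // 2):
            objective = 11 if k < K1 else 12
            para[k] = gradient_update(para[k], objective, n_steps=T_actor)
        # Steps (20)-(27): evaluate every member and track the environment steps consumed.
        T_actor = 0
        returns, diversities = [], []
        for theta in para:
            G, D_mmd, T_episode = evaluate(theta)
            returns.append(G)
            diversities.append(D_mmd)
            T_actor += T_episode
        T_total += T_actor
        # Step (28): elite selection trades return against diversity (weight assumed).
        scores = np.asarray(returns) + 0.5 * np.asarray(diversities)
        mu, sigma = update_distribution(para, scores, n_elite)
    return mu

Tying the next generation's gradient budget to T_actor, the environment steps consumed during evaluation, reproduces the role that Tactor plays in steps (7), (12), and (17).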