Diversity Evolutionary Policy Deep Reinforcement Learning
Algorithm 1: DEPRL.
Input: the soft-update coefficient τ, the batch sizes N and M for sampling from the experience pool, the maximum number of time steps Tmax, the discount factor γ, the experience pool capacity Δsize, and the population parameters K and K1
Output: actor network parameters corresponding to the optimal policy
(1) Initialize the critic network parameters θ1, θ2, θtarg,1, θtarg,2 and the actor network parameter distribution (μ0, Σ0)
(2) Ttotal = 0, Tactor = 0
(3) WHILE Ttotal < Tmax:
(4)     Sample K sets of parameters para from the current distribution (μ, Σ)
(5)     FOR k = 1 TO K/2:
(6)         Initialize the actor with the parameters para[k]
(7)         FOR t = 1 TO 2Tactor/K:
(8)             Sample N samples from Δ and minimize the objective function (3)
(9)             Update θtarg,1 and θtarg,2 through equations (5) and (6)
(10)    FOR k = 1 TO K1:
(11)        Initialize the actor with the parameters para[k]
(12)        FOR t = 1 TO Tactor:
(13)            Sample N samples from Δ and maximize the objective function (11)
(14)        Replace the original parameters para[k] with the new actor parameters
(15)    FOR k = K1 + 1 TO K/2:
(16)        Initialize the actor with the parameters para[k]
(17)        FOR t = 1 TO Tactor:
(18)            Sample N samples from Δ and maximize the objective function (12)
(19)        Replace the original parameters para[k] with the new actor parameters
(20)    Tactor = 0
(21)    FOR k = 1 TO K:
(22)        Initialize the actor with the parameters para[k]
(23)        Interact with the environment to compute the cumulative return G and the number of time steps used, Tepisode
(24)        Store the collected data in the experience pool Δ
(25)        Sample M samples from Δ and calculate the DMMD between them and the data generated by the current actor
(26)        Tactor = Tactor + Tepisode
(27)    Ttotal = Ttotal + Tactor
(28)    Select elite samples according to G and DMMD, and update the distribution according to equations (12) and (13)
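
To make the control flow of Algorithm 1 concrete, the Python sketch below reproduces only its evolutionary outer loop (roughly steps (1)-(4) and (20)-(28)) under explicit assumptions that are not taken from the paper: the parameter distribution is a diagonal Gaussian updated with a standard CEM-style elite mean/variance estimate standing in for equations (12) and (13); the DMMD score is taken to be a Gaussian-kernel maximum mean discrepancy between the states visited by a population member and a reference batch of M samples from the experience pool; rollout is a placeholder for the environment interaction of step (23); the 0.5 diversity weight in the elite selection is arbitrary; and the gradient-based critic and actor updates of steps (5)-(19), i.e., objective functions (3), (11), and (12), are omitted.

# Minimal sketch of the evolutionary outer loop of Algorithm 1 (assumptions noted above).
import numpy as np

rng = np.random.default_rng(0)

def gaussian_mmd(x, y, bandwidth=1.0):
    """Squared MMD between two sample sets with a Gaussian (RBF) kernel (assumed DMMD form)."""
    def kernel(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * bandwidth ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

def rollout(params, state_dim=4, horizon=50):
    """Placeholder for step (23): returns (cumulative return G, visited states, episode length)."""
    states = rng.normal(loc=params[:state_dim], scale=1.0, size=(horizon, state_dim))
    G = -np.linalg.norm(params) + rng.normal()            # stand-in cumulative return
    return G, states, horizon

def deprl_outer_loop(dim=8, K=10, n_elite=5, M=64, T_max=5_000):
    mu, sigma = np.zeros(dim), np.ones(dim)               # step (1): (mu_0, Sigma_0)
    pool = [rng.normal(size=4) for _ in range(M)]         # experience pool Δ (states only here)
    T_total = 0
    while T_total < T_max:                                 # step (3)
        para = mu + sigma * rng.normal(size=(K, dim))      # step (4): draw K parameter sets
        scores = []
        for k in range(K):                                 # steps (21)-(26)
            G, states, T_episode = rollout(para[k])
            pool.extend(states)                            # step (24): store episode data
            ref = np.stack([pool[i] for i in rng.integers(len(pool), size=M)])
            dmmd = gaussian_mmd(states, ref)               # step (25): assumed DMMD score
            scores.append((G, dmmd))
            T_total += T_episode
        # step (28): rank by return plus a diversity bonus (weighting is an assumption)
        fitness = np.array([g + 0.5 * d for g, d in scores])
        elite = para[np.argsort(fitness)[-n_elite:]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mu

if __name__ == "__main__":
    print("final distribution mean:", deprl_outer_loop())

In a full implementation, rollout, the return G, and the elite-selection weighting would be replaced by the paper's environment interaction, objective functions, and equations (12) and (13); here they serve only to keep the loop structure runnable.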