Research Article
Reinforcement Learning Guided by Double Replay Memory
Algorithm 1: Double experience replay (DER).
Given:
  - an off-policy RL algorithm E (here, DQN)
  - sampling strategies (s1, s2) for the replay buffers, where s1 is uniform sampling and s2 is TD-error-based sampling
  - an update probability p for admitting used transitions into the second replay buffer H2

Initialize E
Initialize replay buffers H1, H2
Observe s0 and choose a0 using E
for episode = 1, M do
    observe r_t and s_{t+1}
    store transition (s_t, a_t, r_t, s_{t+1}) in H1
    for t = 1, T do
        if N2 > k then
            with s1 and s2, using sampling ratio λ, sample transitions from H1 and H2
        else
            with s1, sample transitions from H1
        update weights according to E
        put used transitions into H2 with probability p
        if transitions were sampled from H2 then
            update their TD errors in H2 according to s2
until converge
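To make the two-buffer mechanics concrete, the following Python sketch illustrates one way Algorithm 1 could be realized. It is not the authors' implementation: the class and method names (DoubleReplayMemory, store, sample, feedback), FIFO eviction, and the small priority offset 1e-6 are all assumptions for illustration; H1, H2, λ, p, and k follow the notation of Algorithm 1.

```python
import random
import numpy as np

class DoubleReplayMemory:
    """Illustrative double replay memory (assumed structure, not the
    paper's code): H1 is sampled uniformly (s1); H2 holds previously
    used transitions, sampled in proportion to their TD errors (s2)."""

    def __init__(self, capacity=10000, p=0.5, lam=0.5, k=1000):
        self.h1 = []          # primary buffer (uniform sampling)
        self.h2 = []          # secondary buffer: (priority, transition) pairs
        self.capacity = capacity
        self.p = p            # probability a used transition enters H2
        self.lam = lam        # sampling ratio: fraction of a batch drawn from H2
        self.k = k            # minimum H2 size before mixed sampling starts

    def store(self, transition):
        """Every new transition goes into H1 (FIFO eviction assumed)."""
        if len(self.h1) >= self.capacity:
            self.h1.pop(0)
        self.h1.append(transition)

    def sample(self, batch_size):
        """Mixed H1/H2 sampling once N2 > k, otherwise H1 only.
        Returns the batch plus the H2 indices of its H2-drawn part."""
        if len(self.h2) > self.k:
            n2 = int(self.lam * batch_size)
            batch = random.sample(self.h1, batch_size - n2)
            prios = np.array([pr for pr, _ in self.h2])
            idx = np.random.choice(len(self.h2), n2, p=prios / prios.sum())
            batch += [self.h2[i][1] for i in idx]
            return batch, list(idx)
        return random.sample(self.h1, min(batch_size, len(self.h1))), []

    def feedback(self, batch, td_errors, h2_indices):
        """After the learner's weight update: refresh priorities of
        transitions drawn from H2, and admit the remaining used
        transitions into H2 with probability p."""
        n1 = len(batch) - len(h2_indices)
        for j, i in enumerate(h2_indices):      # H2-drawn tail of the batch
            self.h2[i] = (abs(td_errors[n1 + j]) + 1e-6, self.h2[i][1])
        for t, err in zip(batch[:n1], td_errors[:n1]):
            if random.random() < self.p:        # admission with probability p
                if len(self.h2) >= self.capacity:
                    self.h2.pop(0)
                self.h2.append((abs(err) + 1e-6, t))
```

A DQN training loop would call store() on each new transition, sample() before each gradient step, and feedback() with the freshly computed TD errors, mirroring the "update weight / put used transition / update TD error" steps of Algorithm 1.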