| Initialize ,,,,,,,,. |
| Initialize experience pool and mini-batches . |
| Initialize the parameter of the estimation network as . |
| Initialize the parameter of the target network . |
| 1: For episode do |
| 2: For time-slot do |
| 3: Input into the estimation network and output the ; |
| 4: Select the action using the adaptive policy algorithm and update according to equation (19); |
| 5: Execute action and generate the observation and ; |
| 6: Compute from ,; |
| 7: Store into the experience-replay pool . |
| 8: If then |
| 9: Randomly generate an index subset ; |
| 10: Sample from ; |
| 11: For each sample in do |
| 12: Compute the and obtain . |
| 13: End for |
| 14: Calculate the loss function according to equation (16) and update according to equation (17); |
| 15: Minimize the loss function with learning rate . |
| 17: End if |
| 18: Every time slots: Update by setting . |
| 19: End for |
| 20: End for |
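
The listing above has the shape of a deep Q-learning loop with an experience-replay pool and a periodically synchronized target network. The following is a minimal Python sketch of that control flow, not the authors' implementation. Several parts are assumptions made only for illustration: a small lookup table stands in for the estimation and target networks, a plain decaying epsilon-greedy rule stands in for the adaptive policy algorithm and equation (19), a squared temporal-difference error stands in for the loss of equations (16) and (17), and `env_step` is a hypothetical five-state toy environment. All names and default values are hypothetical, and the symbols lost from the listing are not reconstructed here.

```python
import random
from collections import deque


def train(episodes=200, slots=20, n_states=5, n_actions=2, gamma=0.9,
          lr=0.1, batch_size=8, pool_size=500, sync_every=10, seed=0):
    """Sketch of the listed loop; a table replaces both networks (assumption)."""
    rng = random.Random(seed)
    theta = [[0.0] * n_actions for _ in range(n_states)]  # estimation parameters
    theta_target = [row[:] for row in theta]              # target parameters
    pool = deque(maxlen=pool_size)                        # experience-replay pool
    eps, step = 1.0, 0

    def env_step(s, a):
        # Hypothetical toy chain: action 1 moves right, action 0 moves left.
        # The reward is 1 when the move ends in the last state, else 0.
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        return s_next, (1.0 if s_next == n_states - 1 else 0.0)

    for _ in range(episodes):                             # step 1
        s = 0
        for _ in range(slots):                            # step 2
            q = theta[s]                                  # step 3: Q-values
            if rng.random() < eps:                        # step 4: pick action
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: q[i])
            # Assumed decay schedule; the source's equation (19) is not shown.
            eps = max(0.1, eps * 0.999)
            s_next, r = env_step(s, a)                    # steps 5-6
            pool.append((s, a, r, s_next))                # step 7
            if len(pool) >= batch_size:                   # step 8 (assumed test)
                batch = rng.sample(list(pool), batch_size)  # steps 9-10
                for bs, ba, br, bs2 in batch:             # steps 11-13
                    # Target value computed from the target parameters.
                    y = br + gamma * max(theta_target[bs2])
                    # Steps 14-15: one gradient step on 0.5 * (y - Q)^2.
                    theta[bs][ba] += lr * (y - theta[bs][ba])
            step += 1
            if step % sync_every == 0:                    # step 18: sync target
                theta_target = [row[:] for row in theta]
            s = s_next
    return theta


if __name__ == "__main__":
    for state, row in enumerate(train()):
        print(state, [round(v, 2) for v in row])
```

Holding the target parameters fixed between synchronizations keeps the regression target from moving on every update, which is the usual motivation for step 18. In a real setting the table would be replaced by a neural network and the inner update by an optimizer step on the mini-batch loss.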