1. Initialize the Q network parameters w and set the target Q network parameters w' = w. |
2. Initialize replay memory D with capacity N, and set the priority of every Sum Tree leaf node to p_j = 1. |
3. For i = 1 to T do |
4. Initialize s as the first state of the interceptor's current state sequence. |
5. While s is not a terminal state: |
6. a) Select an action a using the ε-greedy policy. |
7. b) Execute action a, transition to the next state s', and receive the immediate reward r. Set the termination flag d according to whether s' is a terminal state. |
8. c) Store the transition {s, a, s', r, d} in D, replacing the oldest tuple if |D| > N. |
9. d) Sample n tuples {s_j, a_j, s'_j, r_j, d_j}, j = 1, 2, 3, …, n, from D. The sampling probability is P(j) = p_j / Σ_k p_k. Compute the weight of each sample in the loss function: ω_j = (N · P(j))^(−β) / max_k ω_k. |
10. e) Compute the current target Q value y_j of each sampled tuple: y_j = r_j if d_j is true; otherwise y_j = r_j + γ max_a' Q(s'_j, a'; w'). |
11. f) Compute the loss according to equation (2) and update the Q network parameters w. |
12. g) Compute the TD error of every sampled tuple: δ_j = y_j − Q(s_j, a_j; w). Update the priority of the corresponding Sum Tree nodes: p_j = |δ_j|. |
13. h) If the total number of update steps is a multiple of C, update the target Q network parameters: w' = w. |
14. i) s=s'. |
15. End For |
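
Steps c), d), e) and g) of the algorithm above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes standard proportional prioritization (sampling probability proportional to the leaf priority), an importance-sampling exponent `beta`, a discount factor `gamma`, and a plain max over the target network outputs in the target value. The names `SumTree`, `sample`, `td_targets` and `td_errors` are hypothetical, and `q` / `q_target` stand for any callables that map a state to a list of action values.

```python
import random


class SumTree:
    """Array-based binary tree: leaves hold priorities, internal nodes hold sums."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)  # internal nodes first, then leaves
        self.data = [None] * capacity           # stored transitions
        self.write = 0                          # next leaf to overwrite (oldest tuple)
        self.size = 0

    def total(self):
        return self.tree[0]

    def update(self, leaf, priority):
        """Set the priority of one leaf and propagate the change up to the root."""
        idx = leaf + self.capacity - 1
        change = priority - self.tree[idx]
        while True:
            self.tree[idx] += change
            if idx == 0:
                break
            idx = (idx - 1) // 2

    def add(self, transition, priority=1.0):
        """Step c): store a transition, replacing the oldest one when full."""
        self.data[self.write] = transition
        self.update(self.write, priority)
        self.write = (self.write + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def get(self, value):
        """Return the leaf whose cumulative-priority interval contains value."""
        idx = 0
        while idx < self.capacity - 1:  # loop while idx is an internal node
            left = 2 * idx + 1
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = left + 1
        return idx - (self.capacity - 1)


def sample(tree, n, beta=0.4):
    """Step d): draw n leaves with P(j) = p_j / sum_k p_k and normalized weights.

    Assumes every drawn leaf has a positive priority (e.g. a filled buffer).
    """
    leaves, weights = [], []
    segment = tree.total() / n
    for k in range(n):
        # Stratified draw: one value from each of n equal priority segments.
        value = random.uniform(k * segment, (k + 1) * segment)
        leaf = tree.get(value)
        prob = tree.tree[leaf + tree.capacity - 1] / tree.total()
        leaves.append(leaf)
        weights.append((tree.size * prob) ** (-beta))
    max_w = max(weights)
    return leaves, [w / max_w for w in weights]


def td_targets(batch, q_target, gamma=0.99):
    """Step e), assumed form: y_j = r_j if d_j else r_j + gamma * max_a' Q(s'_j, a'; w')."""
    return [r if d else r + gamma * max(q_target(s2)) for (_, _, s2, r, d) in batch]


def td_errors(batch, targets, q):
    """Step g): delta_j = y_j - Q(s_j, a_j; w)."""
    return [y - q(s)[a] for (s, a, _, _, _), y in zip(batch, targets)]
```

After each update, the new priorities would be written back with `tree.update(leaf, abs(delta))` for every sampled leaf, so that transitions with a large TD error are drawn more often in later iterations.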