Input: number of episodes, Num_Episodes; number of time slices, Num_Timeslices; number of leaves of the Sumtree structure, B; exploration rate, ε; update frequency, F; learning rate, α; number of satellites, N;
Output: the weight of the online Q-Network, θ;
1. Initialize the state of the STINs, including the capacity C of the BSs, the antenna model, the channel model, the satellite orbit parameters, and the initial positions of the satellites;
2. Randomly initialize the weight θ of the online Q-Network; for the weight of the target Q-Network, let θ' = θ;
3. For episode = 1 to Num_Episodes do
4.   For t = 1 to Num_Timeslices do
5.     For i = 1 to N do
6.       Get the state information si of satellite i from the ground control centre at time t;
7.     End for
8.     Get the state information H of the BSs from the ground control centre at time t;
9.     Obtain the state information of the STINs at time t, S = (s1, s2, …, sN, H);
10.    Get the next state S' of the STINs and the corresponding termination flag from the ground control centre at time t+1;
11.    For i = 1 to N do
12.      Use the ε-greedy strategy to select an action ai;
13.      The agent executes action ai and obtains the immediate reward ri according to Equation (20);
14.      Store the state transition (si, ai, ri, si') in the Sumtree structure;
15.    End for
16.    S = S';
17.    The BSs and satellites send their state information to the ground control centre to update the state information of the STINs;
18.    Sample a minibatch of transitions from the Sumtree structure and compute the Q-value loss of each sampled transition according to Equation (29);
19.    Compute the gradient of each sample according to Equation (30);
20.    Update the weight θ of the online Q-Network via backpropagation;
21.    Compute the TD-error of each sample according to Equation (32) and update its priority according to Equation (33);
22.    Every F time slices, update the parameters of the target Q-Network: θ' = θ;
23.  End for
24. End for
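The sketches below illustrate how the main steps of the algorithm could be realized in Python with PyTorch; they are minimal illustrations under stated assumptions, not the paper's implementation. First, steps 1-2: the network architecture and dimensions are placeholders, since the listing does not fix them.

```python
import copy
import torch.nn as nn

state_dim, num_actions = 16, 8   # placeholder dimensions, not from the paper

# Step 2: randomly initialize the online Q-Network (layer sizes illustrative),
# then copy its weights into the target Q-Network so that θ' = θ.
online_net = nn.Sequential(
    nn.Linear(state_dim, 128), nn.ReLU(),
    nn.Linear(128, num_actions),
)
target_net = copy.deepcopy(online_net)
```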
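Step 12's ε-greedy selection admits a standard realization; the function and argument names here are illustrative.

```python
import random
import torch

def select_action(q_network, state, epsilon, num_actions):
    # With probability ε, explore a uniformly random action; otherwise
    # exploit the action with the highest Q-value from the online network.
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        q_values = q_network(state.unsqueeze(0))   # batch of one state
    return int(q_values.argmax(dim=1).item())
```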
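Steps 14, 18, and 21 rely on the Sumtree structure. A generic proportional sum-tree (not necessarily the paper's exact implementation) stores transitions at the B leaves, keeps the sum of children's priorities in each internal node, and supports O(log B) priority updates and prefix-sum sampling:

```python
import numpy as np

class SumTree:
    """Minimal sum-tree for proportional prioritized replay (B leaves)."""

    def __init__(self, B):
        self.B = B
        self.tree = np.zeros(2 * B - 1)   # priorities; leaves occupy the last B slots
        self.data = [None] * B            # stored transitions (si, ai, ri, si')
        self.write = 0                    # next leaf to overwrite

    def add(self, priority, transition):
        leaf = self.write + self.B - 1
        self.data[self.write] = transition
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.B   # ring buffer over the leaves

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                  # propagate the change up to the root
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def get(self, s):
        # Walk down from the root to the leaf whose priority interval covers
        # the prefix sum s; the root self.tree[0] holds the total priority.
        idx = 0
        while 2 * idx + 1 < len(self.tree):
            left = 2 * idx + 1
            if s <= self.tree[left]:
                idx = left
            else:
                s -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.B + 1]
```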
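Finally, steps 18-22 in one learning step. Since Equations (29), (30), (32), and (33) are not reproduced in the listing, this sketch assumes their standard DQN forms: a squared-TD-error loss, backpropagated gradients, TD-error δ = target − Q, and proportional priority p = |δ| + prio_eps; gamma, prio_eps, and the transition layout are assumptions.

```python
import numpy as np
import torch

def dqn_update(online_net, target_net, optimizer, tree, batch_size,
               gamma=0.99, prio_eps=1e-6):
    # Transitions are assumed stored as (state, action, reward, next_state)
    # with tensor-valued states; gamma and prio_eps are placeholders.
    total = tree.tree[0]                       # root holds the priority sum
    segment = total / batch_size
    leaves, batch = [], []
    for k in range(batch_size):                # one draw per priority segment
        s = np.random.uniform(k * segment, (k + 1) * segment)
        leaf, _, transition = tree.get(s)
        leaves.append(leaf)
        batch.append(transition)

    states = torch.stack([t[0] for t in batch])
    actions = torch.tensor([t[1] for t in batch])
    rewards = torch.tensor([t[2] for t in batch], dtype=torch.float32)
    next_states = torch.stack([t[3] for t in batch])

    q = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target from the target network; the termination flag
        # of step 10 would zero the second term for terminal states.
        target = rewards + gamma * target_net(next_states).max(dim=1).values

    td_error = target - q                      # Equation (32), assumed form
    loss = td_error.pow(2).mean()              # Equation (29), assumed MSE form
    optimizer.zero_grad()
    loss.backward()                            # Equation (30): per-sample gradients
    optimizer.step()                           # step 20: update θ

    # Step 21 / Equation (33), assumed proportional form p = |δ| + prio_eps.
    for leaf, delta in zip(leaves, td_error.detach().abs().tolist()):
        tree.update(leaf, delta + prio_eps)

def sync_target(online_net, target_net):
    target_net.load_state_dict(online_net.state_dict())   # step 22: θ' = θ
```

Calling sync_target every F time slices reproduces the periodic hard update of step 22; a soft (Polyak) update would be an alternative design choice not used in this listing.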