Input: number of episodes, Num_Episodes; number of time slices, Num_Timeslices; number of leaves of the Sumtree structure, B; exploration rate, ε; update frequency, F; learning rate, α; number of satellites, N;
Output: the weight of the online Q-Network, θ;
1. Initialize the state of the STINs, including the capacity C of the BSs, the antenna model, the channel model, the satellite orbit parameters, and the initial positions of the satellites;
2. Randomly initialize the weight θ of the online Q-Network; for the weight of the target Q-Network, let θ' = θ;
3. For episode = 1 to Num_Episodes do
4.   For t = 1 to Num_Timeslices do
5.     For i = 1 to N do
6.       Get the state information si of satellite i from the ground control centre at time t;
7.     End for
8.     Get the state information H of the BSs from the ground control centre at time t;
9.     Obtain the state information of the STINs at time t, S = (s1, s2, …, sN, H);
10.    Get the next state S' of the STINs and the corresponding termination flag from the ground control centre at time t+1;
11.    For i = 1 to N do
12.      Use the ε-greedy strategy to select an action ai;
13.      The agent executes action ai and obtains the immediate reward ri according to Equation (20);
14.      Store the state transition (si, ai, ri, si') in the Sumtree structure;
15.    End for
16.    S = S';
17.    The BSs and satellites send their state information to the ground control centre to update the state information of the STINs;
18.    Sample a minibatch of transitions from the Sumtree structure and compute the Q-value loss of each sampled transition according to Equation (29);
19.    Compute the gradient of each sample according to Equation (30);
20.    Update the weight θ of the online Q-Network via backpropagation;
21.    Compute the TD-error of each sample according to Equation (32) and update its priority according to Equation (33);
22.    Every F time slices, update the parameters of the target Q-Network: θ' = θ;
23.  End for
24. End for
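The sketches below illustrate how the main steps of the algorithm could be realized in Python with PyTorch; they are minimal illustrations under stated assumptions, not the paper's implementation. First, steps 1-2: the network architecture and dimensions are placeholders, since the listing does not fix them.

```python
import copy
import torch.nn as nn

state_dim, num_actions = 16, 8   # placeholder dimensions, not from the paper

# Step 2: randomly initialize the online Q-Network (layer sizes illustrative),
# then copy its weights into the target Q-Network so that θ' = θ.
online_net = nn.Sequential(
    nn.Linear(state_dim, 128), nn.ReLU(),
    nn.Linear(128, num_actions),
)
target_net = copy.deepcopy(online_net)
```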
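Step 12's ε-greedy selection admits a standard realization; the function and argument names here are illustrative.

```python
import random
import torch

def select_action(q_network, state, epsilon, num_actions):
    # With probability ε, explore a uniformly random action; otherwise
    # exploit the action with the highest Q-value from the online network.
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        q_values = q_network(state.unsqueeze(0))   # batch of one state
    return int(q_values.argmax(dim=1).item())
```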
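Steps 14, 18, and 21 rely on the Sumtree structure. A generic proportional sum-tree (not necessarily the paper's exact implementation) stores transitions at the B leaves, keeps the sum of children's priorities in each internal node, and supports O(log B) priority updates and prefix-sum sampling:

```python
import numpy as np

class SumTree:
    """Minimal sum-tree for proportional prioritized replay (B leaves)."""

    def __init__(self, B):
        self.B = B
        self.tree = np.zeros(2 * B - 1)   # priorities; leaves occupy the last B slots
        self.data = [None] * B            # stored transitions (si, ai, ri, si')
        self.write = 0                    # next leaf to overwrite

    def add(self, priority, transition):
        leaf = self.write + self.B - 1
        self.data[self.write] = transition
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.B   # ring buffer over the leaves

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                  # propagate the change up to the root
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def get(self, s):
        # Walk down from the root to the leaf whose priority interval covers
        # the prefix sum s; the root self.tree[0] holds the total priority.
        idx = 0
        while 2 * idx + 1 < len(self.tree):
            left = 2 * idx + 1
            if s <= self.tree[left]:
                idx = left
            else:
                s -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.B + 1]
```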
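Finally, steps 18-22 in one learning step. Since Equations (29), (30), (32), and (33) are not reproduced in the listing, this sketch assumes their standard DQN forms: a squared-TD-error loss, backpropagated gradients, TD-error δ = target − Q, and proportional priority p = |δ| + prio_eps; gamma, prio_eps, and the transition layout are assumptions.

```python
import numpy as np
import torch

def dqn_update(online_net, target_net, optimizer, tree, batch_size,
               gamma=0.99, prio_eps=1e-6):
    # Transitions are assumed stored as (state, action, reward, next_state)
    # with tensor-valued states; gamma and prio_eps are placeholders.
    total = tree.tree[0]                       # root holds the priority sum
    segment = total / batch_size
    leaves, batch = [], []
    for k in range(batch_size):                # one draw per priority segment
        s = np.random.uniform(k * segment, (k + 1) * segment)
        leaf, _, transition = tree.get(s)
        leaves.append(leaf)
        batch.append(transition)

    states = torch.stack([t[0] for t in batch])
    actions = torch.tensor([t[1] for t in batch])
    rewards = torch.tensor([t[2] for t in batch], dtype=torch.float32)
    next_states = torch.stack([t[3] for t in batch])

    q = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target from the target network; the termination flag
        # of step 10 would zero the second term for terminal states.
        target = rewards + gamma * target_net(next_states).max(dim=1).values

    td_error = target - q                      # Equation (32), assumed form
    loss = td_error.pow(2).mean()              # Equation (29), assumed MSE form
    optimizer.zero_grad()
    loss.backward()                            # Equation (30): per-sample gradients
    optimizer.step()                           # step 20: update θ

    # Step 21 / Equation (33), assumed proportional form p = |δ| + prio_eps.
    for leaf, delta in zip(leaves, td_error.detach().abs().tolist()):
        tree.update(leaf, delta + prio_eps)

def sync_target(online_net, target_net):
    target_net.load_state_dict(online_net.state_dict())   # step 22: θ' = θ
```

Calling sync_target every F time slices reproduces the periodic hard update of step 22; a soft (Polyak) update would be an alternative design choice not used in this listing.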