• Initialize the CRB.
• Initialize the policy network and the twin evaluation networks of each JE i with random weights.
• Initialize the twin target networks of each JE i by copying the weights of the corresponding twin evaluation networks.
• Set training episode e = 1.
• while training episode e ≤ E do
• Initialize the environment state S(t) = (0, 0, …, 0).
• for time step t = 1, 2, …, Tmax do
• Each JE i selects its jamming action according to its current observation.
• Execute the joint jamming action at; each JE i then obtains the shared reward rt and its next observation.
• Store the experience from all JEs in the CRB.
• if the number of experiences stored in the CRB is larger than β, then the training process begins:
• Stochastically sample a mini-batch of experiences from the CRB.
• for each JE i = 1, 2, …, N do
• Update the weights of the twin evaluation networks with (23).
• Update the weights of the policy network by (24) and (25).
• Soft-update the weights of the twin target networks through (26).
• end for
• end if
• end for
• Set training episode e = e + 1.
• end while
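The loop above can be sketched in plain NumPy. This is a toy stand-in under stated assumptions, not the paper's implementation: linear "networks" replace the actual policy and twin evaluation models, the environment step is stubbed with random observations and rewards, and simple gradient steps approximate the updates referred to as Eqs. (23)–(26). Names such as `BETA`, `T_MAX`, and `crb` mirror the pseudocode; all dimensions and hyperparameters are illustrative.

```python
import copy
import random
from collections import deque

import numpy as np

rng = np.random.default_rng(0)
N, E, T_MAX = 2, 2, 15            # JEs, training episodes, steps per episode
OBS, ACT = 3, 2                   # per-JE observation / action sizes (assumed)
BETA, BATCH, GAMMA, TAU, LR = 16, 8, 0.95, 0.01, 1e-3

class Lin:
    """Toy linear stand-in for a policy or evaluation (critic) network."""
    def __init__(self, d_in, d_out):
        self.w = rng.normal(0.0, 0.1, (d_in, d_out))
    def __call__(self, x):
        return x @ self.w

crb = deque(maxlen=5000)                                      # replay buffer (CRB)
pis = [Lin(OBS, ACT) for _ in range(N)]                       # per-JE policies
qs = [[Lin(OBS * N + ACT * N, 1) for _ in range(2)] for _ in range(N)]  # twin critics
qts = [[copy.deepcopy(q) for q in qs[i]] for i in range(N)]   # twin targets (copies)
pits = [copy.deepcopy(p) for p in pis]                        # target policies

def soft(tgt, src):
    # Soft update in the spirit of Eq. (26): w' <- tau*w + (1 - tau)*w'
    tgt.w = TAU * src.w + (1 - TAU) * tgt.w

for ep in range(E):
    obs = np.zeros((N, OBS))                                  # S(t) = (0, 0, ..., 0)
    for t in range(T_MAX):
        acts = np.stack([pis[i](obs[i]) for i in range(N)])   # each JE acts on its obs
        # Stub environment: random next observations and a shared reward.
        nobs = rng.normal(size=(N, OBS))
        r = float(rng.normal())
        crb.append((obs, acts, r, nobs))
        obs = nobs
        if len(crb) <= BETA:                                  # train only once CRB > beta
            continue
        batch = random.sample(crb, BATCH)
        o = np.stack([b[0] for b in batch])                   # (B, N, OBS)
        a = np.stack([b[1] for b in batch])                   # (B, N, ACT)
        rw = np.array([b[2] for b in batch])                  # (B,)
        no = np.stack([b[3] for b in batch])                  # (B, N, OBS)
        na = np.stack([pits[i](no[:, i]) for i in range(N)], axis=1)
        x = np.concatenate([o.reshape(BATCH, -1), a.reshape(BATCH, -1)], axis=1)
        nx = np.concatenate([no.reshape(BATCH, -1), na.reshape(BATCH, -1)], axis=1)
        for i in range(N):
            # Critic target uses the min of the twin target networks (cf. Eq. (23)).
            y = rw + GAMMA * np.minimum(qts[i][0](nx), qts[i][1](nx)).ravel()
            for q in qs[i]:
                err = q(x).ravel() - y
                q.w -= LR * x.T @ err[:, None] / BATCH        # TD gradient step
            # Deterministic policy gradient through critic 1 (cf. Eqs. (24)-(25)):
            # for a linear critic, dQ/da_i is the weight slice of JE i's action.
            dq_da = qs[i][0].w[OBS * N + ACT * i : OBS * N + ACT * (i + 1)]
            pis[i].w += LR * (o[:, i].T @ np.tile(dq_da.T, (BATCH, 1))) / BATCH
            soft(qts[i][0], qs[i][0])
            soft(qts[i][1], qs[i][1])
            soft(pits[i], pis[i])
```

The sketch keeps the centralized-critic / decentralized-actor structure implied by the pseudocode: every critic sees all observations and actions, while each policy acts only on its own JE's observation.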