• Initialize the CRB (common replay buffer).
• Initialize the policy network and the twin evaluation networks for each JE i, each with its own weights.
• Initialize the twin target networks for each JE i with their respective weights.
• Set training episode = 1.
• while training episode ≤ E do
•   Initialize the environment state S(t) = (0, 0, …, 0).
•   for time step t = 1, 2, …, Tmax do
•     Each JE i selects its jamming action according to its current observation.
•     Carry out the joint jamming action a_t; each JE i then obtains the shared reward r_t and its next observation.
•     Store the experience from all JEs in the CRB.
•     if the number of experiences stored in the CRB exceeds β, the training process begins:
•       Stochastically sample a mini-batch of experiences from the CRB.
•       for each JE i = 1, 2, …, N do
•         Update the weights of the twin evaluation networks with (23).
•         Update the weights of the policy network by (24) and (25).
•         Soft-update the weights of the twin target networks through (26).
•       end for
•     end if
•   end for
• end while
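The loop above can be sketched in Python, assuming TD3-style updates stand in for equations (23)–(26): a clipped double-Q target for the twin evaluation networks, a deterministic policy gradient step for the policy, and Polyak averaging for the twin target networks. Toy linear networks replace the paper's neural networks, the environment is a random stand-in, and all class, variable, and hyperparameter names (`LinearNet`, `JE`, `beta`, learning rate, etc.) are illustrative assumptions, not the paper's definitions:

```python
import random
from collections import deque

import numpy as np


class LinearNet:
    """Toy linear function approximator standing in for a neural network."""
    def __init__(self, in_dim, rng):
        self.w = rng.normal(scale=0.1, size=in_dim)

    def __call__(self, x):
        return float(np.dot(self.w, x))


class JE:
    """One jamming entity: policy, twin evaluation nets, twin target nets."""
    def __init__(self, obs_dim, rng):
        self.policy = LinearNet(obs_dim, rng)       # deterministic policy
        self.q1 = LinearNet(obs_dim + 1, rng)       # twin evaluation networks
        self.q2 = LinearNet(obs_dim + 1, rng)
        self.q1_t = LinearNet(obs_dim + 1, rng)     # twin target networks,
        self.q2_t = LinearNet(obs_dim + 1, rng)     # initialized as copies
        self.q1_t.w = self.q1.w.copy()
        self.q2_t.w = self.q2.w.copy()

    def act(self, obs, noise=0.0, rng=None):
        a = np.tanh(self.policy(obs))               # action squashed to [-1, 1]
        if noise and rng is not None:
            a = np.clip(a + rng.normal(scale=noise), -1.0, 1.0)
        return float(a)


def td3_update(je, batch, gamma=0.99, lr=1e-3, tau=0.005):
    """TD3-style stand-in for the (23)-(26) updates of one JE."""
    for obs, act, rew, obs_next in batch:
        # Clipped double-Q target (target policy smoothing omitted), ~(23).
        a_next = je.act(obs_next)
        xa_next = np.append(obs_next, a_next)
        y = rew + gamma * min(je.q1_t(xa_next), je.q2_t(xa_next))
        xa = np.append(obs, act)
        for q in (je.q1, je.q2):
            td = q(xa) - y
            q.w -= lr * td * xa                     # grad of 0.5 * td^2
        # Deterministic policy gradient ascent on Q1, ~(24)-(25).
        dq_da = je.q1.w[-1]                         # dQ1/da for a linear critic
        da_dtheta = (1.0 - np.tanh(je.policy(obs)) ** 2) * obs
        je.policy.w += lr * dq_da * da_dtheta
        # Soft (Polyak) update of the twin target networks, ~(26).
        for q, qt in ((je.q1, je.q1_t), (je.q2, je.q2_t)):
            qt.w = tau * q.w + (1.0 - tau) * qt.w


rng = np.random.default_rng(0)
N, obs_dim, beta, batch_size = 2, 3, 16, 4
crb = deque(maxlen=1000)                            # common replay buffer
jes = [JE(obs_dim, rng) for _ in range(N)]
obs = [np.zeros(obs_dim) for _ in range(N)]         # S(t) = (0, 0, ..., 0)

for t in range(100):
    acts = [je.act(o, noise=0.1, rng=rng) for je, o in zip(jes, obs)]
    # Stand-in environment: one shared reward, random next observations.
    reward = -float(np.mean(np.square(acts)))
    obs_next = [rng.normal(size=obs_dim) for _ in range(N)]
    crb.append((list(obs), acts, reward, obs_next))
    obs = obs_next
    if len(crb) > beta:                             # training begins past beta
        batch = random.sample(list(crb), batch_size)
        for i, je in enumerate(jes):
            per_agent = [(o[i], a[i], r, on[i]) for o, a, r, on in batch]
            td3_update(je, per_agent)
```

One design point the sketch preserves: experiences from all JEs go into a single shared buffer, but each JE samples its own slice of the mini-batch and updates its own policy, evaluation, and target weights independently.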