Input: learning rate $\alpha$, discount factor $\gamma$, exploration factor $\epsilon$, the individual original MDP of agent $i$, the individual optimal $Q$-values of agent $i$, a threshold value proportion, an integer for Monte Carlo sampling, and the time limit per learning episode.
(1) Initialize agent $i$'s $Q$-values with its individual optimal policy and set the time step $t$ to $0$;
(2) Identify the coordinated states for agent $i$ by calling Algorithm 1;
(3) Initialize the local state for agent $i$ and check whether the initial state is in coordination;
(4) for each time step $t$ within the per-episode time limit do
(5)     observe the current local state $s_i$ of agent $i$;
(6)     form the global state from the current local states of all agents;
(7)     if agent $i$ is in coordination at time $t$ then
(8)         select $a_i$ according to an $\epsilon$-greedy policy using the augmented (joint-state) $Q$-values;
(9)     else
(10)        select $a_i$ according to an $\epsilon$-greedy policy using the individual $Q$-values;
(11)    end if
(12)    receive the reward $r_i$ and the next state $s_i'$ for each agent $i$;
(13)    if, for agent $i$, the new local state $s_i'$ is part of an augmented coordinated state and is included in the new global state then
(14)        if the new joint state is not in the augmented state space then
(15)            extend the augmented state space to include the new joint state and all of its available state-action pairs;
(16)            initialize the $Q$-values of the newly added state-action pairs;
(17)        end if
(18)        mark that agent $i$ is in coordination at time $t+1$ and that its coordinated state is this joint state;
(19)    end if
(20)    if agent $i$ is in coordination at time $t$ then
(21)        if agent $i$ is in coordination at time $t+1$ then
(22)            update the $Q$-values according to (5);
(23)        else
(24)            update the $Q$-values according to (6);
(25)        end if
(26)    else
(27)        if agent $i$ is in coordination at time $t+1$ then
(28)            update the $Q$-values according to (7);
(29)        else
(30)            update the $Q$-values according to (8);
(31)        end if
(32)    end if
(33)    set $s_i \leftarrow s_i'$, carry over the coordination status for each agent $i$, and $t \leftarrow t+1$;
(34)    if the new state is a terminal state then return;
(35) end for
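For concreteness, the Python sketch below mirrors one episode of the loop above under several stated assumptions. The environment interface (reset, step, actions), the dictionary-based tables Q_ind and Q_aug, and the set coord_states produced by Algorithm 1 are hypothetical names introduced here for illustration. Because Eqs. (5)-(8) are not reproduced in this listing, the four update branches are stood in for by standard Q-learning targets that bootstrap from the augmented table when the next state is coordinated and from the individual table otherwise; this is a sketch of the control flow, not the paper's exact update rules.

import random

def run_episode(env, n_agents, Q_ind, Q_aug, coord_states,
                alpha=0.1, gamma=0.95, eps=0.1, t_limit=200):
    """One learning episode of the coordination-aware Q-learning loop (sketch).

    Assumed, hypothetical environment interface:
      env.reset()        -> list of initial local states, one per agent
      env.step(actions)  -> (next_local_states, rewards, done)
      env.actions(i)     -> iterable of actions available to agent i

    Q_ind[i] : dict (local_state, action) -> value, the individual Q-table
    Q_aug[i] : dict (joint_state, action) -> value, the augmented Q-table
    coord_states[i] : set of local states flagged as coordinated (Algorithm 1)
    """
    local = list(env.reset())
    in_coord = [local[i] in coord_states[i] for i in range(n_agents)]

    for t in range(t_limit):
        joint = tuple(local)

        # Action selection: epsilon-greedy on the augmented table when the
        # agent is in coordination, otherwise on its individual table.
        actions = []
        for i in range(n_agents):
            acts = list(env.actions(i))
            if random.random() < eps:
                actions.append(random.choice(acts))
            else:
                table, key = (Q_aug[i], joint) if in_coord[i] else (Q_ind[i], local[i])
                actions.append(max(acts, key=lambda a: table.get((key, a), 0.0)))

        next_local, rewards, done = env.step(actions)
        next_joint = tuple(next_local)

        # State-space extension: if the new local state is a coordinated state,
        # make sure the new joint state exists in the augmented table, seeding
        # its entries from the individual Q-values (an assumption of this sketch).
        next_coord = []
        for i in range(n_agents):
            coordinated = next_local[i] in coord_states[i]
            if coordinated:
                for a in env.actions(i):
                    Q_aug[i].setdefault((next_joint, a),
                                        Q_ind[i].get((next_local[i], a), 0.0))
            next_coord.append(coordinated)

        # Update step: stands in for Eqs. (5)-(8). The bootstrap term comes from
        # whichever table governs the next state; the updated entry belongs to
        # whichever table governed the current state.
        for i in range(n_agents):
            acts = list(env.actions(i))
            if next_coord[i]:
                best_next = max(Q_aug[i].get((next_joint, a), 0.0) for a in acts)
            else:
                best_next = max(Q_ind[i].get((next_local[i], a), 0.0) for a in acts)
            target = rewards[i] + gamma * best_next
            if in_coord[i]:
                old = Q_aug[i].get((joint, actions[i]), 0.0)
                Q_aug[i][(joint, actions[i])] = old + alpha * (target - old)
            else:
                old = Q_ind[i].get((local[i], actions[i]), 0.0)
                Q_ind[i][(local[i], actions[i])] = old + alpha * (target - old)

        local, in_coord = list(next_local), next_coord
        if done:
            return

In this sketch each agent keeps two tables and the coordination status is carried across time steps, so the augmented table only grows when a detected coordinated state actually recurs, which is the intended benefit of restricting coordination to the states identified by Algorithm 1.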