Algorithm 2
The DQN algorithm and experience replay.
Initialize replay memory M to capacity C
Initialize the action-value function Q with random weights
For episode = 1, N do (for each of the N episodes)
  Initialize state S1 = (X1) (the starting computer-screen pixels) at the beginning of each episode
  // Pre-process the screen and feed it to our DQN,
  // which regresses the Q-values of all possible actions in that state.
  For t = 1, T do
    Choose an action At using the epsilon-greedy policy:
    // with probability epsilon, choose a random action, and with probability 1 − epsilon,
    // choose the action with the maximum Q-value, i.e., At = argmax_a Q(St, a)
    Execute action At in the emulator and observe reward Rt and image Xt+1
    Set St+1 = St, At, Xt+1 and pre-process it
    Store the transition (St, At, Rt, St+1) in M
    Sample a random mini-batch of transitions from M
    Set the goal (target) Q-value: the reward alone if the episode terminates at the next state, otherwise the reward plus the discounted maximum Q-value of the next state
    Compute the loss function,
    // which is just the squared difference between the goal Q and the predicted Q,
    and perform gradient descent with respect to the actual network parameters to reduce this loss
    After every fixed number of iterations, copy the actual network weights to the goal network weights
  End for
End for
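Below is a minimal sketch of how the pseudocode of Algorithm 2 could be realized in Python with PyTorch. The state dimension, network sizes, hyperparameters (epsilon, gamma, replay capacity, synchronization interval), and the random dummy transitions standing in for the emulator are illustrative assumptions rather than values from the article; the structure simply mirrors the listed steps: epsilon-greedy action selection, storing transitions in a replay memory, sampling a random mini-batch, computing the squared difference between the goal Q and the predicted Q, a gradient-descent step, and a periodic copy of the actual network weights to the goal (target) network.

```python
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 2                   # assumed problem dimensions
GAMMA, EPSILON, SYNC_EVERY = 0.99, 0.1, 100   # assumed hyperparameters


def make_q_net():
    # Regresses one Q-value per possible action for the given state.
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                         nn.Linear(64, N_ACTIONS))


q_net, goal_net = make_q_net(), make_q_net()
goal_net.load_state_dict(q_net.state_dict())  # start with identical weights
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
memory = deque(maxlen=10_000)                 # replay memory M with capacity C


def select_action(state):
    # Epsilon-greedy: random action with probability epsilon,
    # otherwise the action with the maximum predicted Q-value.
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
    return int(q_values.argmax())


def train_step(batch_size=32):
    if len(memory) < batch_size:
        return
    batch = random.sample(memory, batch_size)  # random mini-batch from M
    states, actions, rewards, next_states, dones = map(
        lambda column: torch.as_tensor(column, dtype=torch.float32), zip(*batch))
    # Predicted Q-values of the actions that were actually taken.
    q_pred = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    # Goal (target) Q: the reward, plus the discounted maximum Q-value
    # of the next state when the episode did not terminate there.
    with torch.no_grad():
        q_next = goal_net(next_states).max(dim=1).values
        q_goal = rewards + GAMMA * q_next * (1.0 - dones)
    loss = nn.functional.mse_loss(q_pred, q_goal)  # squared difference
    optimizer.zero_grad()
    loss.backward()                                # gradient descent on the loss
    optimizer.step()


# Illustrative loop: random dummy transitions stand in for the emulator/screen.
for step in range(1, 501):
    state = torch.rand(STATE_DIM).tolist()
    next_state = torch.rand(STATE_DIM).tolist()
    action = select_action(state)
    reward, done = random.random(), float(step % 50 == 0)
    memory.append((state, action, reward, next_state, done))  # store transition in M
    train_step()
    if step % SYNC_EVERY == 0:                 # copy actual weights to goal network
        goal_net.load_state_dict(q_net.state_dict())
```

In this sketch the goal network is only refreshed every SYNC_EVERY steps, which keeps the regression target fixed between copies and is the usual reason the algorithm maintains two sets of weights.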