Reinforcement Learning-Based Multiple Constraint Electric Vehicle Charging Service Scheduling
Algorithm 1: Policy gradient algorithm
(1)
Randomly initialize the parameter set of the neural network and initialize the trajectory counter.
(2)
Reset the accumulated variable for the new trajectory, randomly initialize the action and the output state, calculate the local reward, and add the trajectory generated by the action to the stored training trajectory.
(3)
Input the current state to the neural network and select a random action.
(4)
The simulation environment executes the action, returns the output state, and calculates the local reward; the trajectory generated by the action is then added to the stored training trajectory.
(5)
Check whether the end-of-trajectory condition holds; if it does, go to step 6; otherwise, update the accumulated variable and go to step 3. The quantities involved in this step are the variable to be accumulated and the expected value of the total reward for a single trajectory.
(6)
Calculate the strategy optimization function.
(7)
Increment the trajectory counter, update the parameter set of the strategy, and check the counter against the maximum number of trajectories. If the maximum has not been reached, go to step 2; otherwise, the reinforcement learning training process ends, and the updated parameter set is saved as the optimal parameter set, together with the optimal strategy.
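
The steps above can be sketched in code. The following is a minimal illustration only, not the authors' implementation: it uses a plain REINFORCE update with a tabular softmax policy in place of the neural network, a hypothetical four-slot toy environment (`env_step`, `GOOD_ACTION`), and a fixed trajectory length (`HORIZON`) as the end-of-trajectory condition. All of these names and values are assumptions made for the example, and every action, including the first, is sampled from the current policy.

```python
import math
import random

HORIZON = 4                  # assumed fixed trajectory length
GOOD_ACTION = [0, 1, 1, 0]   # hypothetical rewarded action per time slot


def softmax(prefs):
    """Turn action preferences into probabilities."""
    peak = max(prefs)
    exps = [math.exp(p - peak) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def env_step(state, action):
    """Hypothetical environment: local reward 1 for the rewarded action, else 0."""
    reward = 1.0 if action == GOOD_ACTION[state] else 0.0
    return state + 1, reward


def train(max_trajectories=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial parameters, one preference per (state, action).
    theta = [[rng.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(HORIZON)]
    for _ in range(max_trajectories):      # Step 7: loop over trajectories
        state, trajectory = 0, []          # Step 2: start a new trajectory
        while state < HORIZON:             # Step 5: end-of-trajectory check
            probs = softmax(theta[state])  # Step 3: state in, action sampled
            action = 0 if rng.random() < probs[0] else 1
            next_state, reward = env_step(state, action)  # Step 4
            trajectory.append((state, action, reward))
            state = next_state
        # Step 6: REINFORCE gradient estimate, weighted by the reward-to-go.
        reward_to_go = 0.0
        for s, a, r in reversed(trajectory):
            reward_to_go += r
            probs = softmax(theta[s])
            for b in range(2):
                grad_log = (1.0 if b == a else 0.0) - probs[b]
                theta[s][b] += lr * reward_to_go * grad_log  # Step 7: update
    return theta


def greedy_policy(theta):
    """Pick the action with the larger preference in each state."""
    return [0 if row[0] >= row[1] else 1 for row in theta]
```

Calling `greedy_policy(train())` returns one action per time slot. On this toy problem the result usually matches `GOOD_ACTION`, but a REINFORCE update without a baseline is noisy, so agreement in every slot is not guaranteed. A practical scheduler would replace the table with a neural network and the toy environment with a charging simulator.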