Learning to Drive in the NGSIM Simulator Using Proximal Policy Optimization
Algorithm 1: PPO.
Input: the randomly initialized parameters of the Actor-Critic and the initial learning rate.
For each training iteration, starting from 0 and running to the final iteration, repeat the following steps:
Use the current policy to interact with the NGSIM environment for a fixed number of steps, record the agent's trajectories, and calculate the reward according to equation (10) for every state in the trajectories.
Compute the advantage estimates using generalized advantage estimation (GAE).
Compute the gradient according to equation (7) over a set number of epochs with a fixed minibatch size, and update the Actor-Critic parameters using the Adam optimizer.
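The advantage step of Algorithm 1 can be illustrated with a minimal sketch of GAE in plain Python. This is not the authors' implementation: the function name `compute_gae`, its argument layout, and the default discount `gamma` and smoothing factor `lam` are assumptions made for illustration, and the rewards are taken as already computed by equation (10).

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one rollout of length T.

    rewards[t], values[t] and dones[t] describe step t; last_value is the
    critic's estimate for the state reached after the final step.
    gamma and lam are illustrative defaults, not values from the paper.
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    next_value = last_value
    # Walk the rollout backwards so each step can reuse the running sum.
    for t in reversed(range(T)):
        # Do not bootstrap across an episode boundary.
        not_done = 0.0 if dones[t] else 1.0
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of future TD errors.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Value targets for the critic: advantage plus baseline.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# Hypothetical three-step episode with a zero-valued critic.
adv, ret = compute_gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.0, 0.0, 0.0],
    last_value=0.0,
    dones=[False, False, True],
    gamma=1.0,
    lam=1.0,
)
print(adv)  # [3.0, 2.0, 1.0]
```

With `gamma = lam = 1` and a zero baseline, the estimate reduces to the undiscounted reward-to-go, which is why the example prints `[3.0, 2.0, 1.0]`. Setting `lam = 0` instead yields the one-step temporal-difference error at each step.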