Research Article

Supervised Reinforcement Learning for ULV Path Planning in Complex Warehouse Environment

Algorithm 1: The training procedure of the SDRL.
Input: Expert data and initial parameters;
for each training iteration do
 Update the discriminator by ascending the stochastic gradient;
 Update the internal rewards and external rewards;
 Update the value function;
 Update the policy of the DRL;
end
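
The loop in Algorithm 1 can be illustrated with a minimal, self-contained sketch. It is not the authors' implementation: the toy one-dimensional corridor, the tabular logistic discriminator, the internal reward log D(s, a), the TD(0) value update, the actor-critic policy step, and every name and hyperparameter (`train_sdrl_sketch`, `int_weight`, the learning rates) are assumptions chosen only to make the four update steps concrete.

```python
import math
import random

N_STATES, N_ACTIONS, GOAL = 5, 2, 4   # toy 1-D corridor; actions: 0 = left, 1 = right

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(prefs):
    m = max(prefs)                      # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def step(state, action):
    """Hypothetical environment: external reward 1.0 on reaching GOAL, else 0.0."""
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def train_sdrl_sketch(expert_pairs, iterations=300, lr_d=0.1, lr_v=0.1,
                      lr_pi=0.1, gamma=0.95, int_weight=0.5, seed=0):
    rng = random.Random(seed)
    # Input: initial parameters (all zeros here, which is an assumption).
    w = [[0.0] * N_ACTIONS for _ in range(N_STATES)]      # discriminator logits
    theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # policy preferences
    value = [0.0] * N_STATES                              # state-value table
    for _iteration in range(iterations):
        # Roll out the current policy for one episode of at most 20 steps.
        state, rollout = 0, []
        for _t in range(20):
            probs = softmax(theta[state])
            action = rng.choices(range(N_ACTIONS), weights=probs)[0]
            nxt, r_ext, done = step(state, action)
            rollout.append((state, action, r_ext, nxt, done))
            state = nxt
            if done:
                break
        # 1) Discriminator: ascend log D(expert) + log(1 - D(policy)).
        for s, a in expert_pairs:
            w[s][a] += lr_d * (1.0 - sigmoid(w[s][a]))
        for s, a, *_rest in rollout:
            w[s][a] -= lr_d * sigmoid(w[s][a])
        for s, a, r_ext, nxt, done in rollout:
            # 2) Internal reward from the discriminator, external from the environment.
            r_int = math.log(sigmoid(w[s][a]) + 1e-8)
            reward = r_ext + int_weight * r_int
            # 3) Value function: one TD(0) step toward the bootstrapped target.
            target = reward + (0.0 if done else gamma * value[nxt])
            td_error = target - value[s]
            value[s] += lr_v * td_error
            # 4) Policy: actor-critic gradient step on the softmax preferences.
            probs = softmax(theta[s])
            for b in range(N_ACTIONS):
                grad = (1.0 if b == a else 0.0) - probs[b]
                theta[s][b] += lr_pi * td_error * grad
    return theta, value, w

if __name__ == "__main__":
    # Hypothetical expert demonstrations: always move right toward the goal.
    expert = [(s, 1) for s in range(GOAL)]
    theta, value, w = train_sdrl_sketch(expert)
    for s in range(GOAL):
        print(s, [round(p, 3) for p in softmax(theta[s])])
```

In this sketch the internal reward penalizes state-action pairs that the discriminator attributes to the policy rather than to the expert, while the external reward comes from the environment alone; the weighting between the two (`int_weight`) is a free choice here and is not taken from the paper.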