| Category | Study title | Approach | Merit | Limitations | Ref |
| --- | --- | --- | --- | --- | --- |
| DRL based on value function | An improved algorithm of robot path planning in complex environment based on double DQN | Double DQN | The shortage of experience samples is addressed by redefining the robot's initialization and the reward function for free positions | Slow convergence | [25] |
| DRL based on value function | The USV path planning of dueling DQN algorithm based on tree sampling mechanism | Dueling DQN | Identifies and avoids static obstacles and achieves autonomous navigation in complex environments | Weak internal connection between state-action pairs | [26] |
| DRL based on value function | Tactical UAV path optimization under radar threat using deep reinforcement learning | DQN-PER | Alleviates the sparse reward problem | Overestimation of the state-action value | [27] |
| DRL based on policy gradient | Advanced double layered multi-agent systems based on A3C in real-time path planning | A3C | Correlation between state-distribution samples is eliminated, and the sample storage of the experience replay mechanism is replaced | May converge to a locally optimal policy | [28] |
| DRL based on policy gradient | The path-planning algorithm of unmanned ship based on DDPG | DDPG | Applicable to continuous state and action spaces | Sensitive to hyperparameters | [29] |
| DRL based on policy gradient | Hindsight trust region policy optimization | TRPO | Can choose a more appropriate step size during training | Prone to large errors with large environments and policies | [30] |
| DRL based on policy gradient | PPO-based reinforcement learning for UAV navigation in urban environments | PPO | Better data efficiency and robustness | Each update must keep the new policy close to the old one | [31] |
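
The value-overestimation limitation listed for the DQN-based approach, and the reason Double DQN is used as an alternative, can be illustrated by comparing how the two methods form their bootstrap targets. The following is a minimal sketch, not the implementation used in any of the cited studies; the function names and the example Q-values are hypothetical and chosen only for illustration.

```python
def dqn_target(reward, gamma, q_target_next, done):
    """DQN target: the target network both selects and evaluates the next action."""
    bootstrap = 0.0 if done else max(q_target_next)
    return reward + gamma * bootstrap


def double_dqn_target(reward, gamma, q_online_next, q_target_next, done):
    """Double DQN target: the online network selects the next action,
    and the target network evaluates that action."""
    if done:
        return reward
    # Index of the action the online network considers best.
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]


# Hypothetical Q-values for one next state with three actions.
q_online_next = [0.2, 0.9, 0.1]   # online network prefers action 1
q_target_next = [0.3, 0.5, 2.0]   # target network has an inflated value for action 2

y_dqn = dqn_target(1.0, 0.9, q_target_next, done=False)
y_ddqn = double_dqn_target(1.0, 0.9, q_online_next, q_target_next, done=False)
```

For the same Q-values and a non-negative discount factor, the Double DQN target can never exceed the DQN target, because the value of any single action is bounded by the maximum over all actions. This is the sense in which decoupling action selection from evaluation reduces overestimation.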