Abstract
Unmanned ships navigate on the water in an autonomous or semiautonomous way and can be widely used in maritime transportation, intelligence collection, maritime training and testing, reconnaissance, and evidence collection. In this paper, we use deep reinforcement learning to solve the optimization problem in the path planning and management of unmanned ships. Specifically, we take the waiting time (phase and duration) at the corners of the path as the optimization goal to minimize the total travel time of unmanned ships passing through the path. We propose a new reward function that considers both the environment and the control delay of unmanned ships and that also reduces the coordination time between unmanned ships. In simulation experiments, quantitative and qualitative results on unmanned ship navigation and path corner waiting verify the effectiveness of our solution.
1. Introduction
Unmanned ships are intelligent platforms that rely on shipboard sensors to navigate in an autonomous or semiautonomous manner on the surface of the water and can be widely used in fields such as marine transportation, mine countermeasures, and antisubmarine warfare. The unmanned ship is an important node in the networked unmanned system; it will overturn the traditional style of naval warfare, give rise to a new marine equipment system, and is of great significance for the development of marine resources and the maintenance of national maritime rights and interests [1]. Compared with other unmanned systems, unmanned ships face special challenges such as the harsh marine environment (e.g., strong wave and current surges) and the special characteristics of unmanned ship motion models (e.g., high nonlinearity, strong time lag, and time variability) [2].
Path planning [3, 4] is a very important technique in the field of unmanned ships and has attracted the attention of countless researchers. Depending on the planning method, there are two different types of path planning: point-to-point and full-coverage traversal. If the basic information about the entire environment is known to the unmanned vessel while it completes its task, this is called global path planning, and the main algorithms include greedy algorithms, genetic algorithms, and others [5–7]. If the unmanned vessel knows only part of the environmental information during operation and cannot grasp the full picture, this is called local path planning, and the main algorithms include the potential field method [8], the fuzzy control method [9], and neural networks [10, 11].
In recent years, with the rapid progress of high-performance computing, big data, and deep learning technology, reinforcement learning algorithms, a core technology of artificial intelligence software, and their applications have attracted wider attention and developed more rapidly [12]. In particular, the combination of reinforcement learning and deep learning has led to several breakthroughs in deep reinforcement learning, and the games between AlphaGo and top human players have drawn broader interest in deep reinforcement learning from academia and industry. Reinforcement learning has not only been a great success in computer gaming but is also considered one of the most promising approaches to advanced artificial intelligence in areas such as unmanned ship control, inverted pendulum control, and intelligent driving of automobiles [13–16].
Standard Q-learning (QL) algorithms generally initialize the Q values to 0 or a random number, so the unmanned ship lacks a priori knowledge of the environment and convergence is slow. During learning, it is difficult to balance long-term and short-term benefits, and the agent easily falls into trap regions when faced with a “symmetric dilemma.” Therefore, the convergence of QL algorithms, the balance between exploration and exploitation, and the handling of dangerous regions and incomplete access to state-action pairs have become focal points of reinforcement learning research.
This paper uses reinforcement learning for local path planning of mobile unmanned ships to improve the efficiency of planning optimal paths. The research objectives are to address the slow convergence, the exploration-exploitation dilemma, and the problems of trap regions and incomplete access to state-action pairs in the reinforcement learning algorithm, to speed up path planning, and to find the control rules for the optimal path from the starting point to the endpoint.
2. Related Works
2.1. Path Optimization
Based on the situational awareness map, unmanned vessel navigation planning considers mission requirements, safety, efficiency, rules, maneuverability, and uncertainty to compute the key elements of planning and navigation and to form mutually compatible commands of different granularity, so that the unmanned vessel can operate effectively while satisfying the navigation safety envelope. Unmanned vessel navigation planning faces special challenges, such as the many maritime rules with fuzzy properties, the large time lag in the hull model, high inertia, and large differences between vessel types. [17] proposes a predictor for estimating the sideslip angle of an unmanned ship to achieve path following based on predicted line of sight. [18] uses an extended state observer for real-time estimation of the unmanned ship sideslip angle and combines it with line-of-sight navigation to solve the path-following problem under disturbances. [19] proposes global and local route planning based on the Dijkstra and artificial potential field methods. [20] proposes backstepping adaptive sliding mode control in the Serret-Frenet coordinate system to solve the path-following problem under model and disturbance uncertainty. [21] implements unmanned ship path-following control based on an improved backstepping method. [22] proposes three PID control methods to solve the linear path-following problem under constant surge disturbance.
[23] used evidence-based reasoning to evaluate hazards and, based on the evaluation results, used mutual collision avoidance algorithms that satisfy maritime collision avoidance rules to achieve real-time safe obstacle avoidance of unmanned vessels [24, 25].
2.2. Unmanned Boat Control
Unmanned vessel control solves the problems of dynamic positioning, trajectory tracking, and path tracking during navigation, giving the unmanned vessel the control capabilities of an experienced pilot so that it can successfully and stably perform the various maneuvers required for navigation. With the development of control theory, researchers in the marine field can apply the latest control techniques to unmanned ship control. However, the control of unmanned ships faces challenges such as high model nonlinearity and uncertainty, underactuation, time lags in the ship itself and in the actuators, saturation characteristics of the actuators, and unpredictable strong external disturbances.
[10, 11] systematically describe the progress of research on the control of marine electromechanical systems and on course keeping of unmanned vessels, respectively. [12] performs model identification of unmanned vessels based on an active method. [13] performs model identification of an unmanned boat with integrated propulsion based on the idea of MMG separation modelling and also proposes a robust controller with fast convergence based on multimodel control. [14] uses an active antidisturbance control law based on compound errors to suppress external disturbances. [15] implements the control of an unmanned ship based on the GPC-PID method and conducted sea trials in the southern Yellow Sea. [16] proposes robust control with variable delay using a Smith predictor and an extended state observer.
2.3. Unmanned Vessel Cluster Control
Single-ship capabilities are particularly weak in the face of vast and hostile oceans, which motivates cluster control. [16] designed a layered control architecture for cluster target tracking, cluster obstacle avoidance, and collision avoidance among members within the cluster; the architecture is divided into three layers: the cluster strategy layer, the motion planning layer, and the control input layer. In such layered designs, the path-following control layer uses a backstepping and neural network approach, and a graph-theoretic approach is used for the speed and route planning layer. [17] uses neural optimization for distributed cluster navigation and a fuzzy approach for approximating the unmanned ship model in path maneuver cluster control.
3. Preparation
This section provides an overview of reinforcement learning and the deep reinforcement learning utilised in this paper.
Reinforcement learning is the scheme by which an agent infers the best action rule through its interaction with the environment. A Markov decision process (MDP) is defined by a 4-tuple $(\mathcal{S}, \mathcal{A}, P, R)$. $\mathcal{S}$ is called the state space and $\mathcal{A}$ the action space, and $s \in \mathcal{S}$ and $a \in \mathcal{A}$ denote individual states and actions. $P(s' \mid s, a)$ is called the state transition function and determines the probability of transferring to the next state $s'$ when action $a$ is performed in state $s$. $R(s, a)$ is the reward function.
Once a policy has been formulated, the agent can interact with the environment as shown in Figure 1. At each time step $t$, the agent in state $s_t$ decides on an action $a_t$ according to the policy $\pi$. The agent's state $s_{t+1}$ and reward $r_t$ at the next time step are then determined by the state transition function and the reward function. Repeating this interaction yields the agent's history of states and actions, and the sequence of states, actions, and rewards accumulated from time 0 is denoted $h_t = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_t, a_t, r_t)$.
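To make the interaction loop above concrete, the following is a minimal Python sketch (not code from the original paper) of one episode of agent-environment interaction; the `env.reset`/`env.step` interface and the `policy` callable are illustrative assumptions.

```python
def run_episode(env, policy, max_steps=100):
    """Roll out one episode of the MDP interaction loop: at each step the
    agent observes the state, samples an action from its policy, and the
    environment returns the next state and reward."""
    state = env.reset()
    history = []                      # h_t = (s_0, a_0, r_0, ..., s_t, a_t, r_t)
    for t in range(max_steps):
        action = policy(state)        # a_t chosen according to pi(. | s_t)
        next_state, reward, done = env.step(action)  # sampled from P and R
        history.append((state, action, reward))
        state = next_state
        if done:
            break
    return history
```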

Define the action value function, which is the average of the (discounted) reward sum when action $a$ is selected in state $s$ and the policy $\pi$ is followed thereafter, as given by
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0}=s,\; a_{0}=a\right],$$
where $\gamma \in [0, 1)$ is the discount rate and $\mathbb{E}_{\pi}$ denotes the expectation over trajectories generated under policy $\pi$. When a policy $\pi'$ satisfies $Q^{\pi'}(s, a) \geq Q^{\pi}(s, a)$ for any $s$ and $a$, policy $\pi'$ can be expected to bring more reward to the agent than $\pi$; the goal of reinforcement learning is to obtain the optimal policy $\pi^{*}$ that satisfies this condition with respect to any policy $\pi$ and any $(s, a)$.
The optimal policy is obtained from its value function (the optimal value function), denoted $Q^{*}(s, a)$. The optimal value function satisfies the optimal Bellman equation:
$$Q^{*}(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a\right].$$
The optimal value function is known to satisfy this condition and can be estimated using the relation above. A representative method is Q-learning; many experiments have shown that it works well when the state space is discrete and the number of states is not huge, but it is difficult to apply to continuous and large state spaces.
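As a concrete illustration of the Q-learning method mentioned above, the following is a minimal tabular sketch; the environment interface, hyperparameter values, and episode count are illustrative assumptions rather than the settings used in this paper.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=5000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: move Q(s, a) towards r + gamma * max_a' Q(s', a')."""
    Q = defaultdict(float)            # Q values start at 0 (no prior knowledge)
    actions = env.action_space        # assumed: a finite list of discrete actions

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            # update towards the Bellman optimality target
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```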
4. Reinforcement Learning for Unmanned Ship Control Mapping
In this section, we describe the method proposed in this study for simultaneously controlling the movement path of an unmanned ship and the path corner waiting. We first formulate the path corner waiting optimization problem as a reinforcement learning task by giving details of the state, action, and reward, as shown in Figure 2.

4.1. Action Space Definition
The agent must consider the current unmanned vessel signal and the selected action when switching between unmanned vessel signals in order to ensure safety.
When the control objects are a beacon and an unmanned ship movement path, we define an action space $\mathcal{A}_{\mathrm{signal}}$ for the path corner waiting as a whole and an action space $\mathcal{A}_{\mathrm{route}}$ for the unmanned ship movement path as a whole, e.g., $\mathcal{A}_{\mathrm{signal}}$ = {north-south, east-west} and $\mathcal{A}_{\mathrm{route}}$ = {routeA, routeB, routeC}, where the north-south (up and down) direction of the signal turns blue and the east-west (left and right) direction turns red.
The overall action space is defined as the product of the action space for path corner waiting and the action space for each unmanned ship present, so the size of the action space grows with the number of unmanned ships.
4.2. Path Indicator
We address the onerous task of learning a strategy, caused by the growth of the agent's action space with the number of unmanned vessels, by introducing a virtual device that we define as a path indicator.
By definition, a device that issues the same route indication to all controlled unmanned vessels present in a defined section is called a path indicator.
In addition, the “virtual” device referred to here does not necessarily need a physical counterpart, unlike the usual path corner waiting. For example, even if there is no device visible to a pilot, as there is in the case of path corner waiting, a remotely configured device can allow the unmanned vessel to receive information within a defined section of the waterway and act as a path indicator by giving the unmanned vessel the path it should take.
Based on the above discussion, the action space of the proposed solution is defined as the product of the action space of the path corner waiting and the action space of each path indicator present. In the case of the control object area defined in Figure 3, the action space of path indicator 1 is $\mathcal{A}_{1}$ = {indication 1, indication 2}, and the action space of path indicator 2 is $\mathcal{A}_{2}$ = {indication 3, indication 4}. The overall action space is therefore
$$\mathcal{A} = \mathcal{A}_{\mathrm{signal}} \times \mathcal{A}_{1} \times \mathcal{A}_{2}, \tag{3}$$
with a total number of actions $|\mathcal{A}| = |\mathcal{A}_{\mathrm{signal}}| \cdot |\mathcal{A}_{1}| \cdot |\mathcal{A}_{2}| = 2 \times 2 \times 2 = 8$. The definition of equation (3) thus allows the problem of simultaneously optimizing path corner waiting and moving paths to be solved as a practical problem.
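The following short sketch illustrates how such a joint action space can be built as a Cartesian product; the variable names are illustrative and correspond to the example sets given above.

```python
from itertools import product

# Example action spaces from this section (names are illustrative)
A_signal = ["north-south", "east-west"]        # path corner waiting (signal phase)
A_ind1 = ["indication 1", "indication 2"]      # path indicator 1
A_ind2 = ["indication 3", "indication 4"]      # path indicator 2

# Joint action space: the Cartesian product of the individual spaces
A = list(product(A_signal, A_ind1, A_ind2))
print(len(A))  # 2 * 2 * 2 = 8 joint actions
```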

5. Improved Reward Function
Deep Policy Gradient (DPG) methods attempt to optimize the policy iteratively by learning the parameters $\theta$ used to estimate the gradient of the policy's performance. A disadvantage of DPG methods is that they typically lead to high variance in the gradient estimate. This problem arises because the trajectories used in the gradient estimate are randomly sampled; that is, the variance of the policy's log-derivative weighted by the reward can be very high.
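For illustration, the following sketch shows a Monte Carlo (REINFORCE-style) policy-gradient estimate of the kind this paragraph refers to; because it is computed from a single randomly sampled trajectory, repeated estimates fluctuate, which is the variance problem described above. The function and argument names are assumptions, not code from the paper.

```python
import numpy as np

def reinforce_gradient(log_prob_grads, rewards, gamma=0.99):
    """Monte Carlo policy-gradient estimate for one sampled trajectory:
    grad ~= sum_t grad_theta(log pi(a_t | s_t)) * G_t, where G_t is the
    discounted return from step t onward.  Because the trajectory is
    randomly sampled, the estimator has high variance."""
    G = 0.0
    grad = np.zeros_like(log_prob_grads[0])
    for g_log, r in zip(reversed(log_prob_grads), reversed(rewards)):
        G = r + gamma * G          # discounted return from this step onward
        grad += g_log * G          # log-derivative weighted by the return
    return grad
```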
The reward is a scalar value $r_t$ obtained at each step after the agent executes an action, defined as follows:
$$r_{t} = D_{t-1} - D_{t},$$
where $D_{t-1}$ and $D_{t}$ are the total cumulative delays at the previous and current time steps, respectively.
We can also consider the flow of unmanned vessels in the channel by looking at the occupancy rate, defined as the percentage of the given channel that is full, and we can treat the number of unmanned vessels stopped in the channel as a frustrating phenomenon. Combining these quantities, we reward unmanned boat flow positively as follows:
$$r_{t} = \frac{\mathrm{occ}_{t}}{n_{t}^{\mathrm{halt}} + c},$$
where $\mathrm{occ}_{t}$ is the occupancy rate, $n_{t}^{\mathrm{halt}}$ is the number of stopped unmanned boats, and $c$ is a constant used to prevent division by zero. Intuitively, if the unmanned boats in the channel do not stop, we encourage traffic by giving a positive reward for a full channel.
Suppose we have a highway with a speed limit of 90 miles per hour and that all the unmanned ships in the graph on the left travel at 80 miles per hour. Then the delay for each unmanned boat is 10 miles per hour; since there are 3 unmanned boats in the left graph, the delay at time $t$ is $D_{t} = 30$. For the graph on the right, suppose that another unmanned boat enters the highway at time $t+1$ at a speed of 80 miles per hour. In this case, the delay at time $t+1$ is $D_{t+1} = 40$, and the second reward function can be evaluated from these quantities, assuming that each unmanned boat lane can accommodate 10 unmanned boats (so the occupancy is $4/10$).
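A minimal sketch of the two reward terms and the worked example above, assuming the delay reward is the decrease in total cumulative delay and the flow reward divides occupancy by the number of stopped boats plus a constant $c$; the exact constants and function names are illustrative assumptions.

```python
def delay_reward(prev_delay, curr_delay):
    """First reward: decrease in total cumulative delay, r = D_{t-1} - D_t."""
    return prev_delay - curr_delay

def flow_reward(occupancy, n_halted, c=1.0):
    """Second reward (sketch): reward a full channel in which no boats are
    stopped; the constant c prevents division by zero."""
    return occupancy / (n_halted + c)

# Worked example from the text: speed limit 90 mph, boats travel at 80 mph,
# so each boat contributes a delay of 10; 3 boats at time t, 4 at time t+1.
D_t, D_t1 = 3 * 10, 4 * 10
print(delay_reward(D_t, D_t1))                      # -10: delay grew, so negative
print(flow_reward(occupancy=4 / 10, n_halted=0))    # positive: no stopped boats
```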
6. Experiment
The starting position of the unmanned ship in the simulation scene is (0 m, 0 m), and the target point is (4.5 m, 4.5 m). The scene is surrounded by a wall with some random obstacles inside. The unmanned ship continuously learns and explores the scene according to the reinforcement learning algorithm proposed in this paper, and once the unmanned ship collides with an obstacle or reaches the target point, the whole scene is reset. The parameters of the simulation experiment include the learning rate, the discount rate, and related hyperparameters, with a total of 5,000 training episodes [26, 27].
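The following is a minimal sketch of the episodic training loop implied by this setup, where an episode ends (and the scene resets) on collision or arrival at the target; the `env` and `agent` interfaces are hypothetical placeholders, not the authors' implementation.

```python
def train(env, agent, episodes=5000):
    """Training loop matching the simulation setup: the scene resets whenever
    the unmanned ship collides with an obstacle or reaches the target point."""
    returns = []
    for episode in range(episodes):
        state = env.reset()                    # ship back to (0 m, 0 m)
        total_reward, done = 0.0, False
        while not done:
            action = agent.act(state)          # e.g., epsilon-greedy on current Q values
            next_state, reward, done = env.step(action)  # done on collision or goal
            agent.learn(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward
        returns.append(total_reward)
    return returns
```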
Figure 4 shows the path diagrams during training. Figures 4(a)–4(d) are the results of the 423rd, 1,566th, 3,532nd, and 4,879th training episodes, respectively. Because of the high random exploration probability in the early stage of training, the algorithm has not yet converged in Figure 4(a), and the unmanned boat collides with an obstacle.

Figure 4: Path diagrams at the (a) 423rd, (b) 1,566th, (c) 3,532nd, and (d) 4,879th training episodes.
Figure 5 shows the Q values obtained by the unmanned ship at each step during the above training process. Panels (a) to (d) in the figure correspond to (a) to (d) in Figure 4, respectively, and at each step the unmanned ship selects and executes the action with the largest Q value in the current state. In (b), (c), and (d), the unmanned boat successfully avoided the obstacles and reached the end point, especially in the late training period when the unmanned boat was approaching the end point. The expected future payoffs for these states were high, and selecting these actions resulted in higher payoffs, so the Q values were higher, in line with the behaviour a reinforcement learning method is expected to produce [28].

Figure 6 shows the average Q value of each selected action during the training process. It can be seen from Figure 6 that in the early stage of training, the average Q value is low because the random exploration probability is large and the unmanned ship has gained little knowledge and experience, so the number of times the unmanned ship reaches the target point is low. After 1,000 training episodes, the number of times the unmanned ship reaches the target point gradually increases and it gains positive rewards more often, so the average Q value gradually increases, i.e., the average cumulative reward obtained by the unmanned ship for the selected action becomes higher and higher. After that, as the number of training episodes increases, the unmanned ship reaches the target point more and more often, the reward increases, the Q value gradually increases, and the algorithm finally converges.

Figure 7 illustrates the cumulative payoffs ($r$ and $r_{\mathrm{avg}}$ are the payoff and average payoff, respectively) during the training process. The highest payoff value of 1 is obtained when the unmanned ship reaches the end point, the lowest payoff value of -1 is obtained when a collision occurs, and a corresponding positive payoff is obtained when the unmanned ship moves towards the target point.

7. Conclusions
In this paper, we use deep reinforcement learning to solve the optimization problem in unmanned boat path planning and management; specifically, we take the timing of path corner waiting as the optimization objective to minimize the total travel time of an unmanned boat crossing the path. In the experiments, quantitative and qualitative results of deep reinforcement learning on unmanned ship travel and path corner waiting are reported to verify the effectiveness of our solution.
Data Availability
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
Conflicts of Interest
The authors declare that they have no conflicts of interest regarding this work.