Abstract

Unmanned surface vehicles (USVs) are robotic systems with autonomous planning, driving, and navigation capabilities. As applications continue to develop, the missions faced by USVs are becoming increasingly complex, and a single USV often cannot meet the mission requirements. Compared with a single USV, a multi-USV system has outstanding advantages such as fewer perceptual constraints, larger operation ranges, and stronger operation capability. For the mission of searching for multiple stationary underwater targets with a multi-USV system in an environment with obstacles, we propose a novel cooperative search algorithm (CSBDRL) based on the reinforcement learning (RL) method and the probability map method. CSBDRL is composed of an environmental sense module and a policy module, which are organized by a “divide and conquer” policy-based architecture. The environmental sense module focuses on providing environmental sense values by using the probability map method. The policy module focuses on learning the optimal policy by using the RL method. In CSBDRL, the mission environment is modeled and the corresponding reward function is designed to explore the environment and learn policies effectively. We test CSBDRL in a simulation environment and compare it with other methods. The results show that, compared with other methods, CSBDRL gives the multi-USV system higher search efficiency, ensuring that targets are found more quickly and accurately while the USVs avoid obstacles in time during the mission.

1. Introduction

An unmanned surface vehicle (USV) is a robotic system with capabilities of autonomous planning, driving, and navigation [1]. It can be carried and deployed to the mission area via shore-based transportation or large vessels to independently complete missions such as environmental monitoring [2–4], sampling [5, 6], search, communication [7, 8], harbor protection, and patrol [9–12]. Among these, the search mission is one of the most suitable missions for USVs. Generally speaking, mission areas are specified according to the mission requirements and environmental conditions (such as wind, current, and obstacles). The mission areas are usually large, and a manned vessel needs a long time to search the whole area, so the crew must stay on the vessel for a long time. Long-term offshore operation is difficult, inconvenient, and dangerous for people. However, due to its own characteristics, a USV can perform a search mission autonomously for a long time in the mission area without manual intervention, which not only greatly improves the search efficiency but also greatly reduces the work intensity and risk.

Nevertheless, as most USVs are smaller than manned vessels, the space on a USV is limited and it cannot be equipped with some high-power observation sensors, so its sensing range is limited, which means that the search efficiency of a USV per unit time is low. In addition, once a USV suffers an unforeseen accident during the mission (device failure, unavoidable obstacles, strong electromagnetic interference, etc.), the mission must be suspended, which is a serious drawback in urgent situations (such as searching for victims or mine countermeasures).

Although improving the performance of a single USV can alleviate the above problems to a certain extent, no matter how much the performance of a single USV is improved, its search efficiency is far lower than that of a multi-USV system. A multi-USV system can enhance robustness and reliability and provide different policies for various types of missions. In such a system, the USVs can transmit information to each other and adjust the mission plan at any time according to the progress of the mission. Once one of the USVs fails to continue working, other USVs can quickly replace it to ensure that the mission is not interrupted during execution.

Although the use of multi-USV systems in search missions is still rare, scholars have conducted considerable research on multiagent search missions. A common approach is to divide the search region into cells, represent it with a probability map, and update the map through Bayesian rules. Millet et al. [13] propose a distributed search algorithm that includes both a map update and a fusion procedure. Based on this algorithm, Hu et al. [14] design an algorithm for a cooperative target search mission called coverage control path planning. The main advantage of this method is that it can directly calculate the optimal direction of the agent at the next moment, which improves the search efficiency. However, this method also has disadvantages: the number of agents cannot be too small, it cannot avoid obstacles, and it easily falls into locally stable configurations. Apart from this, recent studies have also explored new methods for multiagent cooperative search problems [15, 16]. At the present stage, the control of multi-USV systems still faces many challenges; therefore, more advanced control methods are required to enhance the collaboration capability of the USVs.

In recent years, the continuous development of RL methods has provided new options for solving the target search problem. The essence of RL is to learn a policy from the interaction between the agent and the environment. To overcome the problems of area coverage control, Adepegba et al. [17] combine an RL algorithm with the Voronoi-based coverage control algorithm, using the RL algorithm to approximate the control law. Zhao et al. [18] develop a flocking control framework based on the DDPG algorithm, which uses a centralized training and distributed execution framework; it differs from this paper in both the mission and the training framework.

In this paper, a novel cooperative search learning algorithm (CSBDRL) based on the RL method and the probability map method is proposed for the search mission for multiple stationary underwater targets by a multi-USV system in an environment with obstacles. The proposed algorithm is composed of an environmental sense module and a policy module, which are organized by a “divide and conquer” policy-based architecture. The environmental sense module focuses on providing environmental sense values. The policy module focuses on how to learn the optimal policy. In CSBDRL, the mission environment is modeled and the corresponding reward function is designed to explore the environment and learn policies effectively. We test CSBDRL in a simulation environment and compare it with other algorithms. The results show that, compared with other algorithms, CSBDRL gives the multi-USV system higher search efficiency, ensuring that targets are found more quickly and accurately while the USVs avoid obstacles in time during the mission.

The rest of this paper is organized as follows. The related background is reviewed in Section 2. In Sections 3–5, we introduce the architecture and each part of the algorithm in detail. The simulation is carried out in Section 6. Finally, the conclusion and future work are given in Section 7.

2. Background

2.1. Multi-USV Cooperation

Compared with a single USV, a multi-USV system can expand the sensing range and has wider applications in ocean observation, autonomous sampling, search, and other missions [19, 20]. The US Navy completed an underwater mine clearance mission in Iraq using a multi-USV system. The system took just 16 hours to complete the whole mission, which was originally planned to take 21 days. Throughout the mission, the search area of the system reached 2.5 × 10^6 m^2, which successfully improved the mission efficiency and minimized casualties. Arrichiello et al. [21] develop two USVs for a tracking mission. They then add another four USVs to form a multi-USV system that has obstacle avoidance capability and follows international navigation traffic rules, greatly improving the mission accuracy. Joseph et al. [22–24] use four low-cost USVs to complete missions such as port monitoring and environmental monitoring. The Portuguese Institute for Systems and Robotics (IST-ISR) [25, 26] has carried out a large number of experiments on the motion control of multi-USV systems under different scenarios, and the major research topics include collision avoidance in dynamic environments, time-varying communication topology, and the effectiveness of the control method.

2.2. RL Method

RL is a kind of learning algorithm that maps from environment states to actions, and the goal is to make the agent obtain the maximum accumulated reward in its interaction with the environment [27]. The Markov decision process (MDP) can be used to model RL problems. An MDP is usually defined as a four-tuple (S, A, P, R), where S is the set of all environmental states, and s_t ∈ S indicates the state of the agent at time t. A is the set of actions that the agent can execute, and a_t ∈ A indicates the action taken by the agent at time t. P is the probability distribution function of state transitions: P(s_{t+1} | s_t, a_t) indicates the probability that the agent performs action a_t in state s_t and moves to the next state s_{t+1}. R is the reward function: r_t = R(s_t, a_t) represents the immediate reward obtained by the agent when performing action a_t in state s_t.

The Q-learning algorithm [28] is one of the most widely used RL algorithms. It learns the optimal policy by calculating the state-action values of the different actions taken by the agent in each state. In the Q-learning algorithm, the iterative update in formula (1) is the core of the algorithm:
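Up to notation (we use α for the learning rate and γ for the discount factor, symbols introduced here for clarity), this iteration is the familiar Q-learning update:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \Big].$$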

However, in practical problems, a large state space makes the computational cost of formula (1) too high to solve for the optimal policy. To solve this problem, Mnih et al. [29, 30] propose the deep Q-network (DQN) algorithm by combining a convolutional neural network with the Q-learning algorithm. The DQN algorithm has shown a level comparable to human players in solving complex problems similar to real environments, such as Atari 2600 games, and in some uncomplicated nonstrategic games it outperforms experienced human players. On this basis, several improved DQN algorithms have been proposed. Van Hasselt et al. [31] propose the deep double Q-network (DDQN) algorithm based on the double Q-learning algorithm [32]. Bellemare et al. [33] propose an advantage DQN algorithm based on advantage learning [34]; it increases the gap between the optimal and suboptimal action values to alleviate the evaluation error caused by always selecting the action with the maximum Q value in the next state. Lakshminarayanan et al. [35] propose dynamic frame skipping DQN (DFDQN), which uses dynamic frame skipping instead of repeating the chosen action a fixed number of times at each step.
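For reference, the key idea of DDQN [31] is to decouple action selection from action evaluation; writing θ for the online network parameters and θ⁻ for the target network parameters (our notation), the learning target becomes

$$y_t = r_t + \gamma\, Q\big(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta);\ \theta^{-}\big),$$

which reduces the overestimation of Q values relative to the standard DQN target.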

2.3. Probability Map Model

The probability map model uses graphs to represent the joint probability distribution of the variables related to the model. Howard et al. first put forward the concept of the probability map model. Later, through the efforts of Heckerman et al. [36], the probability map model was developed to a great extent. At present, the probability map model is a major method for dealing with uncertain data and knowledge in the field of artificial intelligence. When the probability map model is applied to target search, the whole search area is divided into cells and each cell is associated with the probability or confidence level that a target is present in it, thus forming a probability map of the whole area [37–40].

An online planning and control algorithm for cooperative search by UAVs is proposed in [41]. In this algorithm, each agent keeps a separate probability map of the whole region and updates the map according to the Dempster–Shafer theory. Millet et al. [42] develop a completely decentralized search algorithm that does not require full connectivity: every agent uses Bayesian rules to update its probability map from its own observations and then fuses it with the probability maps of its neighboring agents. Tian et al. [43] propose a cooperative search algorithm combining a genetic algorithm (GA) and model predictive control (MPC) to solve the search problem in an uncertain environment. This algorithm uses a probability map to describe the uncertainty of the mission area, takes the information gain as the optimization objective, and finally uses the GA to solve for the optimal control input.

3. Architecture

3.1. Control Architecture

A multi-USV system mainly has two control structures: centralized control and distributed control [44]. Centralized control takes the overall performance of the system as the target value and controls all USVs through a central node. Compared with centralized control, distributed control has no central node, and each USV plans its next action according to its own state to maximize the reward function; information exchange is used to compensate for the limited observation capability of a single USV. According to the position distribution of the USVs in the system, this paper adopts the distributed control structure shown in Figure 1, which enhances the robustness of the system and makes the system easy to expand.

3.2. Algorithm Architecture

CSBDRL is composed of environmental sense module and policy module, which are organized by the “divide and conquer” policy-based architecture. The environmental sense module focuses on providing environmental sense values which are transmitted to the policy module. The policy module focuses on how to use the environmental sense values that are obtained by the environmental sense module to learn the optimal policy. The architecture diagram of the CSBDRL is shown in Figure 2.

4. Environmental Sense Module

The responsibility of the environmental sense module is to generate the environmental sense values, which directly affect whether the policy module can learn an effective policy. Each environmental sense value consists of four parts: target information, collaborative information, obstacle information, and collision avoidance information. (In the following discussion, we mainly focus on the policy of the USV in the search process rather than the underlying motion control of the USV platform, so the USV is treated as a particle in the mission.)

4.1. Target Information

Target information models the uncertain target distribution in a distributed environment. To this end, a modeling method based on a probability map is widely used. As the mission progresses, the probability map is dynamically updated according to the corresponding update rules, so it always reflects the USV's latest understanding of the mission situation, and the USV can determine its next action based on this understanding. When generating the probability map model, we assume that there are N USVs in the multi-USV system. Each USV moves in the mission area E, which is a rectangular area with length L_x and width L_y. As shown in Figure 3(a), we use the top left corner of E as the origin to create a coordinate system, and E is partitioned into M cells. The coordinates of the center position of each cell c_mn are expressed as (x_m, y_n), the total number of cells is M, and θ_mn = 1 and θ_mn = 0 indicate the presence or absence of a target in cell c_mn, respectively. The coordinate of USV_i in the mission area at time t is described as q_i(t). We model each cell as a Bernoulli distribution, i.e., θ_mn = 1 with probability P_mn and θ_mn = 0 with probability 1 − P_mn. Due to the limited observation capability of the USV, USV_i at time t can only sample cells in the sensing region D_i(t) = {c_mn : ‖c_mn − q_i(t)‖ ≤ R_s}, which is defined by the sensing radius R_s, where ‖·‖ denotes the 2-norm for vectors. A cell is considered to be completely within D_i(t) when the coordinates of its center position are located in D_i(t). The sampling result of USV_i at time t for cell c_mn is represented by Y^i_mn(t) ∈ {0, 1}, where Y^i_mn(t) = 1 indicates that a target is detected and Y^i_mn(t) = 0 indicates that no target is detected. Accordingly, p_d (detection probability) and p_f (false alarm probability) are used to model the sampling process. In summary, each USV_i maintains an individual probability map P^i_mn(t), which denotes the estimate by USV_i at time t of the probability that a target exists within cell c_mn. The commonly used method of updating the probability map from measurements is based on the Bayesian rule [13], which is given as follows (the initial value of P^i_mn is set to 0.5, meaning no prior information, and is updated during the search process):
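In the notation introduced above, the Bayesian update of formula (2) can be sketched, for every cell c_mn inside the sensing region D_i(t), as

$$P^{i}_{mn}(t+1) = \frac{p\big(Y^{i}_{mn}(t) \mid \theta_{mn}=1\big)\, P^{i}_{mn}(t)}{p\big(Y^{i}_{mn}(t) \mid \theta_{mn}=1\big)\, P^{i}_{mn}(t) + p\big(Y^{i}_{mn}(t) \mid \theta_{mn}=0\big)\big(1 - P^{i}_{mn}(t)\big)},$$

with P^i_mn(t+1) = P^i_mn(t) for cells outside D_i(t); this is the standard form used in [13], and the exact grouping in the original expression may differ.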

When P(Y^i_mn(t) = 1 | θ_mn = 1) = p_d and P(Y^i_mn(t) = 1 | θ_mn = 0) = p_f, formula (2) becomes
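Under this sensor model, a sketch of the resulting update for cells in D_i(t) (the case-by-case grouping is ours) is

$$P^{i}_{mn}(t+1) =
\begin{cases}
\dfrac{p_d\, P^{i}_{mn}(t)}{p_d\, P^{i}_{mn}(t) + p_f\,\big(1 - P^{i}_{mn}(t)\big)}, & Y^{i}_{mn}(t) = 1,\\[2ex]
\dfrac{(1 - p_d)\, P^{i}_{mn}(t)}{(1 - p_d)\, P^{i}_{mn}(t) + (1 - p_f)\,\big(1 - P^{i}_{mn}(t)\big)}, & Y^{i}_{mn}(t) = 0.
\end{cases}$$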

To make the calculation more efficient, we use a nonlinear transformation W^i_mn(t) of P^i_mn(t) instead of P^i_mn(t) itself. Under this transformation, the probability map updating formula can be simplified to a linear form whose constant coefficients are determined by p_d (the detection probability) and p_f (the false alarm probability).
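A common choice for such a transformation is the log-odds form, under which the Bayesian update becomes linear; with our symbols ν and μ for the resulting constants, a sketch of formulas (4) and (5) is

$$W^{i}_{mn}(t) = \ln\frac{1 - P^{i}_{mn}(t)}{P^{i}_{mn}(t)}, \qquad
W^{i}_{mn}(t+1) = W^{i}_{mn}(t) + \nu\, Y^{i}_{mn}(t) + \mu\,\big(1 - Y^{i}_{mn}(t)\big),$$

where ν = ln(p_f / p_d), μ = ln((1 − p_f) / (1 − p_d)), and W^i_mn(t+1) = W^i_mn(t) for cells outside D_i(t). Since p_d > p_f, a detection (Y = 1) decreases W (the cell is more likely to contain a target) and a nondetection increases it.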

A related proof of convergence is given in [14]. Compared with formula (2), formula (5) converts the nonlinear function into a linear function, which simplifies the calculation, and formula (4) can be inverted uniquely whenever the probabilities are needed. In order to make the USV obtain global information in a short time, after a USV updates its probability map, it fuses the map with those of its neighbors using a weighted combination in which the weights are nonnegative and sum to 1 over the |N_i(t)| neighbors. The target information of USV_i is a local w-map centered on the coordinates (x_i, y_i) of the USV, denoted as T_i(t), where x_i and y_i denote the x-coordinate and y-coordinate of USV_i, respectively. An illustration of T_i(t) is shown in Figure 3(b).
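For the fusion step described above, a consensus-style sketch (the equal weights are our assumption) is

$$W^{i}_{mn}(t^{+}) = \sum_{j \in N_i(t)} a_{ij}\, W^{j}_{mn}(t), \qquad a_{ij} \ge 0, \quad \sum_{j \in N_i(t)} a_{ij} = 1, \quad \text{e.g., } a_{ij} = \frac{1}{|N_i(t)|},$$

where N_i(t) is the neighbor set defined in Section 4.2 and |N_i(t)| is the number of neighbors.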

4.2. Collaborative Information

The purpose of the collaborative information is to make the USV adjust its search area according to the positions of the other USVs, thereby preventing multiple USVs from repeatedly searching the same area. A USV's collaborative information is related to its communication capability, which is limited, so it can only interact with other USVs within a communication radius R_c. The USVs that can interact with USV_i at time t are called its neighbors, defined as N_i(t) (including USV_i itself). Collaborative information is represented by a c-map, which maps the coordinates of the neighbors within the communication range onto a matrix whose size is determined by the communication radius R_c. The collaborative map of USV_i at time t is denoted as C_i(t), which is the superposition of Gaussian distributions centered on the coordinates of the neighbors.
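As an illustration of how such a c-map can be rasterized, a minimal Python sketch follows; the map size 2R_c + 1, the Gaussian width sigma, and the function name are assumptions for illustration, not the paper's implementation.

import numpy as np

def build_c_map(own_pos, neighbor_positions, R_c, sigma=2.0):
    """Superpose Gaussian bumps centered on the neighbors onto a local grid.

    own_pos and neighbor_positions hold (x, y) cell coordinates; the returned
    map is a (2 * R_c + 1) x (2 * R_c + 1) array centered on the USV itself.
    """
    size = 2 * R_c + 1
    c_map = np.zeros((size, size))
    ys, xs = np.mgrid[0:size, 0:size]              # local grid indices
    for nx, ny in neighbor_positions:
        lx = nx - own_pos[0] + R_c                 # neighbor in the local frame
        ly = ny - own_pos[1] + R_c
        c_map += np.exp(-((xs - lx) ** 2 + (ys - ly) ** 2) / (2 * sigma ** 2))
    return c_map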

4.3. Obstacle Information

The purpose of the obstacle information is to make the USV evade local obstacles (other USVs are not included) during the mission. Obstacle information is represented by an o-map, which maps the coordinates of the obstacles within the USV's obstacle avoidance range onto a matrix whose size is determined by the obstacle avoidance radius R_o. The obstacle map of USV_i at time t is denoted as O_i(t).

4.4. Collision Avoidance Information

The purpose of the collision avoidance information is to make the USV evade other USVs during the mission. Collision avoidance information is represented by a collision map, which maps the coordinates of the USVs within the USV's collision avoidance range onto a matrix whose size is determined by the dangerous radius of collision R_d. The collision map of USV_i at time t is denoted as V_i(t).

5. Policy Module

The responsibility of the policy module is to learn the optimal policy. In the search mission, we must find a policy that improves the target search efficiency and accuracy of the multi-USV system on the premise of the USVs' navigation safety. Therefore, the policy learned by the policy module consists of two parts: a collision avoidance policy and a cooperative search policy.

5.1. Action Definition

A policy is a mapping from states to actions, so we first define the USV's actions. The action range of the USV is discretized. For example, when the degree of discretization is 8 and the maximum turning angle is θ_max, the action range of the USV can be expressed as in Figure 4(a). Denote the position of USV_i at time t by q_i(t) and the set of its possible positions at time t + 1 by Q_i(t + 1); due to the limitation of the USV's maneuverability, the heading change between consecutive steps must not exceed θ_max, as shown in Figure 4(b). For instance, if the moving direction of USV_i at time t is east, then with a degree of discretization of 8 the admissible actions at time t + 1 are the discretized headings whose deviation from east does not exceed θ_max.
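A minimal Python sketch of this restriction follows (the 8-way discretization matches the example above; the mapping of action indices to headings and the helper names are our assumptions):

import numpy as np

# Eight discretized headings in degrees, counted counterclockwise from east,
# corresponding to action indices 0-7.
HEADINGS = np.arange(8) * 45.0

def admissible_actions(current_heading_deg, theta_max_deg):
    """Return the action indices whose heading deviates from the current
    heading by no more than the maximum turning angle theta_max."""
    diff = np.abs((HEADINGS - current_heading_deg + 180.0) % 360.0 - 180.0)
    return np.nonzero(diff <= theta_max_deg)[0].tolist()

# Example: a USV heading east (0 degrees) with theta_max = 90 degrees may
# keep going east or turn up to 90 degrees to either side.
print(admissible_actions(0.0, 90.0))   # -> [0, 1, 2, 6, 7]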

5.2. Collision Avoidance

The collision avoidance policy refers to the USV's position transfer policy when it finds other USVs within the danger radius R_d. As shown in Figure 5, assume that USV_j appears within the danger radius of USV_i at time t, and let α_ij denote the orientation of USV_j relative to USV_i. The discretized action heading closest to the azimuth of the line connecting USV_i and USV_j is taken as the relative direction of USV_j with respect to USV_i. We stipulate that the collision avoidance policy is to choose the opposite of this relative direction, i.e., the relative direction rotated by 180°.

At the same time, considering that the action of USV_i is restricted by the maximum turning angle θ_max, it may not be able to move directly in the avoidance direction. Since the next transfer position of the USV must be admissible under this turning constraint, we choose the admissible position whose heading deviates least from the avoidance direction as the next transfer position, where the deviation is calculated by formula (9).

When several USVs appear within the dangerous radius of USV_i at time t, the admissible position that minimizes the sum of these deviations over all intruding USVs is selected as the next transfer position, where the deviation is again determined by formula (9). Figure 5 is a schematic diagram of the collision avoidance policy.
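A Python sketch of this selection rule follows, reusing HEADINGS and admissible_actions from the sketch in Section 5.1; the function and variable names are hypothetical, and the geometry is simplified to the particle model assumed above.

import numpy as np

def ang_diff(a, b):
    """Smallest absolute difference between two headings in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

def avoidance_action(own_pos, own_heading_deg, intruder_positions, theta_max_deg):
    """Among the admissible actions, choose the heading that deviates least
    (summed over all intruders) from the direction opposite to each
    intruder's bearing."""
    candidates = admissible_actions(own_heading_deg, theta_max_deg)
    escape_dirs = []
    for px, py in intruder_positions:
        bearing = np.degrees(np.arctan2(py - own_pos[1], px - own_pos[0]))
        escape_dirs.append((bearing + 180.0) % 360.0)     # opposite direction
    return min(candidates,
               key=lambda a: sum(ang_diff(HEADINGS[a], d) for d in escape_dirs))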

5.3. Cooperative Search Policy

As shown in Figure 6, when no other USVs are detected within the danger radius R_d, the USV needs to follow a search policy that cooperates with the other USVs and reduces the global uncertainty to find the targets as quickly as possible. In order to learn the optimal cooperative search policy, we use an RL algorithm. DDQN is a classic value-based RL algorithm and can easily be integrated with the environment model described above.

The state space includes the state of each USV_i, which consists of the target information T_i(t), collaborative information C_i(t), obstacle information O_i(t), and collision avoidance information V_i(t). The action space of the USV is {0, 1, 2, 3, 4, 5, 6, 7}, where each number represents an action as defined in Section 5.1. The reward function is defined in Section 5.4. The USV's states are input into the DDQN algorithm, which then chooses appropriate actions to generate a large number of samples and stores them in the replay buffer. When enough samples have accumulated in the replay buffer, the DDQN algorithm randomly extracts minibatches from the replay buffer and uses them to learn the cooperative policy. Algorithm 1 gives the training process of the DDQN algorithm.

(1) Initialize the Q-network for the USV and the replay buffer D
(2) for each training episode do
(3)  Initialize the environment, the state, and the time t = 0
(4)  while not (t > T_max or all targets are found) do
(5)   for each USV_i, i = 1, …, N do
(6)    Receive observation o_i(t)
(7)    Select action a_i(t) according to the ε-greedy policy
(8)    Execute action a_i(t), receive reward r_i(t), and reach the next state
(9)    Get observation o_i(t + 1)
(10)    Store transition (o_i(t), a_i(t), r_i(t), o_i(t + 1)) in D if USV_i is being trained
(11)   end
(12)   Sample a random minibatch of transitions from D
(13)   Perform a gradient descent step on the Q-network loss
(14)   Update the time t = t + 1
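As a complement to step (13) of Algorithm 1, the following is a minimal sketch of one double-DQN gradient step; the deep learning framework (PyTorch), the Huber loss, and the tensor layout of the minibatch are illustrative assumptions rather than the settings used in the paper.

import torch
import torch.nn as nn

def ddqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on a minibatch of transitions (s, a, r, s', done)."""
    states, actions, rewards, next_states, dones = batch

    # The online network selects the greedy action in the next state ...
    next_actions = q_net(next_states).argmax(dim=1, keepdim=True)
    # ... while the target network evaluates it (the "double" part of DDQN).
    next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
    targets = rewards + gamma * (1.0 - dones.float()) * next_q.detach()

    # Q-values of the actions actually taken in the minibatch.
    q_values = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    loss = nn.functional.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()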
5.4. Reward Function

The reward function is the direct interface between the agent and the environment. In the search mission, whether the reward function can provide appropriate reward values to the USV according to the USV's state and behavior directly determines whether the algorithm can guide the USV to explore and learn efficiently. The reward function designed here consists of four parts, namely, the target reward r^tar_i(t), the time consumption reward r^time(t), the guiding reward r^gui_i(t), and the obstacle reward r^obs_i(t). Therefore, the reward function of USV_i at time t is the combination of these four terms.
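Written out explicitly as a sketch (an unweighted sum; the original composition may include additional weighting), the reward of USV_i at time t is

$$r_i(t) = r^{tar}_i(t) + r^{time}(t) + r^{gui}_i(t) + r^{obs}_i(t).$$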

5.4.1. Target Reward

The target reward r^tar_i(t) is obtained when the USV accurately determines the location of a target; it encourages the USV to discover as many targets as possible while ensuring a certain accuracy. Assuming that there are m targets in the mission area, we stipulate that when the estimated probability of a cell exceeds a threshold P_th, the USV has determined the location of a target. The USV obtains a positive reward ω_1 when a target is found; if the multi-USV system correctly determines the locations of all targets within an episode, each USV receives an additional target reward ω_2.

ω_1 and ω_2 are weight coefficients of the two parts of the reward and are set empirically to 0.5 and 5.0, respectively. Finally, the target reward of USV_i at time t is composed of these two terms.
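A plausible sketch of the target reward, consistent with the description above (the indicator form is our assumption; only the weights 0.5 and 5.0 are taken from the text), is

$$r^{tar}_i(t) = \omega_1\, \mathbb{1}\{\text{USV}_i \text{ confirms a new target at time } t\} + \omega_2\, \mathbb{1}\{\text{all } m \text{ targets are confirmed in the episode}\},$$

with ω_1 = 0.5 and ω_2 = 5.0.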

5.4.2. Time Reward

In order to optimize efficiency, we set up a time reward to encourage the USV to find the targets in a shorter time. r^time(t) is designed as a piecewise linear function of the episode step t, where k_1 and k_2 are additional weight coefficients for the corresponding segments and are normally set to 1, t_1 and t_2 are preset segmentation points, and T_max is the preset maximum number of episode steps at which an episode is forced to end. ω_time is a weight coefficient, which is set to 0.01. Obviously, the time consumption reward is the same for all USVs in the system.
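One plausible realization of such a piecewise linear penalty, consistent with the quantities named above (the segment slopes and the sign convention are our assumptions; only ω_time = 0.01, the breakpoints t_1 and t_2, and T_max come from the text), is

$$r^{time}(t) = -\,\omega_{time}\cdot
\begin{cases}
t, & 0 \le t \le t_1,\\
t_1 + k_1\,(t - t_1), & t_1 < t \le t_2,\\
t_1 + k_1\,(t_2 - t_1) + k_2\,(t - t_2), & t_2 < t \le T_{max},
\end{cases}$$

so that the penalty grows with the number of elapsed steps and the weights k_1 and k_2 adjust how quickly it grows in the later segments.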

5.4.3. Guiding Reward

Since the rewards mentioned above are too sparse in the early stage of training to provide effective guidance for the USV, in order to give the USV better exploration ability, we use the reduction of the global uncertainty as a dense guiding reward. The global uncertainty of USV_i is obtained by summing the uncertainty of all cells of its map, where the cell uncertainty involves a constant coefficient κ that is set to 2.0. The guiding reward r^gui_i(t) is then proportional to the decrease of this global uncertainty between consecutive steps, where ω_gui is a weight coefficient that is set to 1.0.
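A sketch consistent with this description (the exponential form of the cell uncertainty is our assumption; only κ = 2.0 and ω_gui = 1.0 are taken from the text) is

$$\chi^{i}_{mn}(t) = e^{-\kappa\,\left|W^{i}_{mn}(t)\right|}, \qquad X_i(t) = \sum_{m,n} \chi^{i}_{mn}(t), \qquad r^{gui}_i(t) = \omega_{gui}\,\big(X_i(t) - X_i(t+1)\big),$$

so that a cell about which nothing is known (P = 0.5, W = 0) has uncertainty 1, and the USV is rewarded in proportion to how much global uncertainty its latest observations remove.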

5.4.4. Obstacle Reward

The obstacle reward is used to guide the USV to avoid local obstacles. Unlike the rewards above, its value is always nonpositive. The magnitude of the obstacle reward is related to the average distance between the USV and the detected obstacles: when USV_i does not detect any obstacle at time t, the obstacle reward is 0; otherwise it becomes more negative as the average distance to the detected obstacles decreases. Here obstacle_n represents the coordinates of the nth detected obstacle, and κ_obs is a constant coefficient that is usually set to 100.
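A sketch consistent with this description (the reciprocal dependence on the average distance is our assumption; only the coefficient value 100 is taken from the text) is

$$r^{obs}_i(t) =
\begin{cases}
-\,\dfrac{\kappa_{obs}}{\frac{1}{N_o}\sum_{n=1}^{N_o}\left\| q_i(t) - \text{obstacle}_n \right\|}, & N_o > 0,\\[2ex]
0, & N_o = 0,
\end{cases}$$

where N_o is the number of obstacles detected within the obstacle avoidance radius R_o at time t and κ_obs = 100.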

6. Simulation Test

We use Python 3.5 to build a simulation environment on a computer equipped with an i7-8700 CPU and a GTX 1080 Ti graphics card. The numbers of USVs and obstacles can be adjusted according to the needs of the test. The sensing radius is set to 7, the communication radius to 9, the danger radius to 5, and the obstacle detection radius to 3. The detection probability and false alarm probability are set to 0.9 and 0.3, respectively. The initial coordinates of the USVs and obstacles are random. The total number of training steps is 1e7. During training, each USV in the system repeatedly samples, moves, and fuses its map. Figure 7 shows the reward of the USV trained with CSBDRL under different training frequencies.

In the tests after training, we stipulate that an episode ends when the system determines the locations of all targets, when any USV in the system collides with any obstacle, or when the search time exceeds 500 steps (the search time is measured by the total number of steps of the multi-USV system). Each test is performed for 500 episodes, and the results are measured by three parameters: the search time of the system, the search accuracy of the system (the number of episodes in which the system correctly determines the locations of all targets divided by the total number of episodes), and the collision rate of the system (the number of episodes in which a collision occurs divided by the total number of episodes). The test parameters are set as shown in Table 1. Figure 8 shows the visual interface of the test process, where USVs are represented by triangles, obstacles by red rectangles, and the finally determined targets by black rectangles. As can be seen from Figure 8(b), the USVs in the system can avoid obstacles during the search.

Tests 1, 2, and 3 are used to evaluate the impact of the number of USVs on the mission. It can be seen that when the number of obstacles is constant, the search time gradually decreases as the number of USVs increases, showing that CSBDRL effectively enables the USVs to learn a cooperative search policy; thus, increasing the number of USVs increases the search efficiency. In tests 1, 2, and 3, the search accuracy of the system is above 99% and the collision rate is 0, which shows that the algorithm can effectively find targets in the mission area while ensuring navigation safety.

Tests 4, 5, and 6 are used to evaluate the impact of the obstacles on the system's collision avoidance capability. It can be seen that when the number of USVs is constant, the search time gradually increases with the number of obstacles, because the USVs spend more time evading them. However, the increase in obstacles does not affect the search accuracy of the system: in tests 4, 5, and 6, the search accuracy is still higher than 99%. When the number of obstacles reaches 5, collisions between USVs and obstacles occur, but the collision rate throughout the test is only 0.2%.

In order to better reflect the performance of CSBDRL, we compare the algorithm with a random algorithm and the coverage control algorithm. In the comparison test, there are no obstacles in the mission area because the random algorithm and the coverage control algorithm do not have an obstacle avoidance function. We again run 500 episodes for every test. As can be seen from Table 2, CSBDRL has the highest search accuracy and the shortest search time, followed by the coverage control algorithm, while the random algorithm has the lowest search accuracy and the longest search time. When the number of targets increases, the search accuracy of CSBDRL does not change significantly, and the average search accuracy exceeds 99%; the search accuracy of the coverage control algorithm decreases, and the random algorithm performs poorly, with its search accuracy decreasing significantly. In addition, in order to compare convergence performance, we measure the convergence rate of the uncertainty for the different algorithms. Figure 9 shows the decreasing curves of the uncertainty for the different algorithms. As can be seen from Figure 9, the uncertainty of CSBDRL converges within 200 steps, which is a great advantage over the random algorithm and the coverage control algorithm.

7. Conclusion

In this paper, we have studied cooperative underwater target search by a group of USVs. A novel cooperative search learning algorithm (CSBDRL) based on the RL method and the probability map method is proposed for the search mission for multiple stationary underwater targets by a multi-USV system in an environment with obstacles. The algorithm uses a “divide and conquer” policy-based architecture. First, the environmental sense values of the system are produced by the environmental sense module based on the probability map method, which is simplified to a linear update and fused among neighbors. Then, based on these values, the policy module learns the corresponding policy using the RL method. We test CSBDRL in a simulation environment and compare it with other algorithms. The results show that CSBDRL gives the multi-USV system better search efficiency, ensuring that the targets are found more quickly and accurately in the mission area while guaranteeing the navigation safety of the USVs in the system. In future research, we will consider the dynamics of the USV and the influence of environmental disturbances (such as current and wind) on the USV, and improve the algorithm to enable the multi-USV system to search for dynamic targets.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (61403245).