Abstract
This paper proposes a goal-directed locomotion method based on path-integral reinforcement learning for a snake-shaped robot in a complex 3D environment. The method uses a model-free online Q-learning algorithm to evaluate action strategies and optimizes decision-making through repeated “exploration-learning-utilization” cycles so that the snake-shaped robot can perform goal-directed locomotion in a complex 3D environment. Proper locomotion control parameters, such as joint angles and screw-drive velocities, are learned by path-integral reinforcement learning, and the learned parameters are successfully transferred to the snake-shaped robot. Simulation results show that the planned path avoids all obstacles and reaches the destination smoothly and swiftly.
1. Introduction
Research on snake-shaped robots has been carried out for several decades [1–3]. This is because snake-shaped robots can not only be used as experimental platforms [4, 5] but can also be employed for search-and-rescue operations, such as mine disaster rescue and the search for and detection of people buried after an earthquake [6]. Snake-shaped robots are multimodule, serially connected robots with no fixed base that imitate a snake’s body structure and mode of movement.
Generally speaking, the snake-shaped robot consists of several segments connected in series, as shown in Figure 1, and its main form of locomotion is undulation, which imitates the movement of real snakes [7, 8]. In this paper, the snake-shaped robot is composed of screw-drive units with passive wheels, and the screw-drive units are serially connected through active joints. With this structure, the snake-shaped robot can move in any direction by appropriately adjusting the angular velocities of the screw-drive units.

The problem of robot path planning in an unknown environment has attracted increasing attention from researchers. The main research methods include the genetic algorithm, the gradient algorithm, the nonlinear mapping algorithm, the linear mapping algorithm, and so on [9, 10]. These traditional approaches have been refined to solve the path planning problem. Liljebäck monitors in real time the deviation of the centroid of the snake-shaped robot from the expected path; this deviation is used as a joint control parameter to adjust the head movement direction so that the snake-shaped robot follows the desired straight path [11–16]. Rezapour uses the desired angle of the head joint to adjust the swing of the head link so that the centroid of the snake-shaped robot follows the desired straight path [17]. Hasanabadi uses sensors to measure the distance between the robot and the desired path in order to calculate the desired angle, which is then used as the joint control parameter so that the robot follows a straight path and a circular path without side slip [18]. Matsuno introduces a functional model of the desired path to establish the controller so that the robot head follows the desired circular path [19]. Mohammadi controls the velocity components in all directions to make the snake robot’s center of mass follow a circular path, but this requires the path function to be known [20]. Ariizumi established a cost function based on the expected path and the head trajectory and used it as the control parameter so that the snake-shaped robot follows a sinusoidal path [21].
In order to make the snake-shaped robot reach the destination safely and smoothly, the first task is to find proper control parameters for generating goal-directed locomotion. This is a complex and challenging task, and we apply a special type of reinforcement learning, called path-integral reinforcement learning, combined with a time-series analysis method, to this task. Path-integral reinforcement learning is selected because it can reliably handle the “curse of dimensionality” [22] and has proved successful for different robot learning tasks [23].
This paper proposes a collision-free goal-directed locomotion method based on path-integral reinforcement learning in a three-dimensional environment. The novelty and main contributions of this work are as follows. A path-integral reinforcement learning model is established, which describes in detail how the robot accumulates experience from historical data to optimize decision-making. The Q-table is updated based on prior information and provided to the robot as historical experience, which greatly improves the learning efficiency of the robot. To describe the influence of collision avoidance, a perturbation matrix is introduced so that the planned path avoids obstacles safely and eventually reaches the destination. The simulation results show that the planned paths avoid all obstacles smoothly and swiftly and eventually reach the destination.
The rest of the paper is organized as follows: Section 2 describes the kinematic model of the snake-shaped robot. Section 3 focuses on the path-integral-based reinforcement learning algorithm. Section 4 explains goal-directed locomotion of the snake-shaped robot. The simulation results are given in Section 5. Section 6 concludes the paper.
2. Description of the Kinematic Model of the Snake-Shaped Robot
The kinematic model of the snake-shaped robot [7] can be described as follows, where is the state vector and is the control input vector. is the position of the head of the snake-shaped robot. is the orientation of the first screw-drive unit with respect to and is automatically adjusted by the snake-shaped robot kinematics. are the yaw joint angles, i.e., the relative orientations of each screw-drive unit with respect to the previous one. are the angular velocities of the corresponding screw-drive units, counted from the head. are the angular velocities of the joints, starting from , which is the angular velocity of the first joint from the head. and are the system matrices and depend on system configuration parameters such as the screw-drive unit radius, the inclination of the blades, and the length of one screw-drive unit [7].
The learning of locomotion control parameters such as screw velocities and joint angles can be carried out as follows. Suppose that the initial position of the head of the snake-shaped robot is at and the target position of the snake-shaped robot is . The two parameter vectors used to represent the control policy can be described as follows:
The snake-shaped robot’s locomotion control policy following the kinematic model in equation (1) can be represented by U of equation (2). After learning final values of U, the snake-shaped robot will start from the initial position and then move toward the desired goal .
The cost function, or objective function, is a key component that affects the convergence of learning. In this article, the Euclidean distance is used as the cost function as follows:
The cost function provides the distance between the reached head position and the target position of the snake-shaped robot. It can be seen that when the cost function is almost zero, the learned parameter vector will reach the final value. That is to say, the snake-shaped robot reaches the goal and completes the task. Therefore, with the final parameter vector, the robot will move toward the given target point.
In addition, the influence of external noise on path-integral reinforcement learning needs to be considered. As an open parameter in the learning process, the external noise needs to be designed in a task-specific manner. For the path-integral reinforcement learning in this paper, a random value is drawn from the normal distribution , and can be dynamically adjusted according to the noise-free cost obtained at the end of the previous update. The external noise interference can then be expressed as follows, where is the update index, is the rollout index, and is the external noise interference during the -th noisy trajectory (rollout) of the -th update.
Suppose that the noise applied to the screw velocities is , respectively. Similarly, assume that the noise applied to the joint angles is , respectively. All these noise terms follow the above description of the external noise interference in every noisy trajectory , but their patterns differ within a trajectory because is a random value.
The main steps of path-integral reinforcement learning applied to goal-directed locomotion of the snake-shaped robot are as follows. Firstly, the parameter vector is selected according to the learning task. Secondly, the number of rollouts per update is fixed ( in this paper). According to the parameter vector to be learned, in every rollout the trajectory with noisy parameters is simulated starting from the robot’s start position , where represent the noisy screw-velocity parameters and denote the noisy joint angles. The trajectory of the snake-shaped robot is generated using the robot kinematic model of equation (1) and the locomotion control parameters, i.e., the screw velocities and joint angles. Both the noisy screw velocities and the noisy joint angles required for learning U are given in equation (5). From the position reached by the snake-shaped robot’s head at the end of a rollout, the final cost for that rollout is calculated by evaluating the cost function: . In this way, all noisy rollouts from the start position of the snake-shaped robot within one update are completed and the corresponding values are stored. Then, the exponential value can be calculated for each rollout as follows:
Then, the probability weighting for each rollout can be calculated as follows:
Then, corrections for the parameter vector can be calculated based on :
The complete update vector can be constructed according to equation (8):
Then, the control vector at the end of an update can be calculated as .
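To make the update procedure concrete, the following Python sketch illustrates one such update of the parameter vector U. It is a minimal sketch under stated assumptions: the function names (`pi2_update`, `simulate_rollout`), the number of rollouts, the noise scale `sigma`, and the temperature `lam` are illustrative and not values from the paper; `simulate_rollout` stands for integrating the kinematic model of equation (1) with noisy screw velocities and joint angles and returning the final head position.

```python
import numpy as np

def pi2_update(U, simulate_rollout, goal, n_rollouts=10, sigma=0.1, lam=1.0):
    """One path-integral-style update of the control-parameter vector U.

    U               : current parameter vector (screw velocities + joint angles)
    simulate_rollout: hypothetical function; integrates the kinematic model with
                      noisy parameters and returns the final head position
    goal            : target position of the robot's head
    """
    noises, costs = [], []
    for _ in range(n_rollouts):
        eps = np.random.normal(0.0, sigma, size=U.shape)   # external noise (eq. (4))
        head_end = simulate_rollout(U + eps)                # noisy rollout (eq. (5))
        costs.append(np.linalg.norm(head_end - goal))       # Euclidean cost (eq. (3))
        noises.append(eps)
    costs = np.asarray(costs)
    # Exponential values and probability weights (eqs. (6)-(7)); the costs are
    # shifted by their minimum for numerical stability.
    e = np.exp(-(costs - costs.min()) / lam)
    p = e / e.sum()
    # Probability-weighted correction and parameter update (eqs. (8)-(9)).
    dU = sum(pi * eps for pi, eps in zip(p, noises))
    return U + dU
```

In use, this update would be repeated until the noise-free cost is close to zero, i.e., until the head of the snake-shaped robot reaches the goal.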
3. Path-Integral-Based Reinforcement Learning Algorithm
The path-integral algorithm is used to solve the stochastic optimal control problem, and its purpose is to minimize the performance index of a stochastic dynamic system. The analysis and derivation of stochastic optimal control are based on Bellman’s principle of optimality and the Hamilton–Jacobi–Bellman (HJB) equation. In this paper, we first analyze the stochastic optimal control of the mobile robot and then use the generalized path-integral control method to obtain a numerical solution for the stochastic dynamic system.
3.1. Stochastic Dynamical System
In general, the stochastic dynamic system can be expressed as follows [10], where denotes the system state, is the control matrix, represents the passive dynamics, stands for the control vector, and is Gaussian noise with zero mean and variance . For mobile robots, the general stochastic dynamic system can be expressed by the following equation, where and are the linear and angular velocities, denotes the position of the robot in the 2-D plane, and represents its orientation. This paper studies the path planning of the snake robot based on the path-integral RL algorithm. Therefore, the linear velocity is kept constant, and only the angular velocity is used as a control input in order to reduce the number of parameters in the RL problem, thereby accelerating the convergence of training.
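As a concrete illustration of this reduced mobile-robot model (constant linear velocity, angular velocity as the only control input), a minimal Euler-integration sketch is given below; the function name, step size, and noise level are assumptions made for illustration only.

```python
import numpy as np

def step_unicycle(state, omega, v=0.2, dt=0.05, noise_std=0.01):
    """One Euler step of the planar unicycle model with constant linear velocity v.

    state = (x, y, theta): position on the 2-D plane and heading angle
    omega                : angular-velocity control input (the only learned control)
    noise_std            : std. dev. of zero-mean Gaussian control noise
    """
    x, y, theta = state
    w = omega + np.random.normal(0.0, noise_std)   # noisy control, as in the stochastic model
    x += v * np.cos(theta) * dt
    y += v * np.sin(theta) * dt
    theta += w * dt
    return np.array([x, y, theta])
```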
3.2. Stochastic Optimal Control
For a trajectory with start time and end time , its performance function is defined as follows, where is a terminal cost and is a process cost. The immediate cost can be defined as follows, where is the cost function related to the state and is the positive semidefinite weight matrix of the quadratic control cost. The goal of stochastic optimal control is to find the control that minimizes the following performance criterion, where is the expectation over all trajectories starting from . There are two main methods for solving the noiseless optimal control problem: one uses Pontryagin’s minimum principle (PMP), which requires solving a pair of ordinary differential equations, and the other uses the partial differential equation (PDE) given by the HJB equation. For the stochastic dynamic system with noise shown in equation (10), the extension of the HJB method to stochastic optimal control is expressed as follows, where and are the corresponding Jacobian and Hessian matrices and is defined as . The corresponding optimal control can be calculated as [10]. Then, a nonlinear second-order PDE can be obtained as follows:
The value function can be transformed into the exponential form , where is a positive constant; using the assumption from [24, 25], PDE (16) can then be simplified to the following form with the boundary condition :
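For reference, the standard exponential (log) transformation used in the path-integral control literature is reproduced below. This is the textbook form and is only assumed to correspond to the transformation referred to above; the notation ($V$ for the value function, $\Psi$ for its exponential transform, $\lambda$ for the positive constant, $q$ for the state cost, $\phi$ for the terminal cost) is ours, not the paper’s.

```latex
% Standard exponential transformation of the value function and the resulting
% linear PDE (textbook form; notation assumed, not taken from the paper).
\[
  V(\mathbf{x},t) = -\lambda \log \Psi(\mathbf{x},t),
  \qquad
  \lambda \mathbf{R}^{-1} = \Sigma_{\epsilon},
\]
\[
  -\partial_t \Psi
    = -\tfrac{1}{\lambda}\, q(\mathbf{x},t)\, \Psi
      + f(\mathbf{x},t)^{\mathsf T} \nabla_{\mathbf{x}} \Psi
      + \tfrac{1}{2}\operatorname{tr}\!\bigl( \Sigma(\mathbf{x},t)\, \nabla_{\mathbf{x}\mathbf{x}} \Psi \bigr),
  \qquad
  \Psi(\mathbf{x},t_N) = \exp\!\bigl( -\tfrac{1}{\lambda}\, \phi(\mathbf{x}_{t_N}) \bigr).
\]
```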
3.3. Simplification of Path-Integral-Based RL
It is very difficult to obtain an analytical solution of PDE (17), so a commonly used numerical solution is adopted instead, one of which is based on the Feynman–Kac theorem [26] as follows:
After discretization, the stochastic optimal control problem can be approximately transformed into the following path-integral problem, where is a sample trajectory with probability . After some analysis [27], the following numerical solution of the optimal control can be obtained, where is used to generate the trajectory and the probability is defined as follows, where is the generalized path cost defined in [25]. However, the analytical form of the path cost is very time-consuming to compute because it involves many matrix inversions. Considering this fact, in order to simplify the calculation and speed up training, we choose the linear path cost function:
It can be clearly seen that the generalized path-integral control method, which uses a numerical solution of the stochastic dynamic system, avoids the computation of gradients and kinematic inverses, ensuring fast and reliable convergence.
4. Goal-Directed Locomotion Algorithm
In the process of path-integral reinforcement learning for the snake-shaped robot, the reward that the robot receives depends on its own actions. Therefore, this paper extends the Markov decision process (MDP) model of path-integral reinforcement learning to multiple Markov decision processes (MDPs). Suppose that the snake-shaped robot can choose among actions and that the number of states of the snake-shaped robot is . Based on the basic model of path-integral reinforcement learning and the task goal, this paper defines a goal-directed locomotion learning framework for the snake-shaped robot, as shown in Figure 2.

It can be seen from Figure 2 that, in order to improve the learning speed of the robot, environmental factors are first preprocessed to eliminate irrelevant ones, and then a three-dimensional environment model is established. Next, prior information is written into the knowledge base, which improves learning efficiency. In this model, based on the current state , the snake-shaped robot takes an action from the action set according to a certain policy rule and the historical experience in the knowledge base. When the action is executed, the environment transitions to a new state and a new reward value is obtained. Finally, the historical experience is updated through continuous learning, and the knowledge base is further improved. The snake-shaped robot periodically interacts with the environment, repeating the process of “exploration-learning-decision-utilization,” learning from historical actions, and updating its knowledge base as historical experience to guide the next action selection.
4.1. Guidelines for Setting the Reward Function
The reward function defines the snake-shaped robot’s learning goal and determines how it evaluates the states it perceives from the environment; that is, it defines the value of each action. Since the robot tries to maximize the reward, the reward function essentially guides the robot toward its goal. Therefore, the design of the reward function determines the convergence speed and quality of the reinforcement learning algorithm. In this article, the reward function is defined as a sparse reward, as shown in the following, where are constants and .
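A minimal sketch of a sparse reward of this kind is shown below. Because the constants in the original expression are not reproduced here, the values `r_goal`, `r_collision`, and `r_step`, the tolerance `reached_tol`, and the function name are illustrative assumptions only.

```python
import numpy as np

def sparse_reward(position, goal, collided, reached_tol=0.5,
                  r_goal=100.0, r_collision=-100.0, r_step=-1.0):
    """Sparse reward: large positive value on reaching the goal, large negative
    value on collision, small negative step cost otherwise.
    All constants are illustrative placeholders, not the paper's values."""
    if collided:
        return r_collision                                    # hit an obstacle
    if np.linalg.norm(np.asarray(position) - np.asarray(goal)) < reached_tol:
        return r_goal                                         # goal reached
    return r_step                                             # ordinary transition
```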
4.2. Guidelines for Setting the Update Value Function
The goal-directed locomotion of the snake-shaped robot uses the Q-learning algorithm; that is to say, after all the environment states and actions are determined, a matrix is generated, and the element in the matrix represents the value of the robot choosing action in environment state .
The process of updating the value function is as follows: in environment state , the snake-shaped robot selects an action according to the action selection strategy and reaches a new environment state after executing the action. At this time, the value of in the matrix is updated. The update formula for is as follows, where is the attenuation (discount) coefficient and is the learning rate.
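The update of the Q matrix can be sketched as the standard tabular Q-learning rule below, with `alpha` as the learning rate and `gamma` as the attenuation (discount) coefficient; the integer state/action indexing and default parameter values are assumptions for illustration.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])   # best value reachable from s'
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s,a) toward the target
    return Q
```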
4.3. Guidelines for Setting the Action Selection
In the path-integral reinforcement learning problem, exploration and utilization (exploitation) are in conflict: exploration means that the robot must try different behaviors in order to collect more information, whereas utilization means that the robot makes the best decision given the current information. Exploration may sacrifice some short-term benefit in order to obtain more accurate long-term benefit estimates by collecting more information; utilization focuses on maximizing short-term benefit based on the information already available. Exploration cannot be carried out endlessly; otherwise, too much short-term benefit will be sacrificed and the overall benefit will suffer. At the same time, short-term benefit should not be valued so highly that unexplored behaviors which may bring large benefits are ignored. In this paper, -greedy exploration is used, where is the probability that the snake-shaped robot selects an action at random (). With probability , the snake-shaped robot selects the action corresponding to the maximum value; if multiple actions have the same maximum value, one of them is selected at random. With probability , the snake-shaped robot randomly selects an action from the action set.
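The ε-greedy rule described above can be sketched as follows; ties among equally valued actions are broken uniformly at random, as the text requires, and the function name and default ε are illustrative.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit,
    breaking ties among equally valued actions at random."""
    n_actions = Q.shape[1]
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)        # exploration
    best = np.flatnonzero(Q[s] == Q[s].max())      # all maximizing actions
    return int(np.random.choice(best))             # random tie-breaking
```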
4.4. Guidelines for Setting the Collision Avoidance
Collision avoidance must be considered in the process of goal-directed locomotion. To simplify the calculation, the ideal velocity of the robot is assumed to be , and the actual velocity can then be calculated as follows, where is the improved perturbation matrix [28]; the actual velocity is obtained by modifying the target velocity . The planned path can then be obtained by recursive integration of . To describe the influence of collision avoidance, the perturbation matrix is defined as follows, where is an identity matrix, is a column vector, and is a tangential vector (the partial derivative of with respect to ) at the point . is a saturation function defining the orientation of the tangential velocity, and is the 2-norm of a vector or matrix. In addition, and are defined as the weights of and , respectively, where is the repulsive parameter, is the tangential parameter, is the minimum permitted distance between the robot and the obstacle boundary, and is a small positive threshold of the saturation function . Then, the actual velocity can be expressed as
It can be seen from equation (28) that consists of three parts: is the target velocity, is the repulsive velocity, and is the tangential velocity. Similarly, the perturbation matrix can be divided into three parts: the attractive matrix , the repulsive matrix , and the tangential matrix . Therefore, the shape of the path can be readjusted by changing the parameters and . Moreover, the perturbation matrix in this method fully accounts for the shape of the obstacle and the goal position, and it describes the influence of the obstacle on the planned path more objectively.
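Because the exact entries of the perturbation matrix from [28] are not reproduced above, the sketch below only illustrates the structure described in the text: an attractive (identity) part, a repulsive part built from the obstacle normal n, and a tangential part built from the tangential vector t, with distance-dependent weights. The weight functions, parameter values, and names are illustrative assumptions, not the authors’ exact formulas.

```python
import numpy as np

def modulated_velocity(v_target, n, t, dist, d_min=0.2, rho=1.0, sigma=1.0):
    """Actual velocity v = M(p) @ v_target, where M combines attractive,
    repulsive, and tangential parts (structure only; weights are illustrative)."""
    n = n / np.linalg.norm(n)                 # obstacle normal direction (unit)
    t = t / np.linalg.norm(t)                 # tangential direction (unit)
    # Illustrative weights: they grow as the robot approaches the minimum
    # permitted distance d_min and vanish far from the obstacle.
    w_rep = np.exp(-rho * max(dist - d_min, 0.0))
    w_tan = np.exp(-sigma * max(dist - d_min, 0.0))
    M = (np.eye(3)                            # attractive matrix (identity)
         - w_rep * np.outer(n, n)             # repulsive matrix
         + w_tan * np.outer(t, n))            # tangential matrix
    return M @ v_target
```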
Theorem 1. The planned path can avoid obstacles safely.
Proof. The vectors and are exactly perpendicular, i.e., . Suppose that the distance between the robot and the obstacle boundary is close to the minimum permitted distance ; that is, suppose that and is a monotonically decreasing function. It can then be inferred that , so , , and ; therefore, . can be simplified as follows. Therefore, we can infer the following. Because the equation means that , the path stays outside the minimum permitted distance and there is no collision.
Theorem 2. The planned path can reach the destination eventually.
Proof. Because the goal of path planning is to make the robot reach the destination safely, the velocity of the robot should have a component in the direction of the target velocity; that is, the velocities and should satisfy , and the planned path will then converge to the target point. In addition, should be satisfied when the robot is near the destination , where denotes the inner product of vectors and denotes the angle between and . It is obvious that when the robot approaches the destination, . Since and hold, it can be inferred that . From equation (30), it can be inferred that . Therefore, holds. When the robot approaches the destination, that is, and , then according to equation (29), it can be inferred that . Therefore, holds.
4.5. The Algorithm Steps of Goal-Directed Locomotion
In the process of goal-directed locomotion, the snake-shaped robot continuously interacts with the environment according to the current environment state and updates and revises the learning results to guide its action selection. Finally, the robot finds a set of action sequences that maximizes the reward and completes the goal-directed locomotion. The pseudocode of this method is shown in Algorithm 1.
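A high-level sketch of this exploration-learning-utilization loop is given below; it is not the authors’ Algorithm 1 but an illustrative composition of the reward, Q-update, and ε-greedy sketches above. The environment interface (`env.reset`, `env.step`) is a hypothetical placeholder.

```python
import numpy as np

def train_goal_directed(env, n_states, n_actions, episodes=1000,
                        alpha=0.1, gamma=0.9, epsilon=0.1):
    """Q-learning loop for goal-directed locomotion (illustrative sketch only);
    reuses the q_update and epsilon_greedy sketches from Sections 4.2 and 4.3."""
    Q = np.zeros((n_states, n_actions))           # knowledge base (Q-table)
    for _ in range(episodes):
        s = env.reset()                           # initial state
        done = False
        while not done:
            a = epsilon_greedy(Q, s, epsilon)             # action selection
            s_next, r, done = env.step(a)                 # interact with environment
            Q = q_update(Q, s, a, r, s_next, alpha, gamma)  # update knowledge base
            s = s_next
    return Q
```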
5. Simulation Examples
5.1. Build the 3D Environment Model
To guarantee real-time performance, the parameters are first learned offline for different initial states and then adjusted online according to the robot’s states. To verify the effectiveness of the method proposed in this paper, a virtual environment containing several obstacles of different shapes is set up. Suppose that is the robot position and is the obstacle center, with axis lengths , , and index parameters , , :
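The description (obstacle center, axis lengths, and index parameters) is consistent with a superellipsoid-style obstacle function commonly used in 3D path planning; the sketch below assumes that form and is only illustrative of how such an environment model can be evaluated, with all names and example values being assumptions.

```python
import numpy as np

def obstacle_value(p, center, axes, indices):
    """Superellipsoid-style obstacle function (assumed form):
    value > 1 outside the obstacle, = 1 on its surface, < 1 inside."""
    (x, y, z), (x0, y0, z0) = p, center
    (a, b, c), (p1, p2, p3) = axes, indices
    return (((x - x0) / a) ** (2 * p1)
            + ((y - y0) / b) ** (2 * p2)
            + ((z - z0) / c) ** (2 * p3))

# Example: check whether a point lies outside an obstacle (illustrative values).
outside = obstacle_value((1.0, 2.0, 0.5), (0.0, 0.0, 0.0),
                         (2.0, 1.0, 1.0), (1, 1, 2)) > 1.0
```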
Then, the parameters of the path-integral-based reinforcement learning algorithm are given in Table 1.
Suppose that there are eight obstacles in the environment, as shown in Table 2, and that and .
5.2. Collision-Free Goal-Directed Locomotion When the Robot Does Not Need to Climb Obstacles
The method proposed in this paper adds prior information and uses historical experience as decision support, so collisions are avoided during path planning. After 660 rounds of training, the robot can successfully reach the target point. When there is no trap area, the robot does not need to climb obstacles and can avoid all obstacles smoothly and swiftly and eventually reach the destination. Suppose that the starting positions of the robots are and the target position is . The planned path is shown in Figure 3.

It can be seen from Figure 3 that when the snake-shaped robot starts from different starting points, the algorithm proposed in this paper can avoid all obstacles smoothly and swiftly and eventually reach the destination.
5.3. Comparison with the Ant Colony Algorithm
Using Taguchi’s experimental design method and other intelligent optimization methods [29, 30], the ant colony algorithm is compared with the algorithm proposed in this paper. Suppose that the starting position of the robot is and the target position is . In the ant colony algorithm used in this paper, the number of ants is 100, the pheromone attenuation ratio after each iteration is 0.5, and the pheromone increase ratio each time an ant traverses a path is 2. It can be seen from Figure 4 that when the 3D ant colony algorithm is used for fast path planning, the planned path differs with the number of iterations, the number of ants, and so on. Moreover, the 3D ant colony algorithm generally only seeks the shortest path, so the planned path is not very smooth, as shown in Figure 5.
It can be seen from Figures 4 and 5 that, for the same number of iterations, the best individual fitness curve varies with the number of ants. In other words, the ant colony algorithm relies heavily on prior knowledge, so it has shortcomings for fast path planning. The method proposed in this paper takes prior information into account, uses historical experience as decision support, and introduces a climbing coefficient so that obstacles can be successfully overcome. After more than 900 rounds of training, the two robots reached the target point safely without collision. In the process of goal-directed locomotion, the saturation represents the climbing ability of the robots. Furthermore, different values of lead to different planned paths and cause different climbing difficulties and climbing heights. Therefore, we need to consider how to choose an appropriate coefficient to plan the most effective path. Suppose that the starting position of the robot is and the target position is . The planned path is shown in Figure 6.

Figure 6 shows the paths planned for different values of the saturation ; the saturation of the robot is . Both paths can bypass obstacles, escape the trap area, and reach the destination.
5.4. The Situation in Which the Robot Encounters a Trap Area
During goal-directed locomotion, the robot may encounter a trap area. It is necessary to consider the distribution of obstacles and the climbing ability of the robot; otherwise, the goal-directed locomotion cannot be carried out in a timely and effective manner. Suppose that there are six obstacles in the environment, as shown in Table 3, and that and . Suppose that the starting position is and the target position is .
It can be seen from Figure 7 that the trap area is formed by two different obstacles, and the snake-shaped robot will fall into the trap area. Therefore, we need to consider the climbing ability of the robot during path planning in the 3D environment.

6. Conclusions
This paper proposes a collision-free goal-directed locomotion method based on path-integral reinforcement learning in a three-dimensional environment. First, a path-integral reinforcement learning model is established, which describes in detail how the robot accumulates experience from historical data to optimize decision-making. In order to improve the learning efficiency of the method, some irrelevant environmental conditions are removed. The Q-table is updated based on prior information and provided to the robot as historical experience, which greatly improves the learning efficiency of the robot. The simulation results show that the planned paths avoid all obstacles smoothly and swiftly and eventually reach the destination.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (no. 61304088), the Fundamental Research Funds for the Central Universities (no. 2013QNA37), the China Postdoctoral Science Foundation (no. 2015M581886), and Jiangsu Province Postdoctoral Research Foundation (no. 1501080B).