Abstract
In this paper, a novel guidance law based on a reinforcement learning (RL) algorithm is presented to deal with the maneuvering target interception problem using deep deterministic policy gradient neural networks. We take the missile’s line-of-sight (LOS) rate as the observation of the RL algorithm and propose a novel reward function, constructed from the miss distance and the LOS rate, to train the neural network off-line. In the guidance process, the trained neural network maps the missile’s LOS rate directly to the missile’s normal acceleration, so as to generate guidance commands in real time. Under the actor-critic (AC) framework, we adopt the twin-delayed deep deterministic policy gradient (TD3) algorithm, which takes the minimum value between a pair of critics to reduce overestimation. Simulation results show that the proposed TD3-based RL guidance law outperforms the state-of-the-art RL guidance law, copes better with continuous action and state spaces, and converges faster to a higher reward. Furthermore, the proposed RL guidance law achieves better accuracy and robustness when intercepting a maneuvering target, and the LOS rate converges.
1. Introduction
Among modern missile missions, improving the accuracy of the guidance system is the most important and difficult task. The main issue is designing the guidance law, which plays a central role in the missile guidance system, since it directly affects the relative motion of the missile and target and has a large effect on the final miss distance. The proportional navigation guidance (PNG) law has been widely used in many different types of aircraft for a long time. Although PNG algorithms have achieved excellent performance in many works, they still suffer from insufficient terminal guidance capability due to their own properties. In particular, PNG easily produces a divergent acceleration command in the terminal guidance stage since the commanded acceleration is proportional to the LOS rate. As a result, researchers have developed various sophisticated guidance strategies, such as sliding-mode guidance laws [1–3] and finite-time convergent guidance laws [4–6], to further improve performance. Some researchers have attempted to merge early artificial intelligence ideas in order to design new intelligent algorithms. For instance, Hossain et al. [7] employ a genetic algorithm to generate training data for neural networks and then apply the neural networks to optimize guidance commands based on the current states and terminal conditions. For irregular and difficult guidance situations, Kasmaiee et al. [8] couple two types of computational intelligence algorithms, neural networks and genetic algorithms, for optimization. In [9, 10], a genetic algorithm was implemented as the optimization method; since a large number of numerical simulations were required for this purpose, an artificial neural network was employed to learn a mapping between the control parameters and the airfoil aerodynamic coefficients. Kasmaiee and Tadjfar’s use of image processing in aerospace applications and in improving the efficiency of a spraying system is fully described in [11, 12]. Reference [13] proposes a new guidance law based on the fuzzy logic method. Li et al. [14] consider target acceleration as a bounded disturbance, use a primal-dual neural network to solve the optimal solution under this constraint, and develop a guidance law based on model predictive control.
Nowadays, thanks to their powerful nonlinear approximation and data representation capabilities, data-driven RL methods have attracted considerable attention in the design of guidance laws, and model-free RL has been applied to challenging constrained, uncertain, multidimensional, and nonphysical systems to estimate the long-term behaviour of the system under the current policy and to determine optimal future policies under unknown disturbances [15]. In [16–18], Yang et al. proposed three novel algorithmic frameworks for optimal control problems based on the Hamilton-Jacobi-Bellman (HJB) equation. As a computational method of learning through environmental interaction, the Q-learning algorithm was first presented to solve problems with discrete (noncontinuous) control. In [19], an improved path planning method is proposed, in which the authors combine a fuzzy Q-learning method with a simulated annealing algorithm in the action search policy to balance action exploration and exploitation, and guidance and obstacle avoidance information is also used to design a reward function for UAV local path planning in an unknown environment. In [20, 21], the authors applied a Q-learning algorithm to the path planning of multiple aircraft formation flights. The authors in [22, 23] use the Q-learning algorithm to design reinforcement learning guidance laws in two-dimensional and three-dimensional simulation environments, respectively, and compare them with proportional navigation, showing that the guidance law based on the RL algorithm has better accuracy. However, both works regard the discretized LOS rate as the state space and the discretized normal acceleration as the action space, which is obviously inconsistent with the actual situation. In [24], the authors solved the problem of discontinuous normal acceleration by discretizing the proportionality coefficient as the action space; however, such a guidance law has the same defects as PNG, and it cannot solve the problem of LOS rate divergence in specific circumstances. To overcome the discontinuity problem of the Q-learning algorithm, the authors of [25] use a convolutional neural network to approximate the action value function, which handles continuous state spaces and achieves better results than Q-learning in Atari games. In [26], the authors propose a time-controllable reentry guidance law based on the Deep Q-Network (DQN) algorithm: a neural network generates the bank angle command online, which is then combined with the amplitude information to form the final bank angle command, so that the designed reentry guidance law performs well in task adaptability, robustness, and time controllability. Schulman et al. [27] propose the Proximal Policy Optimization (PPO) algorithm based on the AC framework, which handles continuous action spaces very well. Gaudet et al. [28] improve the PPO algorithm and propose a three-dimensional guidance law design framework based on reinforcement learning for interceptors trained with PPO. Comparative analysis shows that the derived guidance law also has better performance and efficiency than the extended Zero-Effort Miss (ZEM) policy.
This is a good solution to the problem of continuous control, but the PPO algorithm uses a stochastic policy for exploration and exploitation; estimating an accurate gradient with this method requires a large number of random actions, and such random operation therefore reduces the algorithm’s convergence speed. Lillicrap et al. [29] propose the deep deterministic policy gradient (DDPG) algorithm, also based on the AC framework. Compared with PPO, DDPG adopts a deterministic policy for exploration and uses experience replay to improve the efficiency of training samples, which greatly improves the convergence speed of the algorithm. By improving the DDPG algorithm, the authors of [30] propose an online path planning method based on deep reinforcement learning for UAV maneuvering, target tracking, and obstacle avoidance control. The authors of [31] propose a computational guidance algorithm for missile-target interception based on the DDPG algorithm. In [32, 33], the authors also propose deep reinforcement learning guidance algorithms based on DDPG by improving the reward function. In [34], the authors propose a terminal guidance law based on the DDPG algorithm; by designing the environment state and action for the interception problem, a guidance law that learns an optimal reward from the interactive data of the simulation environment is realized. In [35], the authors design a missile guidance law using DDPG for maneuvering target interception, but it considers only a single maneuvering mode. However, in reinforcement learning algorithms based on value learning, such as DQN [25], function approximation errors lead to value overestimation and suboptimal policies. Fujimoto et al. showed in [36] that the structure that causes value overestimation and error accumulation also exists in the AC framework, so they proposed the twin-delayed deep deterministic policy gradient (TD3) algorithm. The algorithm limits value overestimation by selecting the minimum of the estimates produced by two critic networks and uses a delayed update policy to reduce the error of each update; it has been verified on a set of OpenAI Gym tasks and achieved state-of-the-art performance in each task environment. In [37], the authors propose autonomous navigation of UAVs in multiobstacle environments based on TD3. Therefore, we apply TD3 instead of DDPG to improve estimation accuracy and robustness.
The main contributions are summarized as follows:
(i) This paper presents a new framework for missile guidance law training. To address the overestimation bias that can arise in previous RL guidance laws, the minimum value between two critics is adopted to reduce overestimation and improve the robustness and accuracy of the action output.
(ii) Based on an analysis of the PNG algorithm, a reward function is designed that drives the missile toward a decreasing LOS rate and higher accuracy.
(iii) The open-source Python toolkit Gym is used to build a digital simulation environment with continuous action and state spaces. The simulation results show that, compared with the RL guidance law based on the DDPG algorithm, our guidance law converges faster and attains higher returns in the same environment, achieves higher accuracy than PNG, and yields better convergence of the normal acceleration.
Section 2 discusses the origin of the problem and previews the basics, including the relative motion model and the Markov decision process. Section 3 establishes the guidance problem under the RL framework and deduces the algorithm principle. Section 4 discusses and analyzes the simulation results of this paper. Section 5 discusses the conclusions from this work.
2. Preliminaries and Problem Setup
2.1. Relative Kinematics and Guidance Problem
To simplify the problem, our work focuses only on the two-dimensional relative motion between the missile and the target and uses a dynamic analysis method to study the guidance trajectory. The simulated model does not consider gravity, thrust, or drag.
The engagement scenario between the missile and the target in this paper is shown in Figure 1.

In Figure 1, $Oxy$ denotes the inertial coordinate frame, $M$ and $T$ refer to the interceptor and the target, $V_M$ and $V_T$ represent the velocities of the missile and the target, respectively, $q$ is the line-of-sight angle between the missile and the target, $\theta_M$ and $\theta_T$ are the flight path angles of the missile and the target, respectively, $\eta_M$ and $\eta_T$ are the leading angles of the missile and the target, respectively, and $a_M$ and $a_T$ represent the normal accelerations of the missile and the target. In this paper, it is assumed that both the missile and the target are only subjected to a normal overload perpendicular to the direction of velocity; that is, the overload is only used to change the direction of each velocity vector.
In a 2D coordinate system, the kinematic equations of relative motion between the missile and the target are as follows:

$$
\begin{aligned}
\dot{x}_M &= V_M\cos\theta_M, & \dot{y}_M &= V_M\sin\theta_M,\\
\dot{x}_T &= V_T\cos\theta_T, & \dot{y}_T &= V_T\sin\theta_T,\\
\dot{r} &= V_T\cos\eta_T - V_M\cos\eta_M, & &\\
r\dot{q} &= V_M\sin\eta_M - V_T\sin\eta_T, & &
\end{aligned}
\tag{1}
$$

where $r$ is the relative distance between the missile and the target, $\dot{r}$ is the relative (closing) speed of the missile and the target, $\dot{q}$ is the LOS rate, $(x_T, y_T)$ is the position coordinate of the target, and $(x_M, y_M)$ is the position coordinate of the missile.
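To make the engagement model concrete, the following is a minimal sketch (not the authors' code) of one integration step of the planar kinematics in equation (1); the function name, the forward Euler scheme, and the step size dt are illustrative assumptions.

```python
# Minimal sketch: one Euler step of the planar engagement kinematics (eq. (1)).
import numpy as np

def engagement_step(xm, ym, theta_m, vm, xt, yt, theta_t, vt, am, at, dt=0.01):
    """Advance missile/target states by dt; am, at are normal accelerations."""
    theta_m += am / vm * dt          # normal acceleration only rotates velocity
    theta_t += at / vt * dt
    xm += vm * np.cos(theta_m) * dt
    ym += vm * np.sin(theta_m) * dt
    xt += vt * np.cos(theta_t) * dt
    yt += vt * np.sin(theta_t) * dt
    # Relative geometry: range r, LOS angle q, range rate r_dot, LOS rate q_dot.
    dx, dy = xt - xm, yt - ym
    r = np.hypot(dx, dy)
    q = np.arctan2(dy, dx)
    vx_rel = vt * np.cos(theta_t) - vm * np.cos(theta_m)
    vy_rel = vt * np.sin(theta_t) - vm * np.sin(theta_m)
    r_dot = (dx * vx_rel + dy * vy_rel) / r
    q_dot = (dx * vy_rel - dy * vx_rel) / r**2
    return xm, ym, theta_m, xt, yt, theta_t, r, q, r_dot, q_dot
```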
At present, the most commonly used guidance method is PNG, which steers the missile toward the target by commanding a normal acceleration proportional to the LOS rate. To this end, the output of PNG is the normal acceleration command $a_c$, described as

$$a_c = N\,|\dot{r}|\,\dot{q}, \tag{2}$$

where $a_c$ denotes the normal acceleration command, $N$ denotes the proportionality (navigation) coefficient, and $\dot{r}$ denotes the relative speed between the missile and the target.
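As a concrete illustration, the PNG command of equation (2) reduces to a single expression; the sketch below uses illustrative variable names, not the authors' implementation.

```python
# Sketch of the PNG command in equation (2).
def png_command(N, r_dot, q_dot):
    """Return a_c = N * |r_dot| * q_dot (normal acceleration command, m/s^2)."""
    return N * abs(r_dot) * q_dot
```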
We differentiate the last formula in equation (1) with respect to time. Since only normal accelerations act in our model, the speeds $V_M$ and $V_T$ are constant, and using $\dot{\eta}_M = \dot{q} - \dot{\theta}_M$, $\dot{\theta}_M = a_M/V_M$, $\dot{\eta}_T = \dot{q} - \dot{\theta}_T$, and $\dot{\theta}_T = a_T/V_T$, the LOS rate dynamics are obtained:

$$\ddot{q} = -\frac{2\dot{r}}{r}\dot{q} - \frac{\cos\eta_M}{r}a_M + \frac{\cos\eta_T}{r}a_T. \tag{3}$$

Substituting the PNG command of equation (2) for $a_M$ in equation (3), and noting that $\dot{r} < 0$ during the engagement so that $|\dot{r}| = -\dot{r}$, gives

$$\ddot{q} = \frac{\dot{r}}{r}\left(N\cos\eta_M - 2\right)\dot{q} + \frac{\cos\eta_T}{r}a_T. \tag{4}$$

Suppose now that the target moves in a straight line at a constant speed and the missile speed is also kept constant; then $a_T = 0$ and the last term of equation (4) vanishes. Thus, equation (4) can be written as

$$\ddot{q} = \frac{\dot{r}}{r}\left(N\cos\eta_M - 2\right)\dot{q}. \tag{5}$$

From equation (5), it is easy to see that when $N\cos\eta_M > 2$, the coefficient $\frac{\dot{r}}{r}\left(N\cos\eta_M - 2\right)$ is negative because $\dot{r} < 0$, so the sign of $\ddot{q}$ is opposite to that of $\dot{q}$. Therefore, $\dot{q}$ is driven toward zero, the missile’s required normal acceleration decreases as $\dot{q}$ decreases, and the missile trajectory becomes flat; this is the desirable property of PNG.
However, the above conclusion holds only under the assumptions that the missile moves at a constant speed and the target flies in a straight line at a constant speed. In the real world, the engagement is far more complex, so the actual behavior can be quite different.
2.2. Markov Decision Process
Most reinforcement learning algorithms are based on the Markov decision process (MDP). When an agent conducts reinforcement learning, the MDP model is often used to establish a mathematical model for decision problems with uncertain state transition probability, state space, and action space, and the reinforcement learning problem is solved by solving this mathematical model.
An MDP consists of a five-tuple $\langle S, A, P, R, \gamma\rangle$. The elements of the five-tuple can be described as follows.
$S$ represents the state set of the environment, and the state refers to the information useful for decision-making that the agent can obtain. In the reinforcement learning framework, the agent relies on the current state to make decisions.
$A$ represents the action set of the agent. It is the set of actions the agent can choose from in the current reinforcement learning task.
$P$ represents the state transition probability. $P(s' \mid s, a)$ represents the probability that, in the current state $s$, the environment transfers to another state $s'$ after action $a$.
Given a policy $\pi$ and an MDP $\langle S, A, P, R, \gamma\rangle$, the probability of a state transfer from $s$ to $s'$ when executing policy $\pi$ is the sum, over all actions, of the probability of executing an action times the probability that the action causes a transfer from $s$ to $s'$:

$$P^{\pi}_{ss'} = \sum_{a\in A}\pi(a\mid s)\,P(s'\mid s,a).$$

$R$ is the reward function: $R(s, a)$ represents the reward obtained after taking action $a$ in the current state $s$. Under policy $\pi$, the expected reward in state $s$ is

$$R^{\pi}_{s} = \sum_{a\in A}\pi(a\mid s)\,R(s,a).$$

Finally, $\gamma \in [0, 1]$ is the discount factor. The purpose of the discount factor is to take future returns into account when calculating the cumulative return in the current state.
Since this article uses a model-free reinforcement learning method, the state transition probability is not involved. Therefore, we adopt a simplified MDP consisting of a four-tuple $\langle S, A, R, \gamma\rangle$. Next, we design the corresponding items for the problem at hand.
3. Guidance Law with TD3 Algorithm
3.1. Markov Decision Process Design
The environmental information observed by the agent describes the environmental state to a certain extent. If the simulation covers all of the state space encountered by the guidance law in practice with sufficient sampling density, the obtained guidance law will be optimal relative to the modeled missile and environmental dynamics. However, in order to better solve the problem in this paper, information that may interfere with the decision-making task should not be included in the state set $S$, so we choose only the LOS rate as the state space, which still covers the whole guidance process. To this end, we set $s = \dot{q}$. Taking into account that the characteristics of the projectile itself impose certain restrictions on overload, the LOS rate is usually in the range of $(-0.5, 0.5)$ rad/s.
Traditional PNG takes the relative speed and the LOS rate as inputs and outputs the normal acceleration. Our goal is to use the neural network to map the LOS rate directly to the normal acceleration, so the action $a$ is the normal acceleration command, i.e., $a = a_M$. However, in order to ensure the normal operation of its components, an actual aircraft must consider the impact of overload and limit it to a certain range; here the overload is limited to 10 times the acceleration of gravity $g$, i.e., $a \in [-10g, 10g]$.
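As an illustration, the observation and action spaces described above could be declared in Gym as follows; the bounds follow the text (LOS rate in $(-0.5, 0.5)$ rad/s, normal acceleration limited to $\pm 10g$), while the variable names and dtype are assumptions.

```python
# Sketch of the continuous observation/action spaces of the guidance MDP.
import numpy as np
from gym import spaces

G = 9.81  # acceleration of gravity, m/s^2

observation_space = spaces.Box(low=-0.5, high=0.5, shape=(1,), dtype=np.float32)        # LOS rate
action_space = spaces.Box(low=-10.0 * G, high=10.0 * G, shape=(1,), dtype=np.float32)   # normal accel.
```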
The reward function serves as the feedback signal from the environment that indicates to the missile whether its performance is good or bad. In order to address the two objectives of driving the LOS rate to converge to 0 and improving guidance accuracy, the reward function designed in this work is divided into two parts.
The first part is a per-step shaping reward generated at the current moment: the smaller $|\dot{q}|$ is, the higher the reward, with an upper limit of 100. The second part is a terminal reward, which is generated only when the distance error falls below a threshold. Since the estimated minimum miss distance is expected to reach 0.01 m, the upper limit of this reward is also controlled, and the smaller the final miss distance is, the higher the terminal reward the agent receives, which encourages the agent to explore toward higher accuracy.
To this end, the final reward is the sum of the two parts, with the terminal reward added only when the termination condition is true.
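The exact reward expressions are defined by the paper's equations; the snippet below is only an illustrative sketch with the same structure (a per-step term that grows as $|\dot{q}|$ shrinks, capped at 100, plus a terminal term that grows as the miss distance shrinks, with 0.01 m as the floor mentioned above). The functional forms and the coefficients k and c are assumptions.

```python
# Illustrative reward structure (not the authors' exact formula).
def step_reward(q_dot, k=1.0):
    """Per-step shaping term: larger when |LOS rate| is smaller, capped at 100."""
    return min(100.0, k / (abs(q_dot) + 1e-3))

def terminal_reward(miss_distance, c=10.0):
    """Terminal term: larger when the miss distance is smaller; 0.01 m floor bounds it."""
    return c / (miss_distance + 0.01)
```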
3.2. Reinforcement Learning
Reinforcement learning considers the paradigm of an agent interacting with its environment, and its purpose is to make the agent take the actions that maximize its benefit, so as to obtain an optimal policy. Based on the above MDP design, continuous time is divided into discrete moments $t = 0, 1, 2, \ldots$ At each moment, the agent, following policy $\pi$, selects an action $a_t$ according to the state $s_t$ and then interacts with the environment to obtain the reward $r(s_t, a_t)$ of this step and the next state $s_{t+1}$. The return is defined as the discounted sum of the rewards:

$$R_t = \sum_{i=t}^{T}\gamma^{\,i-t}\,r(s_i, a_i).$$

Here $\gamma \in [0, 1]$ is a discount factor used to determine the priority of short-term rewards, $r(s_t, a_t)$ is the reward at time $t$, and $R_t$ is the cumulative discounted sum of the rewards from time $t$ onward.
The action value $Q^{\pi}(s_t, a_t)$ is defined as the expected return $R_t$ obtained when the agent takes action $a_t$ in state $s_t$ at time $t$ and follows policy $\pi$ thereafter. This value is provided by the critic function.
According to the meaning of the accumulated rewards above, reinforcement learning defines the action value function via the Bellman equation, which provides an evaluation value for a given state-action pair of the agent, as shown in the following:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{r_t,\,s_{t+1}}\!\left[r(s_t, a_t) + \gamma\,\mathbb{E}_{a_{t+1}\sim\pi}\!\left[Q^{\pi}(s_{t+1}, a_{t+1})\right]\right].$$

Deterministic policy gradient (DPG) algorithms [38] map the state to a deterministic action by expressing the policy as a policy function $\mu(s)$. When the policy is deterministic, the Bellman equation for the action value function becomes

$$Q^{\mu}(s_t, a_t) = \mathbb{E}_{r_t,\,s_{t+1}}\!\left[r(s_t, a_t) + \gamma\,Q^{\mu}\!\left(s_{t+1}, \mu(s_{t+1})\right)\right].$$

In the RL framework, the agent’s goal is to find an optimal policy $\mu_{\phi}$ with parameter $\phi$ that maximizes the expected total reward received over a sequence of time steps:

$$J(\phi) = \mathbb{E}_{s_i\sim p_{\pi},\,a_i\sim\mu_{\phi}}\!\left[R_0\right].$$

Supposing the policy $\mu_{\phi}$ is deterministic, the gradient of the action value function with respect to the parameter $\phi$ can easily be calculated by the chain rule as follows:

$$\nabla_{\phi} J(\phi) = \mathbb{E}\!\left[\nabla_{a} Q^{\mu}(s, a)\big|_{a=\mu_{\phi}(s)}\,\nabla_{\phi}\mu_{\phi}(s)\right].$$
TD3 uses an off-policy scheme to calculate the deterministic policy gradient. Because the deterministic policy itself is not exploratory, the data it generates lacks diversity, and learning directly from it is not feasible. In the off-policy method, the agent uses an exploratory behavior policy to generate data and calculates the policy gradient based on these data. We use $\beta$ to represent the behavior policy, and the policy parameter $\phi$ is updated with the following gradient term:

$$\nabla_{\phi} J(\phi) \approx \mathbb{E}_{s\sim\rho^{\beta}}\!\left[\nabla_{a} Q^{\mu}(s, a)\big|_{a=\mu_{\phi}(s)}\,\nabla_{\phi}\mu_{\phi}(s)\right].$$
3.3. Network Structure of TD3
The network structure of the TD3 algorithm can be described in the following:
As illustrated in Figure 2, the input of the actor network is the environment state and its output is the agent’s action; the input of the critic network is the state and action, and its output is the corresponding $Q$ value. The purpose of the actor network is to output, for a state $s$, the action that maximizes $Q(s, a)$: the larger the $Q$ value calculated by the critic is, the better the actor has been trained. The purpose of the critic network is to output the action value $Q(s, a)$ for the state-action pair $(s, a)$. The difference between the actor network and the target actor network is that the actor network is updated from the replay buffer at every step, while the target actor network copies the actor’s network parameters only at intervals. This lagged update ensures the stability of training the actor network. The purpose of the target critic networks is the same as that of the target actor network: an infrequently updated copy helps the critic networks converge stably. The soft update between them is

$$\theta' \leftarrow \tau\theta + (1-\tau)\theta', \qquad \phi' \leftarrow \tau\phi + (1-\tau)\phi',$$

where $\theta$ and $\phi$, respectively, are the parameters of the critic and the actor network, $\theta'$ and $\phi'$ are the parameters of the corresponding target networks, and $\tau \ll 1$ is the soft update rate.
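A minimal sketch of this soft (Polyak) update, assuming PyTorch modules, is given below; the default value of the soft-update rate tau is an assumption.

```python
# Sketch of the soft (Polyak) target-network update.
import torch

def soft_update(net: torch.nn.Module, target_net: torch.nn.Module, tau: float = 0.005):
    """Blend the online network's parameters into the target network."""
    with torch.no_grad():
        for p, p_targ in zip(net.parameters(), target_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)
```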

Figure 3 shows the process of generating experience. Given a state $s$, the actor network produces an action $\mu(s)$, to which noise is added to obtain the executed action $a$ (the noise ensures a certain amount of exploration). We then input $a$ into the environment to obtain the reward $r$ and the next state $s'$, which yields an experience tuple $(s, a, r, s')$ that is stored in the replay buffer.

The significance of the replay buffer is to remove the correlation between experiences: in reinforcement learning, adjacent transitions are usually strongly correlated, so breaking them up in the replay buffer and training the neural network on randomly sampled batches of experiences leads to better training.
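A minimal sketch of the replay buffer and of the noisy action selection of Figure 3 is shown below; all names, the buffer capacity, and the batch size are illustrative assumptions.

```python
# Minimal replay buffer and noisy (exploratory) action selection.
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buf.append((s, a, r, s_next, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buf, batch_size)
        s, a, r, s_next, done = map(np.asarray, zip(*batch))
        return s, a, r, s_next, done

def noisy_action(actor, s, noise_std, a_max):
    """Deterministic action mu(s) plus Gaussian exploration noise, clipped to limits."""
    a = np.asarray(actor(s))
    a = a + np.random.normal(0.0, noise_std, a.shape)
    return np.clip(a, -a_max, a_max)
```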
The actor network update procedure is shown in Figure 4, in which the known items are those taken from the replay buffer. We take out a batch of experiences from the replay buffer; here a single experience $(s, a, r, s')$ is used as an example to describe the process of training the neural network.

According to the description of the actor network above, the loss function of the actor network is the negative of the critic’s estimate, $-Q_1(s, \mu_{\phi}(s))$; the smaller this loss is (i.e., the larger the $Q$ value), the better the performance. The $Q$ value needs to be obtained from the critic1 network, as shown in Figure 5.

We input the state $s$ from the experience into the actor network to obtain the predicted action $\mu_{\phi}(s)$ without noise, feed $s$ and $\mu_{\phi}(s)$ into the critic1 network to obtain the value $Q_{\theta_1}(s, \mu_{\phi}(s))$, and then use its negative as the loss function to modify the actor network:

$$L_{\text{actor}}(\phi) = -\,Q_{\theta_1}\!\left(s, \mu_{\phi}(s)\right).$$

As mentioned in Section 3.2, the parameters of the actor network are updated through the deterministic policy gradient, so minimizing this loss by gradient descent corresponds to ascending the gradient

$$\nabla_{\phi} J(\phi) = \nabla_{a} Q_{\theta_1}(s, a)\big|_{a=\mu_{\phi}(s)}\,\nabla_{\phi}\mu_{\phi}(s).$$
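The actor update described above can be sketched as follows, assuming PyTorch networks and an optimizer named actor_optim; minimizing the negative critic1 output implements the deterministic policy gradient ascent.

```python
# Sketch of the actor update: gradient descent on -Q1(s, mu(s)).
def update_actor(actor, critic1, actor_optim, states):
    actions = actor(states)                       # predicted actions, no noise
    actor_loss = -critic1(states, actions).mean() # smaller loss <=> larger Q value
    actor_optim.zero_grad()
    actor_loss.backward()                         # chain rule: dQ/da * da/dphi
    actor_optim.step()
    return actor_loss.item()
```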
As shown in Figure 5, the known items are again those taken from the replay buffer. This subsection describes the critic network update procedure, and we still use the experience $(s, a, r, s')$ as an example to describe the process of training the critic networks.
As shown in Figure 5, the TD3 algorithm uses two target critic networks, considering that a single critic network tends to overestimate the $Q$ value in practice. It borrows the idea of double DQN (DDQN) [39] and adopts two networks to estimate the value, choosing the smaller estimate to avoid overestimating the value as much as possible.
Because two target critic networks are used, there is a corresponding pair of frequently updated critic networks, and the shared target value is formed with the smaller of the two target estimates:

$$y = r + \gamma\,\min_{i=1,2} Q_{\theta_i'}\!\left(s', \tilde{a}\right), \qquad \tilde{a} = \mu_{\phi'}(s') + \epsilon,$$

where $\epsilon$ is a small clipped noise added to the target action. The target $y$ is compared with $Q_{\theta_1}(s, a)$ and $Q_{\theta_2}(s, a)$, respectively, and the resulting mean square errors are used as the loss function for gradient descent:

$$L(\theta_i) = \left(y - Q_{\theta_i}(s, a)\right)^2, \qquad i = 1, 2.$$

By differentiating this loss, the gradient of the $Q$ value with respect to the critic parameters $\theta_i$ is obtained and used to update the two critic networks.
In addition, note that the critic module actually trains the parameters of two critic neural networks. The parameters of the target actor network and the two target critic networks are updated from the actor network and the two critic networks through the soft update described above, and noise is added to the predicted action of the target actor network before it is fed into the two target critic networks as the action $\tilde{a}$, which smooths the estimate of the next-step value.
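Putting the pieces together, a sketch of the twin-critic update is given below, assuming PyTorch tensors, pre-built target networks, and a single optimizer over both critics; the hyperparameter values (discount, smoothing noise, clip range, action bound) are assumptions.

```python
# Sketch of the twin-critic update: target policy smoothing + clipped double-Q target.
import torch
import torch.nn.functional as F

def update_critics(critic1, critic2, critic_optim,
                   target_actor, target_critic1, target_critic2,
                   s, a, r, s_next, done,
                   gamma=0.99, noise_std=0.2, noise_clip=0.5, a_max=98.1):
    with torch.no_grad():
        noise = (torch.randn_like(a) * noise_std).clamp(-noise_clip, noise_clip)
        a_next = (target_actor(s_next) + noise).clamp(-a_max, a_max)
        q_next = torch.min(target_critic1(s_next, a_next),
                           target_critic2(s_next, a_next))
        y = r + gamma * (1.0 - done) * q_next          # clipped double-Q target
    loss = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)
    critic_optim.zero_grad()
    loss.backward()
    critic_optim.step()
    return loss.item()
```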
The detailed pseudocode of TD3 is summarized in Algorithm 1.
Algorithm 1: Pseudocode of the TD3-based guidance law training.
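Since the pseudocode box is not reproduced here, the following compact sketch outlines the standard TD3 training loop using the helper sketches above; the episode count, policy delay d, exploration noise, pre-built env/actor/critics/optimizers, and the omitted numpy-to-tensor conversion are assumptions.

```python
# Compact sketch of the TD3 training loop (Algorithm 1), using the helpers above.
d, step = 2, 0                                      # policy update delay, step counter
for episode in range(10_000):
    s, done = env.reset(), False
    while not done:
        a = noisy_action(actor, s, noise_std=0.1, a_max=98.1)
        s_next, r, done, _ = env.step(a)            # old Gym step API assumed
        buffer.push(s, a, r, s_next, float(done))
        s = s_next
        batch = buffer.sample(64)                   # convert to tensors in practice
        update_critics(critic1, critic2, critic_optim,
                       target_actor, target_critic1, target_critic2, *batch)
        if step % d == 0:                           # delayed actor and target updates
            update_actor(actor, critic1, actor_optim, batch[0])
            for net, tgt in ((actor, target_actor),
                             (critic1, target_critic1),
                             (critic2, target_critic2)):
                soft_update(net, tgt)
        step += 1
```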
4. Simulation and Analysis
In this section, we consider two target maneuvering modes: sinusoidal maneuvering and constant maneuvering, and then test them from three aspects: flight trajectory, normal acceleration, and LOS rate. Finally, we analyze the performance of the three guidance laws from the miss distance.
First, with the target performing a sinusoidal maneuver, the TD3 and DDPG algorithms were used to train the neural networks with the parameters in Tables 1 and 2 for 10,000 episodes. The actor and critic functions are implemented by four-layer fully connected neural networks; critic1 and critic2 use the same architecture, shown in Table 3. Except for the critic output layer, every neuron in the other layers is activated by the ReLU function

$$f(x) = \max(0, x).$$
The output layer of the actor network is activated by the tanh function, defined as

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.$$
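For illustration, the actor and critic described above could be implemented in PyTorch as below; the hidden-layer widths are assumptions, since Table 3 is not reproduced here, and the actor's tanh output is scaled to the $\pm 10g$ action limit.

```python
# Sketch of the four-layer fully connected actor and critic networks.
import torch
import torch.nn as nn

A_MAX = 10.0 * 9.81  # +/-10 g action limit

class Actor(nn.Module):
    def __init__(self, state_dim=1, action_dim=1, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh())  # tanh output in [-1, 1]

    def forward(self, s):
        return A_MAX * self.net(s)                     # scale to the overload limit

class Critic(nn.Module):
    def __init__(self, state_dim=1, action_dim=1, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))                      # linear output (no activation)

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```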
The policy and value functions are periodically updated during optimization after accumulating trajectory rollouts of replay buffer size.
As shown in Figure 6(a), the TD3 algorithm converges at close to 4,000 episodes, while the DDPG algorithm converges at close to 8,000 episodes, and the reward obtained after TD3 stabilizes is higher than that of the DDPG algorithm. Figure 6(b) shows the variation trend of the final miss distance during the training of the two algorithms.

Figure 6: (a) episode reward during training; (b) final miss distance during training.
To test the convergence of the two algorithms, the agent is trained at several learning rates, as presented in Figure 7. The learning rate of the critic network in the two algorithms is twice that of the actor network. From Figure 7, one can note that our guidance algorithm is more stable than the DDPG guidance law; the DDPG algorithm cannot even converge at learning rates between 0.001 and 0.00001.

Figure 7: training curves of the two algorithms at different learning rates ((a)–(d)).
The trained TD3 algorithm is compared with PNG in 2000 random tests. Because the LOS rate converges only when the navigation coefficient $N$ is large enough, the PNG method takes, for each test, the minimum miss distance obtained under three values of the proportionality coefficient $N$. A miss distance of less than 10 m is counted as an effective hit, and a miss distance of less than 2 m as a direct hit.
As shown in Figure 8 and Table 4, over the 2000 tests the RL guidance law based on the TD3 algorithm is much more accurate than PNG, while the difference between it and DDPG is small. For miss distances below 10 m, TD3 scores only slightly higher than PNG, but for miss distances below 2 m the gap is obvious: TD3 succeeds in almost 100% of the tests, whereas PNG succeeds in only about 50%.

We take 4 of these tests and add 2 boundary-value cases, for 6 tests in total. The proportionality coefficient $N$ is fixed across these 6 tests, and the target performs sinusoidal maneuvers. Table 5 lists the 6 sets of initial simulation parameters and compares the final miss distances.
As can be seen from Figure 9, all 6 tests of the RL guidance law show a shorter interception time than PNG. The first test shows that the PNG guidance accuracy has a large error, and the locally enlarged views of the second and fifth tests clearly show that the PNG algorithm did not hit the target.

Finally, take the fifth and sixth tests as examples to compare the LOS rate and normal acceleration. The changes in the LOS rate and normal acceleration of the RL guidance law based on the DDPG algorithm in the fifth test are shown in Figure 10. They exhibit relatively unstable oscillations at the end of guidance, and these oscillations have a considerable impact on the stability of the projectile. It can be seen that the RL-based guidance law implemented with the DDPG algorithm has not achieved a good training effect even after ten thousand training episodes.

Figure 10: (a) LOS rate and (b) normal acceleration of the DDPG-based RL guidance law in the fifth test.
As shown in Figure 10(b), the robustness of the RL-based guidance law implemented in DDPG is poor, so it will not participate in the subsequent comparison of the LOS rate and the normal acceleration.
As shown in Figure 11, the LOS rate of the two algorithms is generally stable in the early stage. However, as mentioned in Section 2.1, PNG’s LOS rate fails to converge in the final trajectory stage when the target presents non-straight motion, and the RL guidance law based on the TD3 algorithm has better convergence than PNG.

Figure 11: (a) the 5th test; (b) the 6th test.
The PNG algorithm follows equation (2), so its normal acceleration diverges at the last moment, and the normal overload required at the terminal stage is too large. In contrast, although the curve of the reinforcement learning guidance law is not as smooth as PNG’s normal acceleration, it does not oscillate as violently as in Figure 10(b), and its normal acceleration converges better than PNG’s at the terminal stage, which is more realistic.
In addition, we also tried a second scenario to see whether the guidance law generalizes well to engagement scenarios not experienced during training. To this end, the fifth and sixth initial state parameters in Table 5 were left unchanged, while the target maneuvering mode was changed to a constant maneuver of 20 m/s².
As shown in Figures 12 and 13, when the target maneuvering mode is changed to constant maneuvering, the performance of both guidance laws improves, but the RL guidance law still outperforms PNG in the terminal stage, and its accuracy remains higher, as presented in Table 6. This demonstrates that the RL guidance law generalizes well.


Figures 12 and 13: (a) the 5th test; (b) the 6th test.
5. Conclusion
We have demonstrated that, by directly mapping the LOS rate to the overload command via a neural network, the resulting guidance strategy performs better than the traditional proportional navigation method when facing a maneuvering target. The simulation results show that (1) the RL guidance law based on TD3 has better accuracy and better convergence of the normal acceleration compared with PNG; (2) when designing a guidance law within the RL framework, the TD3 algorithm outperforms DDPG in terms of robustness; and (3) the guidance law also generalizes well to new engagement scenarios not experienced during training. Furthermore, because the RL algorithm is a model-free framework, it can optimize guidance laws in more complicated environments. Therefore, we believe that designing guidance laws within the RL framework will be an effective way to intercept maneuvering targets in the future. Future work may consider the RL guidance law’s performance in more realistic combat environments that include the acceleration of gravity, aerodynamic forces, thrust, and air humidity.
Nomenclature
MDP: Markov decision process
RL: Reinforcement learning
$s$: State vector in the MDP
$a$: Action vector in the MDP
$\theta$: The parameter of the critic network
$\phi$: The parameter of the actor network
$r(s_t, a_t)$: Reward for being in state $s_t$ when selecting action $a_t$ at time $t$
$R_t$: The cumulative sum of discounted rewards after time $t$
$p(x)$: The probability density associated with the random variable $x$
$\mathbb{E}[x]$: Expectation of $x$, i.e., $\int x\,p(x)\,\mathrm{d}x$
$Q^{\pi}(s_t, a_t)$: The action value for the agent taking action $a_t$ in state $s_t$ at time $t$ and following policy $\pi$ thereafter.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work has been supported by the National Natural Science Foundation of China (Grant Nos. 11472136 and 11402117).