Abstract
A two-loop acceleration autopilot is designed using the twin-delayed deep deterministic policy gradient (TD3) strategy to avoid the tedious design process of conventional tactical missile acceleration autopilots and the difficulty of meeting the performance requirements over the full flight envelope. First, a deep reinforcement learning model for the two-loop autopilot is developed: the flight state information serves as the state, the to-be-designed autopilot control parameters serve as the action, and a reward mechanism based on the stability margin index is designed. The TD3 strategy is then used to learn the control parameters offline for the entire flight envelope, yielding an autopilot control parameter fitting model that can be directly applied to the guidance loop. Finally, the fitting model is combined with the impact angle constraint in the guidance system and verified online. The simulation results demonstrate that the autopilot based on the TD3 strategy can self-adjust the control parameters online based on the real-time flight state, ensuring system stability and achieving accurate acceleration command tracking.
1. Introduction
The autopilot is a critical component of the guidance and control system, capable of producing control forces and torques by driving actuators in response to control commands. Thus, the speed and direction of the missile’s flight can be changed, the stability of the missile’s centroid and attitude can be maintained, and the guided missile can hit the target along the required flight trajectory and attitude [1, 2]. The key to designing an autopilot is identifying the control parameters that meet specific performance indices. Traditional autopilot design methods include pole placement [3, 4], optimal control [5, 6], sliding mode variable structure [7], event-triggered attitude control policy [8], and active disturbance rejection [9]. Due to the diversification of tactical missile combat scenarios and the increasing intelligence of missiles, the design of autopilots must meet more stringent requirements, such as improved maneuverability and stability at large angles of attack and in large-scale flight scenarios [10, 11]. However, traditional autopilot design methods select typical state feature points in the flight envelope to design the control parameters [12]. Consequently, it is difficult to meet the requirements over the entire control domain of the flight envelope.
In recent years, deep learning (DL) and reinforcement learning (RL) have become hot topics in artificial intelligence technology, providing new design options for aircraft guidance and control systems [13–15]. The principle of RL originates from the process of intelligent species learning new things. For a specific task, agents learn through real-time interaction with the external environment and continuous trial and error; ultimately, a task-appropriate action strategy is obtained [16]. Deep reinforcement learning (DRL) combines the autonomous decision-making capability of RL with the feature extraction capability of deep neural networks (DNNs) and has demonstrated outstanding performance in high-dimensional data control problems. Unlike conventional control methods, DRL is less dependent on models and more transferable [17].
To solve the strong coupling between the adaptive updating loop and the strict-feedback control loop in traditional control, a low-frequency learning structure consisting of a low-pass filter, a state estimator, and a multilayer perceptron (MLP) neural network was proposed in [18], which reduced the computational complexity and effectively avoided the update explosion problem. Literature [19] designed a two-loop autopilot based on a missile longitudinal channel model under the RL principle and solved the quadratic optimal tracking control problem with an actor-critic structure; the results demonstrated that the designed controller had excellent tracking and dynamic performance. Different from literature [19], literature [20] transformed the aircraft control procedure into a Markov decision process and, utilizing the deep deterministic policy gradient (DDPG) algorithm [21], iteratively determined the optimal control parameters of a PID controller; the resulting PID controller exhibited better control performance and robustness than an LQR controller. Similar to literature [20], literature [22] developed a control parameter design method for acceleration autopilots using the policy gradient algorithm [23]. To reduce the time and workload of identifying model information during the autopilot design process, literature [24] developed a learning-based design method for UAV autopilots using the DDPG algorithm by designing appropriate observation and reward functions. In addition, literature [25] compared a PID neural network controller with a DDPG controller and provided a conceptual proof that reinforcement learning can effectively solve the adaptive optimal control problem of nonlinear dynamic systems. However, most of the methods in [19, 20, 22, 24] and [25] were designed and analyzed for nonglobal profile states rather than the entire flight envelope, which led to insufficient consideration of global constraints and performance indicators. When there was great uncertainty in the flight environment, the autonomy and robustness of the nominal trajectory tracking guidance mode were poor.
To ensure that the missile can be well controlled in the entire flight envelope, literature [26] used wavelet analysis to monitor the attitude control stability of incremental RL online and then adaptively modified the learning rate of the RL algorithm based on the gradient descent principle, which effectively enhanced the aircraft’s control stability under large-scale dynamic changes. Literature [27] remodeled the autopilot in the RL framework and trained a two-degree-of-freedom- (DOF-) linearized missile dynamic model using the DDPG algorithm. By taking into account different flight conditions and uncertainties in the aerodynamic coefficients, literature [27] validated the ability of the designed controller to maintain closed-loop stability in the presence of model uncertainty. Literature [28] developed an aircraft control method using the DDPG algorithm, and complete flight control was achieved by controlling the position, speed, and attitude angle; by adding a PD controller, the stability in the early stage of training was effectively improved. Literature [29] proposed a model-free coupled dynamic controller design method for jet aircraft capable of withstanding multiple types of faults; after offline training, adequate results were achieved under highly coupled maneuvers, and the controller was robust to various failure situations. To overcome the reliance on global position data in hostile environments where GPS is attacked or disrupted, literature [30] designed a new GPS-free cooperative elliptical circling controller, which not only reduced the energy consumption but also removed the dependency on global position when multiple nonholonomic vehicles moved together. Literature [31] fixed the autopilot structure as a typical three-loop autopilot and used the state-of-the-art deep deterministic policy gradient algorithm to learn an action policy that maps the observed states to the autopilot gains. Influenced by literature [31], literature [32] used the pole placement algorithm of a three-loop autopilot as the foundation and designed an intelligent training method that converts three-dimensional control parameters into one-dimensional design parameters using the proximal policy optimization algorithm; the simulation results demonstrated a good control effect. However, these design methods still suffered from problems such as poor stability and low strategy learning efficiency. How to improve the practicability of deep reinforcement learning methods in the field of autopilot design has become an urgent problem to be solved.
Motivated by the previous investigations, this paper employs the DRL principle to design a parameter tuning model for a two-loop autopilot using the TD3 algorithm [33], which has the following striking advantages over existing autopilot design schemes: (1) In contrast to the previous alternatives [19, 20, 22, 24, 25] based on nonglobal profile states, the proposed method takes the state space of the entire flight envelope as the research object. The TD3 algorithm is used to learn the control parameters offline for the entire flight envelope, which ensures that the autopilot performs well throughout the flight envelope. In addition, after offline training, a fitting model with the real-time flight state of the aircraft as the input and the autopilot parameters as the output is obtained through the policy network, which enables rapid online tuning under large-scale flight state changes. (2) Different from the reward design mechanism in literatures [26–29] and [31], this paper considers the stability margin when designing the reward mechanism to ensure that the autopilot has the desired margin space in the entire flight envelope, which gives the autopilot stronger robustness. In addition, compared to [31, 32], which randomly selected states during training, this paper arranges the flight states in an orderly manner and then samples them sequentially, which improves the strategy learning efficiency while ensuring the convergence of the algorithm.
The remainder of this paper is organized as follows. The problem formulation is stated in Section 2. The two-loop acceleration autopilot design method based on the TD3 strategy is given in Section 3. Section 4 presents the simulation analysis of the proposed method. The conclusions of this paper are presented in Section 5.
2. Problem Description
2.1. Two-Loop Autopilot Model
This paper uses an air-to-ground guided missile as the research object. The missile model is illustrated in Figure 1, where represents the velocity, represent the lift, resistance, and gravity, respectively, is the pitching moment, denotes the ballistic inclination, is the pitch angle, represents the angle of attack (AOA), is the elevator deflection angle, denotes the reference length, is the projectile diameter, is the pressure center position of the whole missile, and is the centroid position.

The expressions of aerodynamic force, moment, and gravity are as follows: where , , and denote the resistance coefficient, lift coefficient, and pitching moment coefficient, respectively, denotes the reference area, represents the incoming flow pressure (, where represents the air density at the flying altitude of the missile), is the missile mass, and is the gravitational acceleration.
The projectile dynamics at the longitudinal plane can be described as follows: where represents the pitch angle rate, is the axial thrust of the missile, represents the derivative of the lift coefficient to the AOA, denotes the derivative of the lift coefficient to the elevator deflection angle, represents the derivative of the pitching moment coefficient to the AOA, represents the derivative of the pitching moment coefficient to the dimensionless pitch angle rate, denotes the derivative of the pitching moment coefficient to the elevator deflection angle, is the pitching rotational inertia, and , , and represent the aerodynamic noise interference. The aerodynamic noise obeys the following bounded Gaussian distribution: where is the upper bound of noise, represents the variance of the Gaussian distribution , and the clip function is defined as follows:
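As an illustration of the bounded Gaussian noise model described above, the following Python sketch samples zero-mean Gaussian disturbances and clips them to a stated bound; the standard deviation and bound values are placeholders, not values taken from the paper.

```python
import numpy as np

def bounded_gaussian_noise(sigma, bound, size=None, rng=None):
    """Sample zero-mean Gaussian noise and clip it to [-bound, bound],
    matching the bounded-noise model described above."""
    rng = np.random.default_rng() if rng is None else rng
    sample = rng.normal(loc=0.0, scale=sigma, size=size)
    return np.clip(sample, -bound, bound)

# Example: aerodynamic disturbances on lift, drag, and pitching moment
# (sigma and bound values here are illustrative placeholders).
d_lift, d_drag, d_moment = bounded_gaussian_noise(sigma=0.05, bound=0.1, size=3)
```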
The research object of this paper is a thrust-free missile; thus, . Figure 2 shows the two-loop acceleration autopilot model composed of an accelerometer and a velocity gyroscope [34]. In Figure 2, denotes the acceleration command based on the guidance command, is the actuator command with the actuator dynamic delay being temporarily ignored, and and are the control parameters to be designed.

The geometric relationship between the two-dimensional missile plane and the target is illustrated in Figure 3, where represents the missile position, is the target position, denotes the line-of-sight angle of the missile and the target, is the lead angle, and is the relative distance between missile and target.

For stationary targets,
It is well-known that the dynamics of the airframe change with the flight state during the flight process. To maintain control stability and accurate acceleration tracking throughout the entire guidance procedure, the control parameters must be continuously adjusted based on the flight state. A missile guidance control model was designed in combination with the impact angle constraint guidance (Figure 4). In this paper, a two-loop autopilot with self-adjustable control parameters based on the model depicted in Figure 4 will be developed.

2.2. Autopilot Parameter Design Method and Online Application Framework
The key to the design of the autopilot is to ensure that the control parameters and meet the expected performance indices of the control system, so that the aircraft can accurately and robustly track the guidance acceleration command, speed up the response, and improve the damping of the missile. Among the frequency domain indices of the system, the amplitude (gain) margin and phase margin are often used as important indices to assess the control system performance. In general, the amplitude and phase margins of the autopilot should not be less than 6.5 dB and 40°, respectively. Therefore, in this paper, the amplitude margin and phase margin are selected as the performance indices that must be satisfied during the design of the two-loop autopilot. According to Figure 2, the conventional open-loop system is the transfer function from “in” to “out” with the loop broken at the actuator. As a result, and can be calculated as follows: where represents the phase crossover frequency, meeting , and is the cut-off frequency, meeting .
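For illustration, this margin check can be carried out numerically with the python-control package; the transfer function below is only a placeholder standing in for the autopilot open-loop model, and the design thresholds are the 6.5 dB and 40° values quoted above.

```python
import numpy as np
import control  # python-control package

# Placeholder open-loop transfer function for the loop broken at the actuator;
# the coefficients are illustrative, not the paper's model.
G_open = control.TransferFunction([120.0], [1.0, 8.0, 60.0, 0.0])

# control.margin returns the gain margin (as a ratio), the phase margin (deg),
# and the corresponding crossover frequencies (rad/s).
gm, pm, w_pc, w_gc = control.margin(G_open)
gm_db = 20.0 * np.log10(gm)

# Check against the design thresholds quoted above.
meets_spec = (gm_db >= 6.5) and (pm >= 40.0)
print(f"Gain margin: {gm_db:.1f} dB, phase margin: {pm:.1f} deg, ok: {meets_spec}")
```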
In this paper, the guidance control model described in Section 2.1 is recast as an RL problem. The TD3 algorithm is utilized to learn offline the autopilot control parameters for each flight state across the entire flight envelope. At the same time, an MLP neural network is used in the TD3 algorithm model to fit the nonlinear relationship between different flight states and the control parameters. As depicted in Figure 5, the MLP fitting model can be directly applied to the missile guidance control loop, and the corresponding autopilot control parameters can be self-adjusted online during flight based on the real-time flight state.

3. DRL Design Method for the Autopilot
3.1. Markov Decision Process
The flight state of the missile at two adjacent moments can be approximately regarded as a transfer between states under a given control command, and the state at the next moment is related only to the state at the current moment; therefore, the flight state of the missile has the Markov property. By discretizing the flight process in the time dimension, the guidance process of the missile can be approximately modeled as a discrete-time Markov chain (DTMC). When the guidance law is determined, the control parameters determine whether the autopilot can follow the guidance command. Therefore, designing the autopilot amounts to adding decision-making commands to the Markov chain of the missile flight process, so the design process of the missile autopilot can be modeled as a Markov decision process (MDP).
An MDP is a sequential decision process that can be described by a 5-tuple consisting of the state space, the action space, the state transition function, the reward function, and the discount factor.
In this paper, the agent cannot directly access the transition function and the reward function but can only obtain specific information about the state , action , and reward by interacting with the environment. The interaction process between the agent and the environment in the MDP is depicted in Figure 6.

The autopilot design problem is a continuous control problem. In each episode, the agent observes the state at each time step and decides the action to be taken according to the current strategy, where the strategy is the mapping from states to actions. Under this action, the autopilot model transitions to the next state and receives a reward from the environment. The trajectory generated by the control loop from the initial state to the termination state is expressed as .
Formalizing the optimization objective, the return is defined as the discounted sum of all rewards in the trajectory: where represents the discount factor and represents the number of states in a scene data trajectory. The objective function is defined as the expectation of the trajectory return:
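For concreteness, the following is a minimal Python sketch of the discounted return and a Monte-Carlo estimate of the objective; the discount factor value is a placeholder.

```python
from typing import Sequence

def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Discounted sum of the rewards collected along one trajectory."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def objective_estimate(trajectories: Sequence[Sequence[float]], gamma: float = 0.99) -> float:
    """Monte-Carlo estimate of the objective: the expected trajectory return."""
    return sum(discounted_return(tau, gamma) for tau in trajectories) / len(trajectories)
```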
With defined as the state action-value function of , the Bellman equation of the state action-value function can be expressed as
Throughout the interaction between agent and environment, the optimal policy is continuously updated to maximize the state action-value function.
The Bellman optimality equation of the state action-value function can be obtained from the Bellman optimality principle:
3.2. TD3 Algorithm
The TD3 algorithm is an off-policy algorithm for continuous action spaces. Building on the DDPG algorithm, it improves both the actor- and critic-networks, thereby resolving the critic-network's overestimation of the value and enhancing the algorithm's stability. To increase the agent's ability to explore the environment, the TD3 algorithm chooses actions based on the current policy plus exploration noise: where represents the actor-network parameter, is the lower limit of the action space, denotes the upper limit of the action space, and is the exploration noise, which obeys a Gaussian distribution: . In addition, to ensure convergence in the later stage of the algorithm, the variance of the Gaussian distribution decays as follows: , where is the attenuation factor.
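A minimal Python sketch of this exploration rule is given below; the action bounds, noise scale, and decay factor are assumed placeholder values.

```python
import numpy as np

def explore_action(actor, state, sigma, a_low, a_high, rng):
    """Select an action from the current policy plus clipped Gaussian exploration noise."""
    a = actor(state)                                   # deterministic policy output
    noise = rng.normal(0.0, sigma, size=np.shape(a))   # exploration noise ~ N(0, sigma^2)
    return np.clip(a + noise, a_low, a_high)           # keep the action inside the action space

# Variance decay to aid late-stage convergence (decay factor is a placeholder).
sigma, decay = 0.2, 0.999
for episode in range(1000):
    sigma *= decay
```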
The TD3 algorithm adopts the experience replay mechanism. The agent stores the exploration data in the experience buffer pool and then randomly selects a mini-batch of data from it to train the networks and update their parameters. To solve the overestimation problem of the value, the TD3 algorithm uses two critic-networks to represent different values and estimates the target value using the following equation: where represents the parameters of the target critic-networks and is expressed as follows: where represents the parameter of the target actor-network. Random noise is introduced to enhance the stability of the target policy. The noise follows an independent distribution controlled by parameters different from those of the exploration noise , i.e., , where is the upper limit of the noise.
The TD3 algorithm updates the critic-network parameter by minimizing the temporal difference (TD) error: where is the critic-network parameter.
To reduce the number of incorrect updates and improve the algorithm’s stability, the TD3 algorithm does not update the actor-network until the value becomes stable. Therefore, the actor-network in the TD3 algorithm has a slightly lower update frequency than the critic-network. Through a deterministic policy gradient, the TD3 algorithm updates the actor-network parameters.
To ensure the stability of training the neural network, the soft update strategy is adopted for the target network parameters: where is a smoothing constant that represents the update speed. The pseudocode of the TD3 algorithm is given in Algorithm 1.
Algorithm 1: Pseudocode of the TD3 algorithm.
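As an illustration of the updates described in this subsection, the following PyTorch sketch performs one TD3 update step with target policy smoothing, clipped double-Q targets, a delayed actor update, and soft target updates. It is not the authors' implementation; the network objects, optimizers, mini-batch format, and hyperparameter values are assumptions.

```python
import torch
import torch.nn.functional as F

def td3_update(batch, actor, actor_t, critic1, critic2, critic1_t, critic2_t,
               actor_opt, critic_opt, step, gamma=0.99, tau=0.005,
               policy_noise=0.2, noise_clip=0.5, policy_delay=2,
               a_low=-1.0, a_high=1.0):
    """One TD3 update from a sampled mini-batch (s, a, r, s2, done)."""
    s, a, r, s2, done = batch

    with torch.no_grad():
        # Target policy smoothing: add clipped noise to the target action.
        noise = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a2 = (actor_t(s2) + noise).clamp(a_low, a_high)
        # Clipped double-Q: take the smaller of the two target critic estimates.
        q_target = torch.min(critic1_t(s2, a2), critic2_t(s2, a2))
        y = r + gamma * (1.0 - done) * q_target

    # Critic update: minimize the TD error of both critics.
    critic_loss = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed actor update and soft target update every `policy_delay` steps.
    if step % policy_delay == 0:
        actor_loss = -critic1(s, actor(s)).mean()   # deterministic policy gradient
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        # Soft update of all target networks with smoothing constant tau.
        for net, net_t in ((actor, actor_t), (critic1, critic1_t), (critic2, critic2_t)):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```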
3.3. RL Model for Autopilot Design
To use the TD3 algorithm to solve the problem of two-loop autopilot parameter design, it is necessary to transform the autopilot design process into an RL problem and to design the various components of the RL model, i.e., state, action, and reward. At the same time, the algorithm’s network structure and hyperparameters must be determined to ensure its effectiveness.
The design of the two-loop autopilot requires information about the flight state feature points over the full flight envelope. The Mach number and the AOA affect the aerodynamic characteristics, and the flight altitude and Mach number affect the dynamic pressure, so the state vector is designed as , where represents the Mach number and represents the flight altitude. The objective of the two-loop autopilot design is that the control parameters satisfy the performance index requirements; thus, the action vector is designed as .
The reward signal is the objective that must be maximized by the agent. A reasonable reward mechanism is crucial for training effectiveness. The frequency domain stability margin of the control system is selected as the performance index, and the following reward function is designed: where is the expected magnitude margin, represents the expected phase margin, and and are the influence weights of the magnitude and phase margins on the reward, respectively, meeting , , and .
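Since the reward expression itself is not reproduced here, the following is only a minimal Python sketch of a stability-margin-based reward in the spirit described above; the functional form, reference margins, and weights are assumptions.

```python
def margin_reward(gm_db, pm_deg, gm_ref=10.0, pm_ref=50.0, w_gm=0.5, w_pm=0.5):
    """Reward that penalizes deviation of the gain and phase margins from
    their expected values (functional form and weights are illustrative)."""
    return -(w_gm * abs(gm_db - gm_ref) + w_pm * abs(pm_deg - pm_ref))
```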
As shown in Figure 7, both the actor- and critic-networks of the TD3 algorithm employ an MLP neural network with four layers, and an activation function follows the neurons of every layer except the input layer. To prevent control input saturation, the actor-network of the TD3 algorithm is activated by the tanh function:

The critic-network is activated by the ReLU function:
The dimensions of each layer of the actor- and critic-networks are defined in Table 1.
The inputs of the actor- and critic-networks include state vectors. The state vectors must be normalized to eliminate the effect of input data dimensions on the neural network training procedure. In this paper, the state vectors were processed by (0,1) normalization: where , , and are the normalized network input values, in which , , and .
Since the tanh activation function limits the output of the actor-network to the range (-1, 1), the output value of the actor-network needs to be denormalized to obtain the action vector value: where and are the output values of the actor-network, are the maximum and minimum values of the control parameters, respectively, and are the maximum and minimum values of the control parameters, respectively.
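A minimal Python sketch of these normalization and denormalization steps is given below; the state and control-parameter bounds are placeholders, not the paper's actual values.

```python
import numpy as np

def normalize_state(state, s_min, s_max):
    """Min-max normalize each state component to (0, 1)."""
    state, s_min, s_max = map(np.asarray, (state, s_min, s_max))
    return (state - s_min) / (s_max - s_min)

def denormalize_action(a_net, a_min, a_max):
    """Map the tanh-bounded actor output in (-1, 1) back to the control-parameter range."""
    a_net, a_min, a_max = map(np.asarray, (a_net, a_min, a_max))
    return a_min + 0.5 * (a_net + 1.0) * (a_max - a_min)

# Example with placeholder bounds for (Ma, H, alpha) and the two control gains.
s_norm = normalize_state([0.8, 3000.0, 4.0],
                         s_min=[0.4, 500.0, -10.0], s_max=[1.2, 10000.0, 10.0])
gains = denormalize_action([0.2, -0.5], a_min=[0.1, 0.05], a_max=[5.0, 2.0])
```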
To prevent the overfitting problem, both the actor- and critic-networks are trained by the Adam optimizer with regularization. The hyperparameter settings significantly affect the performance of the TD3 algorithm. Table 2 shows the hyperparameters suitable for the application scope of this paper.
3.4. Offline Training Framework
Before training, computational fluid dynamics (CFD) software was used to calculate the missile’s aerodynamic parameters throughout its entire flight envelope, and a library of aerodynamic parameters was compiled.
In the offline training process, the flight speed can be obtained using the following atmospheric model: where represents the speed of sound and is the atmospheric density at sea level. The relationship between the temperature and altitude in the troposphere can be approximately expressed as where is the reference temperature at sea level.
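A minimal sketch of this conversion from Mach number and altitude to flight speed is given below, assuming the standard-atmosphere sea-level temperature and tropospheric lapse rate; these constants are standard ISA values, not values quoted from the paper.

```python
import numpy as np

GAMMA_AIR = 1.4        # ratio of specific heats for air
R_AIR = 287.05         # specific gas constant for air, J/(kg*K)
T0 = 288.15            # sea-level reference temperature, K (ISA assumption)
LAPSE_RATE = 0.0065    # tropospheric temperature lapse rate, K/m (ISA assumption)

def troposphere_temperature(h_m):
    """Approximate temperature at altitude h (troposphere only)."""
    return T0 - LAPSE_RATE * h_m

def flight_speed(mach, h_m):
    """Convert Mach number and altitude to flight speed via the local speed of sound."""
    a = np.sqrt(GAMMA_AIR * R_AIR * troposphere_temperature(h_m))
    return mach * a

v = flight_speed(mach=0.8, h_m=3000.0)   # example state from the flight envelope
```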
Firstly, the actor-network and the critic-network are initialized. Then, the experiment begins: the initial state is selected and used to look up the corresponding aerodynamic parameters in the aerodynamic parameter library, and the dynamic coefficients of the two-loop autopilot are computed. In this paper, the thrust , so the dynamic coefficients are defined as follows: , , , , and .
According to the structure of the two-loop autopilot, the open-loop transfer function can be derived as follows [34]: where and . , , and are expressed as follows:
At the same time, the flight state determines the control parameters and through the actor-network and the exploration noise, yielding the action . Then, the system stability margins are calculated according to Equation (6) and Equation (26), and the real-time reward is calculated according to Equation (19).
It is worth noting that the offline design of the autopilot parameters over the whole flight envelope is not a traditional sequential decision-making problem; rather, it can be regarded as the problem of obtaining the optimal control parameter values over the flight envelope. Therefore, how the next flight state is obtained in the interaction with the environment needs to be designed. After several tests, it was determined that, compared to random sampling for training, arranging the flight states in order and sampling them sequentially improves the efficiency of the TD3 algorithm and the convergence rate of the reward function. Therefore, in this study, all flight states in the aerodynamic parameter library are sorted sequentially. The ranking rule is to fix first and then determine and in turn. The next flight state is determined according to the ranking rule.
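A minimal Python sketch of one such ordering and sequential sampling scheme is shown below; the grid ranges and step sizes are placeholders, and the variable fixed first is assumed here to be the Mach number.

```python
import itertools
import numpy as np

# Placeholder grids over the flight envelope (values are illustrative).
mach_grid = np.arange(0.4, 1.21, 0.1)
height_grid = np.arange(500.0, 10001.0, 500.0)
alpha_grid = np.arange(-10.0, 10.1, 2.0)

# Fix the ranking rule: iterate Mach number first, then altitude, then AOA,
# and step through the resulting states sequentially instead of at random.
ordered_states = list(itertools.product(mach_grid, height_grid, alpha_grid))

def next_state(index):
    """Return the next flight state in the fixed ordering (wrapping around)."""
    return ordered_states[index % len(ordered_states)]
```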
The explored data are stored in the experience buffer pool . Then, a mini-batch of data is selected from the experience buffer pool to train the networks according to the TD3 algorithm. It should be noted that this sampling method does not impair the neural network’s generalization ability by letting the model simply “remember” the order of the samples, because the TD3 algorithm employs the experience replay mechanism, and the random selection of data from the experience pool during training breaks the correlation between sequences. The offline training framework for the two-loop autopilot parameter design based on the TD3 algorithm is illustrated in Figure 8.

4. Simulation Analysis
The sample space for the two-loop autopilot parameter training process was determined based on the application scenario of the missile selected in this project. Subsequently, the TD3 algorithm was used to train the model offline, and the training results were analyzed. Finally, in conjunction with the impact angle constraint guidance problem, the neural network fitting model obtained after training was directly implemented in the guidance control loop for trajectory simulation to demonstrate its performance.
4.1. Autopilot Design Simulation Based on the TD3 Algorithm
First, the design state space of the autopilot parameters is determined: , , , and . Then, the design action space of the autopilot parameters is set as follows: , , and .
Subsequently, the expected performance indices are selected, , , , and , and the hyperparameters are set according to Table 2 for offline training. The change process of the episode cumulative reward during the offline training process is exhibited in Figure 9.

It can be seen that, compared with random sampling, sequential sampling improved the strategy learning efficiency of the algorithm, and the stable value of the cumulative reward was higher than that of random sampling. This is because random sampling during training removes the correlation between state transitions, so the agent cannot learn how the state transfers; as a result, the computed value differs greatly from one calculation to the next, and a stable value is difficult to obtain. With sequential sampling, however, the state transition is deterministic, which greatly reduces the learning difficulty of the agent and enables faster convergence of the cumulative reward. In the early stage of training, the cumulative reward value was small, with relatively large fluctuations, because actions were selected randomly within the set range. As the training progressed, the agent gradually chose better actions; consequently, the cumulative reward value gradually increased, and the fluctuations decreased. After 600 training episodes, the cumulative reward value of sequential sampling was greater than -60 but still exhibited an upward trend. When the training reached 1550 episodes, the cumulative reward value stabilized at approximately -9.9, with a small fluctuation range. It can be considered that the training process achieved the desired results.
The design results of three characteristic points in the flight envelope are selected for analysis.
The feature points 1 and 2 in Table 3 are selected according to the predicted trajectory. To verify the generalization ability of the algorithm, feature point 3 with a negative AOA is selected for analysis. Table 4 lists the design results and stability margin of the TD3 algorithm for the autopilot control parameters at the feature points.
According to Table 4, the autopilot design results at the feature points met the design requirements of the amplitude and phase margins. Subsequently, the pole placement [27] method is used to determine the autopilot control parameters at the feature points, the damping coefficient is taken as 0.7, and the natural frequency is taken as 35 rad/s. The design results are listed in Table 5.
The autopilot actuator adopts the second-order actuator model: where represents the actuator natural frequency and is the damping coefficient.
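For illustration, the standard second-order actuator transfer function can be built and its step response inspected with the python-control package; the form below is the conventional second-order model, and the natural frequency and damping values are placeholders rather than the paper's settings.

```python
import control
import numpy as np

def actuator_tf(omega_a, zeta_a):
    """Standard second-order actuator model: omega^2 / (s^2 + 2*zeta*omega*s + omega^2).
    The parameter values used below are illustrative placeholders."""
    return control.TransferFunction([omega_a**2],
                                    [1.0, 2.0 * zeta_a * omega_a, omega_a**2])

G_act = actuator_tf(omega_a=150.0, zeta_a=0.65)
t, y = control.step_response(G_act, T=np.linspace(0.0, 0.2, 500))
```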
The design results of the control parameters at the feature points are incorporated into the two-loop autopilot model, and Figure 10 depicts the step response curves. According to the simulation results, the performance of the autopilot designed with the TD3 algorithm was superior to that obtained by the conventional pole placement method. Under the RL framework, the autopilot designed with the TD3 algorithm can obtain, through autonomous learning, control parameters that satisfy the desired performance indices, and it places more reasonable demands on the actuator.

(a) Autopilot step response curve at the feature points

(b) Actuator response curve at the feature points
4.2. Online Application of the Autopilot Parameter Fitting Model
After offline training of the TD3 algorithm, the MLP neural network fitting model, with the flight state as the input and the autopilot control parameters as the output over the full flight envelope, is obtained from the actor-network. This model can be directly incorporated into the guidance control loop to automatically adjust the autopilot control parameters based on the current flight state. To further validate the effectiveness and adaptability of the proposed method, a flight experiment was designed around the impact-angle-constrained guidance and control problem. The three-DOF equations of motion in the vertical plane from Ref. [35] are chosen as the model for the flight simulation. To meet the terminal impact angle constraint, proportional navigation guidance with a bias term is adopted: where represents the control term to be designed and is the proportional coefficient, for which . The following relationship can be easily obtained:
Assuming that is the end time of guidance, Equation (30) can be integrated on the interval , and the following relationship can be obtained:
Then, is set to a specified impact angle, and it must meet
From Equations (31) and (32), the following relationship can be deduced:
Consequently, and can be obtained. It is known that, as long as converges to zero during the guidance process, the impact angle requirements can be met. Therefore, the following equation can be formulated [36]: where is an adjustable parameter, . Based on Equation (33), the following relationship can be obtained:
The offset proportional guidance relationship with the impact angle constraint can be obtained by combining Equations (34) and (35):
Based on the structure of the two-loop autopilot, the expression for the actuator command can be deduced as follows:
The initial missile position is set to (X0 = 0 m, Y0 = 500 m), the initial pitch angle to ϑ0 = 0°, and the initial velocity to V0 = 200 m/s. The target position is set to (Xt = 2000 m, Yt = 0 m), with Ny = 3 and K = 3, and the acceleration command is constrained to [-50 m/s2, 50 m/s2]. The desired impact angles are set to -15°, -45°, and -80°, respectively. The simulation results are exhibited in Figure 11.

(a) Flight trajectory curve

(b) Speed change curve

(c) AOA curve

(d) Trajectory inclination curve

(e) Elevator deflection variation curve

(f) Acceleration tracking curve
According to the simulation results, the missile consistently strikes the target at the preset impact angle. Figure 11(f) illustrates that, during flight, the control parameters adjusted online by the neural network fitting model according to the real-time flight state achieve effective acceleration tracking. This indicates that the autopilot parameter fitting model trained by the TD3 algorithm is robust and capable of completing the specified flight task.
Figure 12 depicts the change processes of the control parameters and stability margins.

(a) variation curves

(b) variation curves

(c) Amplitude margin variation curve

(d) Phase margin variation curve
Figure 12 demonstrates that the autopilot parameter fitting model trained by the TD3 algorithm can self-adjust the autopilot control parameters online based on the real-time flight state, ensuring that the amplitude margin and phase margin of the control system remain within the expected range. In addition, the model generalizes well and can be implemented in the guidance control loop to achieve acceleration command tracking.
5. Conclusions
In this paper, the TD3 algorithm is used to design a two-loop autopilot for the full flight envelope that can be deployed directly in the guidance control loop, allowing online self-adjustment of the control parameters. First, a deep reinforcement learning model is constructed, with the flight state taken as the state and the control parameters to be designed taken as the actions; a reasonable reward mechanism is designed according to the amplitude and phase margins, the frequency domain indices of the control system. The TD3 algorithm is then used to learn the control parameters offline over the complete flight envelope, and an MLP neural network fitting model is obtained. Finally, a verification study of the designed autopilot is conducted in combination with the impact angle constraint guidance problem. The results indicate that the TD3-based autopilot satisfies the performance index requirements. The fitting model can self-adjust the control parameters of the autopilot based on the real-time flight state, ensuring that the control system's stability margins remain within the expected range and that the acceleration is tracked accurately. In addition, the fitting model has strong generalization capability and robustness, overcoming the difficulty that a conventional overload autopilot with fixed parameters cannot meet the performance requirements over the entire flight envelope. The method proposed in this paper for designing two-loop autopilots based on the TD3 algorithm is also applicable to the design of three-loop autopilots.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this article.
Acknowledgments
This work is supported in part by the National Key Research and Development Program under Grant No. 2020YFC1511705, the National Natural Science Foundation of China under Grant No. 61801032, and the Project of Construction and Support for High-Level Innovative Teams of Beijing Municipal Institutions (BPHR20220123).