Abstract

An increasing share of renewable energy sources is being integrated into novel power systems. The randomness and fluctuation of these renewable energy sources pose challenges to the static stability and safety analysis of novel power systems. In this work, a multilayer deep deterministic policy gradient (MDDPG) is proposed to address the fluctuation of renewable energy sources. The proposed method stacks multiple deep reinforcement learning agents that can be continuously updated online. The proposed MDDPG is compared with other deep learning algorithms, and its feasibility, effectiveness, and superiority are verified by numerical simulations.

1. Introduction

More and more countries are joining carbon-peaking and carbon-neutrality programs [1]. Building a new type of power system dominated by renewable energy, or even a 100% renewable energy power system, has become imperative [2, 3]. Although the carbon emissions of renewable energy are very small, the dispatch center of the power system must cope with the dispatch difficulties caused by the fluctuation of renewable energy [4]. Some developed countries have already had to restart their thermal power generators to meet the electricity demand of residential and industrial activities [5].

An increasing number of sensors have been installed in novel power systems [6]. These sensors deliver a huge amount of data to the dispatch center [7]. Current data-processing methods are still far from adequate to fully utilize the data that the dispatch center receives. A variety of methods that do not rely on traditional models, i.e., data-driven methods, are therefore being introduced into novel power systems [8].

Currently, data-driven approaches almost inevitably involve deep learning methods [9, 10]. Deep learning methods can be categorized as convolutional neural networks [8], deep neural networks [11], deep reinforcement learning [12], and deep forest algorithms [13]. By task, deep learning methods can also be categorized into classification, prediction, and control algorithms [14]. Deep reinforcement learning is a control method. In this work, deep reinforcement learning is applied to solve the safety and stability control problems of novel power systems.

Deep reinforcement learning developed from reinforcement learning. Reinforcement learning methods have evolved from trained look-up tables to actor-critic architectures built from deep neural networks or convolutional neural networks [15]. Although it contains deep or convolutional neural networks internally, deep reinforcement learning remains, in essence, a control- or policy-based method [16].

The deep deterministic policy gradient (DDPG) method is an actor-critic deep reinforcement learning method that has been applied very effectively [17]. For example, DDPG can achieve low energy costs in peer-to-peer energy trading [18]. In general, a well-trained deep neural network can represent the control system signal at a specific time scale [19]. Therefore, to obtain better control performance, this study proposes an actor-critic deep reinforcement learning method capable of observing control signals at multiple time scales. DDPG is chosen as the base control method mainly because (1) the output of DDPG is a deterministic policy, whereas proximal policy optimization (PPO) outputs a probability distribution; and (2) the critic in PPO outputs a state-value function whose input is only the state, whereas the critic in DDPG outputs an action-state value similar to deep Q-networks (DQN), so its input also contains the action. The characteristics of different types of deep reinforcement learning methods are given in Table 1.

Recently, numerous deep reinforcement learning techniques have been combined to achieve better control performance in more complex scenarios. For example, traditional controllers + deep reinforcement learning [4], modal decomposition + generative adversarial networks [12], and twin-delayed DDPG + DDPG [20] have been combined to address the frequency control problems of novel power systems; Markov chains and isoprobabilistic transformation have been combined for capacitor planning [21]. Overall, the primary contributions of this work are summarized as follows:
(1) This work proposes a deep reinforcement learning method based on multilayer DDPG (MDDPG). The proposed MDDPG can represent and predict control signals at multiple time scales using multiple deep neural networks. The proposed MDDPG observes more state variables of the control system and thus has a stronger ability to respond to stochastic perturbations.
(2) This work is the first application of an actor-critic deep reinforcement learning method to the stability control of novel power systems. The proposed MDDPG can obtain better power system stability control performance through representation and observation at multiple time scales.
(3) The MDDPG proposed in this study is also a framework for combining multiple deep reinforcement learning agents. The proposed framework can integrate deep reinforcement learning methods with different characteristics or different parameters.

The rest of this paper is organized as follows. Section 2 presents the rotor angle stability control model of novel power systems; Section 3 describes the proposed MDDPG method; Section 4 presents the simulation results; and Section 5 concludes this study.

2. Model of Rotor Angle Stability Control of Novel Power Systems

The modeling process of angle stability control of novel power systems is described in this section.

2.1. Model of Single-Machine Infinite Bus System

A single-machine infinite bus system is one of the simplest and most basic configurations in power system analysis; the infinite bus is a source of infinite capacity with constant voltage and constant frequency [22]. A classical generator model is shown in Figure 1, which identifies the infinite-bus voltage, the generator terminal voltage, the transient reactance, the reactance of the external network, the angle by which the generator internal voltage leads the infinite-bus voltage, and the reference phasor. If the system is affected by a disturbance, this angle changes.

The stator current is obtained as

If the stator resistance is neglected, the air-gap power is equal to the terminal power. Expressed in per unit, the air-gap torque is equal to the air-gap power. Then,

Substituting the initial operating conditions, equation (2) is linearized.
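For reference, in the standard classical-model notation assumed here ($E'$ is the voltage behind the transient reactance $X'_d$, $E_B$ is the infinite-bus voltage, $X_T = X'_d + X_E$ is the total reactance, and $\delta$ is the rotor angle), the per-unit power-angle relation and its linearization about the initial angle $\delta_0$ read

\[
T_e \approx P = \frac{E' E_B}{X_T}\sin\delta, \qquad
\Delta T_e = \left(\frac{E' E_B}{X_T}\cos\delta_0\right)\Delta\delta = K_S\,\Delta\delta,
\]

where $K_S$ is the synchronizing torque coefficient.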

The per-unit equations of motion of the synchronous generator, for the rotor speed deviation and the rotor angle, respectively, are given next.
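In the standard small-signal notation assumed here ($H$ is the inertia constant, $K_D$ the damping torque coefficient, $\Delta\omega_r$ the per-unit rotor speed deviation, $\delta$ the rotor angle, and $\omega_0$ the rated angular speed), these equations take the usual form

\[
\frac{d\Delta\omega_r}{dt} = \frac{1}{2H}\left(T_m - T_e - K_D\,\Delta\omega_r\right), \qquad
\frac{d\delta}{dt} = \omega_0\,\Delta\omega_r .
\]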

The control block diagram of the single-machine infinite bus system represented by the classical generator model is shown in Figure 2.

In Figure 2, the block diagram parameters are the synchronizing torque coefficient, the damping torque coefficient, the inertia constant, the per-unit rotor speed deviation, the rotor angle deviation, the Laplace operator s, and the rated rotor speed. The expression for the rotor angle deviation can be derived from Figure 2.
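In the assumed standard notation, the relations implied by the block diagram of Figure 2 are

\[
\Delta\omega_r(s) = \frac{1}{2Hs}\Big[\Delta T_m(s) - K_S\,\Delta\delta(s) - K_D\,\Delta\omega_r(s)\Big], \qquad
\Delta\delta(s) = \frac{\omega_0}{s}\,\Delta\omega_r(s),
\]

which combine to give the rotor angle deviation

\[
\Delta\delta(s) = \frac{\omega_0\,\Delta T_m(s)}{2Hs^{2} + K_D s + K_S\,\omega_0}.
\]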

Considering the effect of the variation of the field (excitation) flux linkage on system performance and neglecting the effect of the damper windings, the excitation voltage is assumed to be constant. The rotor angle is the angle by which the q-axis leads the reference phasor; it is the sum of the internal angle and the angle by which the terminal voltage leads the reference phasor. The equivalent circuits relating the flux linkages and currents of the synchronous machine are shown in Figure 3.

The stator and rotor flux linkages can be expressed in terms of the d- and q-axis air-gap flux linkages, the saturated values of the mutual inductances, the stator leakage inductance, and the rotor (field) circuit inductance. The field (excitation) current is expressed as

The d-axis air-gap flux linkage is expressed in terms of these quantities as

Since no rotor circuit is considered on the q-axis, the q-axis air-gap flux linkage is expressed as

The air gap torque is written as

After the corresponding terms cancel, the resulting expression involves the saturated values of the reactances; in per unit, the reactance values are equal to the inductance values. The perturbation values are then obtained as

Then,

Then,

Assume that

The final system equation is written as

The control inputs from the prime mover and the exciter are the mechanical (air-gap) torque and the excitation voltage, respectively. If the air-gap torque output by the prime mover and the excitation voltage output by the exciter are constant, the corresponding input deviations are zero. If the classical generator model is used, both inputs are likewise zero in the final system equation. In the above equations, the saturated and unsaturated values of the mutual inductances are distinguished, and the initial steady-state values of the system variables are indicated by the subscript 0.

The variation of the field flux linkage is determined by the equation of the excitation (field) circuit. Then,

Here, the rotor (field) circuit self-inductance appears. Replacing the derivative operator with the Laplace operator s gives

Thus, a control block diagram with a constant excitation voltage representation is obtained in Figure 4. If the excitation voltage deviation is zero, the corresponding constant can be negative for a large local load that is supplied partly by the generator and partly by the remote large system (Figure 4).

The terms in parentheses are written in the following form:

Here, the predisturbance value of the voltage behind the transient reactance appears. The expanded form of the constant is

Similarly, the expansions of the remaining constants are calculated as

If the influence of saturation is ignored, the expression can be simplified to

2.2. Model of Automatic Voltage Regulation

The input signal to the excitation system is the generator terminal voltage, which is represented in terms of the system state variables.

In the case of small perturbations,

Neglecting the second-order terms in the perturbation values, then

Therefore,

In terms of the perturbation values, the stator voltage equations are written as

Then,

The model of the thyristor excitation system with automatic voltage regulation (AVR) is shown in Figure 5, which indicates the upper and lower limits of the exciter output voltage, the time constant of the terminal voltage transducer, the system reference voltage, and the output of the terminal voltage transducer. The thyristor excitation system contains only the connections necessary for the specific system under study and uses a high-gain exciter. The limiting and protection circuits are omitted (Figure 5) because they do not affect small-signal stability.

Considering the effect of the excitation system, the equation of the excitation circuit is

Because the exciter is modeled as a first-order element, the order of the whole system is increased by one relative to the original model, and a new state variable is added. Since the rotor speed and rotor angle equations are not directly affected by the exciter, the entire state-space model of the power system is written in the following vector-matrix form:

If the mechanical torque input is constant, its deviation is zero. The transfer function of the AVR and the exciter applies to any type of exciter and relates the excitation voltage deviation to the terminal voltage deviation as

2.3. Model of Power System with Automatic Voltage Regulation and Power System Stabilizer

The power system stabilizer (PSS) is an additional excitation control technique that suppresses low-frequency oscillations of synchronous generators by introducing additional feedback signals, and it has been widely utilized to improve the stability of power systems. The control block diagram of the excitation system, including the AVR and PSS, is shown in Figure 6. The PSS shown in Figure 6 includes three links: a phase compensation link, a signal-filtering (washout) link, and an amplification link. The phase compensation link provides an appropriate phase-lead characteristic to compensate for the phase lag between the exciter input and the air-gap torque of the generator. Since the signal-filtering link is a high-pass filter with a large time constant, the oscillation signal passes through it essentially unchanged. The stabilizer gain determines the amount of damping introduced by the PSS. Adding the perturbation values to the signal-filtering link gives

Hence,

Figure 6 shows that

After adding the PSS, and assuming the mechanical torque input deviation is zero, the whole state-space model is expressed in the following vector-matrix form:

The control framework of power systems containing AVR and PSS is obtained, as shown in Figure 6.
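For reference, the three PSS links in Figure 6 are commonly written in the following standard transfer-function form, with symbols assumed here ($K_{STAB}$ is the stabilizer gain, $T_W$ the washout time constant, and $T_1$, $T_2$ the phase-compensation time constants):

\[
G_{PSS}(s) = K_{STAB}\,\frac{sT_W}{1 + sT_W}\cdot\frac{1 + sT_1}{1 + sT_2},
\]

with the rotor speed deviation as input and the stabilizing signal added to the AVR summing junction as output.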

3. Multilayer Deep Deterministic Policy Gradient

Reinforcement learning is an algorithm that selects actions based primarily on feedback from the environment. By continuously interacting with the environment, the agent continuously “tries and fails” with one or more learning strategies to maximize the cumulative reward and achieve a specific goal [23]. The interaction between the agent and the environment means that the agent observes the state of the environment and performs an action; the environment then changes its state and returns a reward and a new state to the agent [24].

The mathematical basis of reinforcement learning is the Markov decision process (MDP) [25]. An MDP usually consists of a state space, an action space, a state transition matrix, a reward function, a policy function, and a discount factor. In an episode, the rewards obtained at successive time steps are recorded. Given a discount factor, the discounted return at time t is defined as the discounted sum of all subsequent rewards; before the episode ends, this return is an unknown random variable whose randomness comes from all states and actions after time t. The action-value function is defined as the expected return given the current state and action.

The expectation in this definition averages over all states and actions after time t. The optimal action-value function, which eliminates the policy by maximization, is

Here, S is the set of all states and A is the set of all actions.
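In conventional reinforcement learning notation (assumed here), these quantities can be written as

\[
U_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k}, \qquad
Q_\pi(s_t, a_t) = \mathbb{E}\big[\,U_t \mid S_t = s_t,\, A_t = a_t\,\big], \qquad
Q^\star(s, a) = \max_\pi Q_\pi(s, a), \quad \forall\, s \in S,\ a \in A,
\]

where $\gamma \in [0, 1)$ is the discount factor and $R_{t+k}$ are the rewards collected after time $t$.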

To address the curse of dimensionality that arises when DQN is applied to continuous action spaces, the deterministic policy gradient (DPG) was proposed [26]. DPG is the most common reinforcement learning approach for continuous control actions. DDPG combines DPG with the successful structure of DQN in an actor-critic framework, which improves its stability and convergence. Since DDPG is based on DPG and integrates deep learning, it can characterize high-dimensional data. In this framework, one neural network outputs the action and another neural network evaluates the performance of that action to improve the accuracy of the performed actions; they are called the policy (actor) network and the value (critic) network, respectively (Figure 7).

Experience is collected with a behavior policy, which is taken to be the output of the deterministic policy network plus additive exploration noise. The behavior policy is used to control the interaction between the agent and the environment; the trajectories of the agent are stored in an experience replay array, and the collected experience is reused for training (Figure 8).
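As an illustration (not the authors' implementation), a minimal Python sketch of such a behavior policy and replay array is given below; the function and class names, the noise standard deviation, and the action bounds are assumptions.

```python
# Illustrative sketch: DDPG-style behavior policy (deterministic actor output
# plus zero-mean Gaussian exploration noise) and a simple experience replay
# array. `actor` is any callable mapping a state vector to an action vector.
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        # Store the (s, a, r, s') transition quadruple.
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = map(np.array, zip(*batch))
        return states, actions, rewards, next_states


def behavior_policy(actor, state, noise_std=0.1, action_low=-0.1, action_high=0.1):
    """Deterministic action from the actor plus Gaussian exploration noise."""
    action = np.asarray(actor(state), dtype=np.float64)
    action = action + np.random.normal(0.0, noise_std, size=action.shape)
    return np.clip(action, action_low, action_high)
```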

During training of the policy network, the policy network outputs an action for a given state, and the value network then evaluates this action and returns an evaluation value. A higher evaluation from the value network indicates a more accurate action from the policy network. Hence, the objective function is defined as the expectation of the evaluation value.

Learning the policy network is thus transformed into a maximization problem, i.e.,

At each iteration, the gradient is estimated from one sampled observation of the random variable. The resulting estimate is called the deterministic policy gradient and is derived by applying the chain rule. At each iteration, a state is randomly sampled from the experience replay array, and gradient ascent with the policy-network learning rate is employed to update the parameters of the policy network. To bring the value network closer to the true action-value function, temporal-difference (TD) learning is utilized to train the value network so that it evaluates the actions of the policy network more accurately. At each iteration, a transition of the agent is selected from the experience replay array, and the value network evaluates the action output by the policy network as

Calculate TD targets as

Then, the loss function is

Calculate gradients as

Update the parameters of the value network with gradient descent as

After the loss function is reduced, the prediction of the value network is closer to the target value; this update uses the learning rate of the value network.
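A minimal PyTorch sketch of these two updates (critic TD regression and actor gradient ascent) is given below for illustration; `actor` and `critic` are assumed torch.nn.Module networks (the critic takes the concatenated state-action pair), `buffer` is the replay array sketched earlier, and the batch size, discount factor, and optimizers are assumptions rather than the authors' settings.

```python
# Illustrative one-step update for a DPG/DDPG-style agent (not the authors'
# MATLAB implementation). The critic is trained by temporal-difference (TD)
# regression; the actor is trained by gradient ascent on the critic's value.
import torch
import torch.nn.functional as F


def update_step(actor, critic, actor_opt, critic_opt, buffer,
                batch_size=64, gamma=0.99):
    states, actions, rewards, next_states = buffer.sample(batch_size)
    s = torch.as_tensor(states, dtype=torch.float32).view(batch_size, -1)
    a = torch.as_tensor(actions, dtype=torch.float32).view(batch_size, -1)
    r = torch.as_tensor(rewards, dtype=torch.float32).view(batch_size, 1)
    s2 = torch.as_tensor(next_states, dtype=torch.float32).view(batch_size, -1)

    # Critic (value network): minimize the squared TD error.
    with torch.no_grad():
        a2 = actor(s2)                                   # bootstrap action
        td_target = r + gamma * critic(torch.cat([s2, a2], dim=-1))
    q = critic(torch.cat([s, a], dim=-1))
    critic_loss = F.mse_loss(q, td_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()                                    # gradient descent on w

    # Actor (policy network): deterministic policy gradient via the chain rule,
    # implemented as gradient ascent on q(s, mu(s)).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

With the target networks introduced below, the TD target would instead be computed with the target actor and target critic.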

The deterministic policy gradient suffers from the same overestimation problem as DQN, which makes convergence difficult during training. DPG combined with deep learning yields DDPG [26]. DDPG adds a target policy network and a target value network to DPG; the two target networks are applied to calculate the TD targets. The target policy and target value networks have the same structure as the policy and value networks but different parameters. The DDPG policy network is updated in the same way as in DPG.

However, the parameters of the value network are updated differently. The evaluation of the agent's transition at the current time step is calculated by the value network, whereas the evaluation at the next time step is calculated by the target policy and target value networks. Thus, the evaluations at the two time steps are obtained as follows:

Update the parameters of the value network as

The target policy and target value networks are updated by a weighted average, where the weighting factor is a soft-update hyperparameter (Figure 9).
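A sketch of this weighted-average (soft) update in PyTorch is given below; the value of `tau` and the function name are assumptions.

```python
# Polyak / weighted-average update of a target network toward its online
# network, as used in DDPG (illustrative sketch).
import torch


@torch.no_grad()
def soft_update(target_net, net, tau=0.005):
    for p_target, p in zip(target_net.parameters(), net.parameters()):
        # target <- tau * online + (1 - tau) * target
        p_target.mul_(1.0 - tau).add_(tau * p)
```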

The MDDPG consists of multiple DDPGs, each of which contains two critic networks and two actor networks (Figure 10). For a given state, the MDDPG sums the actions produced by each DDPG with equal weight to obtain a new action. The training steps of the MDDPG are shown in Algorithm 1, and a simplified sketch of the training loop follows the algorithm.
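For illustration, the action-combination step can be sketched in Python as follows; the function name, the action bounds, and the plain (equal-weight) summation are assumptions based on the description above.

```python
# Hedged sketch of the MDDPG action combination: each constituent DDPG actor
# proposes an action for the same state, and the proposals are summed with
# equal weight to form the action that is actually applied.
import numpy as np


def mddpg_action(state, actors, action_low=-0.1, action_high=0.1):
    proposals = [np.asarray(actor(state), dtype=np.float64) for actor in actors]
    combined = np.sum(proposals, axis=0)      # equal-weight summation
    return np.clip(combined, action_low, action_high)
```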

In this study, the two DDPGs applied to the power system stability control task are trained and updated simultaneously. The rlMultiAgentTrainingOptions() function in MATLAB is adopted in conjunction with the train() function to train these two agents simultaneously. Moreover, the proposed method is an integral control algorithm, which requires the cooperation of the two agents to complete an output action.

(1) Randomly initialize the parameters of the policy network and of the value network.
(2) Randomly initialize the parameters of the target policy network and of the target value network.
(3) Randomly initialize the experience replay matrix.
(4) Execute one step of the simulation platform corresponding to the environment with the initial action.
(5) Obtain the initial state from the environment.
(6) For i from 1 to the maximum number of iterations N
(7)  Obtain the actions from all policy networks for the received state.
(8)  Obtain the new action by summing the actions output by all policy networks; the agent performs this action in the received state.
(9)  Obtain the reward value and the next state from the environment.
(10)  Store the transition quadruple (state, action, reward, next state) of the agent in the experience replay matrix.
(11)  Update the sampling priority.
(12)  Randomly sample M samples from the experience replay matrix and calculate the current target value.
(13)  Calculate the TD error and TD target.
(14)  Update the parameters of the policy and value networks by gradient ascent and gradient descent, respectively.
(15)  Set the soft-update hyperparameter and update the parameters of the target policy and target value networks by weighted averaging.
(16) End for
(17) Save the trained models/networks.
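A simplified Python sketch of this loop for an MDDPG with two or more constituent DDPGs is given below; the environment interface (reset/step), the agent objects, and all names are illustrative assumptions rather than the authors' MATLAB code.

```python
# Illustrative MDDPG training loop following Algorithm 1. Each agent in
# `agents` is assumed to expose an `actor(state)` callable and an
# `update(buffer, batch_size)` method (e.g., built from the update and
# soft-update sketches above); `env` exposes reset() and step(action).
import numpy as np


def train_mddpg(env, agents, buffer, n_iterations=10_000,
                batch_size=64, noise_std=0.1):
    state = env.reset()                                    # steps (4)-(5)
    for _ in range(n_iterations):                          # step (6)
        # Steps (7)-(8): sum the actions proposed by all policy networks.
        action = np.sum([np.asarray(agent.actor(state)) for agent in agents],
                        axis=0)
        action = action + np.random.normal(0.0, noise_std, size=np.shape(action))

        next_state, reward = env.step(action)              # step (9)
        buffer.push(state, action, reward, next_state)     # step (10)

        if len(buffer.buffer) >= batch_size:
            for agent in agents:                           # steps (12)-(15)
                agent.update(buffer, batch_size=batch_size)

        state = next_state
    return agents                                          # step (17): trained agents
```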

4. Results and Discussion

The simulation studies performed in this work are carried out on a laptop with an AMD 8500H processor, 32 GB of RAM, and a 3060 GPU. To verify the feasibility and effectiveness of the proposed method, a disturbance input is designed, as shown in Figure 11. The designed disturbance fluctuates very sharply between 40 s and 50 s. This dramatically varying disturbance is designed to verify whether the proposed method can adapt to complex stochastic disturbances and maintain the static safety and stability of the novel power system.

The parameters of this novel power system are set as follows: , , , , , , , , , , , , , , and .

In this work, the proposed MDDPG method is compared with a traditional proportional-integral-derivative controller and the traditional Q-learning method. For a fair comparison, similar or identical parameters are adopted for the reinforcement learning family of methods.

The parameters of the proportional-integral-derivative controller used in this study (i.e., 0.7102, 25.8658, and 0.0326) are optimized by a particle swarm optimization algorithm with a population size of 200 and 200 iterations. The three parameters of this conventional controller are coupled with each other and must be tuned over a wide range to obtain high control performance. The control parameters obtained by the optimization algorithm are satisfactory when both the number of iterations and the population size of the chosen optimization algorithm exceed 100.
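For illustration, a minimal particle swarm optimization sketch for tuning the three PID gains is given below; the cost function `evaluate_pid`, the search bounds, and the inertia and acceleration coefficients are hypothetical assumptions, while the population size and iteration count follow the values stated above.

```python
# Hedged sketch of particle swarm optimization for PID tuning. `evaluate_pid`
# is a hypothetical cost function (e.g., integral of absolute rotor angle
# error from a closed-loop simulation) that is not part of the original study.
import numpy as np


def pso_tune_pid(evaluate_pid, bounds=((0, 50), (0, 50), (0, 1)),
                 n_particles=200, n_iters=200, w=0.7, c1=1.5, c2=1.5):
    lo = np.array([b[0] for b in bounds], dtype=float)
    hi = np.array([b[1] for b in bounds], dtype=float)
    rng = np.random.default_rng(0)

    x = rng.uniform(lo, hi, size=(n_particles, 3))   # positions: (kp, ki, kd)
    v = np.zeros_like(x)                             # velocities
    pbest = x.copy()
    pbest_cost = np.array([evaluate_pid(*p) for p in x])
    gbest = pbest[np.argmin(pbest_cost)].copy()

    for _ in range(n_iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        cost = np.array([evaluate_pid(*p) for p in x])
        improved = cost < pbest_cost
        pbest[improved], pbest_cost[improved] = x[improved], cost[improved]
        gbest = pbest[np.argmin(pbest_cost)].copy()

    return gbest                                     # tuned (kp, ki, kd)
```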

The parameters of the reinforcement learning methods employed in this study are set as follows. The more actions in the action matrix, the more training time and memory are required, which may even exceed the computer memory. Conversely, a smaller number of actions reduces the computation memory and computation time but lowers the control accuracy. After extensive testing, the action matrix in this work consists of 16 equally spaced values between −0.1 and +0.1. The considerations for setting the numbers of rows and columns of the Q-value matrix and the probability matrix are similar to those for choosing the number of actions. Therefore, after numerous tests, both the Q-value matrix and the probability matrix in this work are 16-row, 16-column matrices. Higher learning rates imply faster convergence but less accurate control actions, whereas lower learning rates imply slower convergence and longer training times; in either case, the proposed algorithm can characterize the input-output relationship of the system after a sufficiently long period of online training. Therefore, the learning rate, discount factor, and probability update rate are set to the default values used by most references. The learning rate is 0.1, the discount factor is 0.05, and the probability update rate is 0.9. The number of hidden-layer units inside the actor and critic networks is set to [30 30].
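For illustration, the tabular Q-learning baseline described above can be set up in Python as follows; the 16-bin state discretization and the epsilon-style use of the probability update rate are assumptions, while the action range, matrix sizes, learning rate, and discount factor follow the stated values.

```python
# Sketch of the tabular Q-learning baseline setup: 16 actions uniformly spaced
# in [-0.1, +0.1], a 16 x 16 Q table, learning rate 0.1, and discount factor
# 0.05 (values taken from the text; the 16-bin state discretization is an
# assumption for illustration).
import numpy as np

N_ACTIONS = 16
N_STATES = 16
ACTIONS = np.linspace(-0.1, 0.1, N_ACTIONS)   # discrete action set
Q = np.zeros((N_STATES, N_ACTIONS))           # 16-row, 16-column Q-value matrix

ALPHA = 0.1    # learning rate
GAMMA = 0.05   # discount factor (as stated in the text)
EPSILON = 0.9  # interpreted here as the probability-update / exploration rate


def q_update(s, a_idx, reward, s_next):
    """One tabular Q-learning update for state index s and action index a_idx."""
    td_target = reward + GAMMA * np.max(Q[s_next])
    Q[s, a_idx] += ALPHA * (td_target - Q[s, a_idx])
```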

The rotor angle deviations obtained by the compared algorithms are shown in Figure 12. The proposed MDDPG obtains the smallest rotor angle deviation. The reason why the table-based (key-value pair) Q-learning method does not achieve higher control performance than the proportional-integral-derivative controller is that it has too few action values. Although the number of actions available to the proposed MDDPG is also small, the MDDPG has strong prediction capability and therefore obtains better control performance.

The controller outputs given by the three compared algorithms are shown in Figure 13. The conventional proportional-integral-derivative controller gives a smooth output curve. The Q-learning controller gives control actions that rely heavily on trial and error. The control commands given by the proposed MDDPG appear irregular but produce better control results; in particular, the actions given between 40 s and 50 s counteract the sharp disturbances.

The experiments with MDDPG for the rotor angle stability control of power systems show that its control performance is more stable than that of the other algorithms. The MDDPG processes more information and decomposes the high-dimensional input vector into multiple low-dimensional input vectors, effectively avoiding the curse of dimensionality. In addition, Figure 13 shows that (1) the conventional controller, despite its very smooth and continuous control commands, ultimately yields power angles with larger fluctuations; a conventional controller with only three parameters can hardly achieve optimal steady-state error and convergence speed simultaneously; (2) Q-learning, with strong random fluctuation, gives trial-and-error signals with large fluctuations and a long convergence period; and (3) the MDDPG, which balances trial-and-error fluctuations and control performance, achieves smaller control errors than the traditional controller and Q-learning.

The method of principal component analysis could be considered to solve the coupling problems of the inputs of MDDPG. Exploring an MDDPG that can handle multidimensional information while reducing the computational memory and computational time of the system is an important direction.

The deficiencies of the proposed MDDPG are summarized as follows: (1) The MDDPG processes more information and therefore requires more computation memory and longer computation time. (2) The MDDPG splits the high-dimensional input vector into several low-dimensional vectors, which weakens the coupling of the input information. (3) The MDDPG has a total of eight networks that need to be trained and updated, which is more than in standard deep reinforcement learning methods.

5. Conclusions

In this work, a reinforcement learning algorithm called MDDPG, which combines several DDPGs, is proposed to solve the rotor angle stability control problem of novel power systems. The test results verify the feasibility and effectiveness of the MDDPG. The primary characteristics of the method are outlined as follows:
(1) The MDDPG combines multiple DDPGs and applies multiple deep neural networks with high adaptability, high fault tolerance, and self-organization capability. When the system is subject to different perturbations, the MDDPG can control the system rotor angle stably with a smaller error than the proportional-integral-derivative controller and Q-learning.
(2) The MDDPG can transform a high-dimensional input into multiple low-dimensional inputs. The output action of the MDDPG is deterministic, is the superposition of the actions output by each DDPG, and provides accurate control continuously in real time with a short settling time.
(3) The MDDPG provides accurate control. In the presented example, the rotor angle stability control error is smaller than that of the compared algorithms.

Future work could be improved in the following three ways: first, the proposed MDDPG framework could be incorporated into other more effective deep reinforcement learning methods; second, the proposed multilayer framework could be reduced to a deep reinforcement learning method composed of multiple deep neural networks; and finally, the static safety and stability problem of the novel power system could be combined with the dynamic safety and stability problem to be solved by deep reinforcement learning simultaneously.

Data Availability

The data supporting the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Yun Long was responsible for conceptualization, funding acquisition, project administration, supervision, methodology, resources, writing review and editing, data curation, formal analysis, investigation, software, validation, visualization, and writing the original draft. Youfei Lu, Hongwei Zhao, Renbo Wu, Tao Bao, and Jun Liu reviewed and edited the manuscript.

Acknowledgments

The authors appreciatively acknowledge the support of the Science and Technology Projects of the China Southern Power Grid (GZHKJXM20210041).