Abstract
The efficiency of on-site consumption of new energy and the economy of the corresponding dispatching strategy in modern microgrids are of growing concern, and both depend closely on the microgrid control model under source-load uncertainty. To this end, this paper proposes a multiagent hierarchical IQ(λ)-HDQC regulation strategy to realize source-load-storage-charging collaborative control of a microgrid model with high-penetration new energy. The first layer adopts the IQ(λ) strategy, whose coupled estimation method avoids the overestimation and underestimation problems of traditional reinforcement learning. The second layer adopts the HDQC allocation strategy, which overcomes the low utilization of new energy under proportional allocation and improves the adaptability of the regulation strategy in complex stochastic environments. The interaction of the two layers realizes global dynamic interactive regulation of the microgrid's source-network-load-storage-charging (SNLSC) system. Energy-efficiency indicators are constructed in this paper to evaluate the simulation results, and the superiority of the proposed strategy is verified through simulations of the microgrid system.
1. Introduction
Under the pressure of energy demand and environmental protection, renewable energy generation is attracting growing attention. As an effective carrier of renewable energy, microgrids [1–3] reduce the impact of the randomness of renewable output on the stability of power systems and are an effective way to improve the utilization and penetration rate of new energy. However, owing to the lack of grid support and to environmental uncertainty, the energy autonomy of microgrids faces many challenges [4–6], and how to achieve it has become a hot research issue.
For microgrids to consume new energy efficiently, new energy within the microgrid must be consumed preferentially. With the development of artificial intelligence, research on automatic generation control (AGC) [7–9] has enabled global dynamic interactive adjustment of the source-network-load-storage-charging (SNLSC) system of microgrids. Wu et al. [10] propose an extreme Q-learning algorithm to parameterize the droop control of the microgrid, thereby integrating frequency regulation and economic dispatch. However, this method suffers from "overestimation" of action values during exploration in a strongly stochastic environment. To address this, Xi and Zhou [11] propose the DQ forecast (σ, λ) algorithm, which achieves fast and stable power regulation of AGC units but introduces a new "underestimation" problem. Moreover, the total power command is allocated by a fixed proportion of adjustable capacity, while the new energy generation model is strongly nonlinear, so the method easily falls into local optima and suffers from the curse of dimensionality.
For this reason, multiagent reinforcement learning has been applied to AGC to achieve dynamic allocation between the conventional units and the new energy output of microgrid systems. Reference [12] proposes an ecological population cooperative control strategy based on the stochastic consensus game framework of a multiagent system (MAS); with a win-lose criterion and the space-time tunneling idea, it converges quickly to the Nash equilibrium through frequent information exchange among the agents. Reference [13] establishes a three-level MAS architecture to coordinate AGC and automatic voltage control, exploiting the independence, autonomy, and collaboration of agents to achieve physically distributed control while maintaining logical unity. However, with large-scale distributed energy access to the microgrid, the convergence speed of these methods decreases; the main remaining problems are the low reserve capacity of the microgrid system, the difficulty of local consumption of new energy, and the reduced convergence accuracy of the earlier algorithms.
Therefore, to solve the above problems, this paper improves both the two-layer strategy and the model. The distributed AGC strategy is divided into an AGC control strategy and an AGC allocation strategy. To mitigate the impact of adding large amounts of new energy to the grid, interleaved Q-learning (IQ) [14] is introduced into the control algorithm; it avoids both the "overestimation" produced by the maximum estimator (ME) and the "underestimation" produced by the double estimator (DE), and it incorporates eligibility traces [15] to reduce control bias. In the allocation part, a multiagent hierarchical strategy, hierarchical double Q-learning consensus (HDQC), is formed by combining a consistency algorithm with isomorphic properties [16] and the double Q-learning (DQ) algorithm [17]. Simulations of a microgrid with EVs and large-scale new energy sources show that, compared with previous agent-based algorithms, the proposed scheme can fully utilize new energy and realize global dynamic interactive regulation of the SNLSC system.
2. High Penetration New Energy Microgrid Control Framework
2.1. Microgrid Control Architecture
As shown in Figure 1, the microgrid units that incorporate a large amount of new energy differ greatly in ramp rate and spatial distance. The HDQC allocation strategy uses clustering to divide the generating units into power generation groups (PGGs) and selects the unit with the largest capacity in each PGG as the dominant unit. The IQ(λ)-HDQC regulation strategy uses IQ(λ) to obtain the total power to be generated in the microgrid system, and the HDQC strategy then allocates the total power command to each unit in the PGGs, realizing global dynamic interactive regulation of the microgrid SNLSC system.

2.2. Microgrid Distribution Model
The HDQC allocation strategy takes the area control error, the ramp time, and the energy efficiency as three objectives and constructs two multiobjective functions within the microgrid. The objective function h1 minimizes the sum of the area control error (ACE) and the maximum ramp time over all generating units in the microgrid; the objective function h2 minimizes the ratio of carbon emission (CE) to nonrenewable energy generation. The mathematical model of the power command allocation process of the microgrid under the HDQC allocation strategy is therefore as follows, where A is the ACE of the microgrid system; Ctotal is the total CE of all units in the microgrid system; Piw is the power command of the wth unit in PGGi; Pn and P are the non-new-energy generation power and the total generation power of the microgrid, respectively; Ptie is the tie-line exchange power; Δf is the frequency deviation and B is the frequency response coefficient; Pi is the power command of PGGi, equal to the product of the distribution factor ηi and the total regulation power command PΣ of the system; Uiw and Liw are the upper and lower limits of the power regulation rate of the wth unit in PGGi, and the power regulation capacity of that unit is likewise bounded above and below; m is the number of PGGs; and Wi is the number of units in PGGi.
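The display equations of this model are not reproduced above. As a hedged reconstruction from the stated definitions, the two objectives and the main constraints can be written roughly as follows, where Tiw denotes the ramp time of the wth unit of PGGi; the ACE sign convention and the exact constraint set are assumptions and may differ from the published model:

```latex
\begin{aligned}
& \min\; h_1 = |A| + \max_{i,w} T_{iw}, \qquad A = \Delta P_{tie} + B\,\Delta f, \\
& \min\; h_2 = C_{total}/P_n, \\
& \text{s.t.}\quad P_i = \eta_i P_{\Sigma},\quad \sum_{i=1}^{m}\eta_i = 1,\quad \sum_{w=1}^{W_i} P_{iw} = P_i, \\
& \qquad\;\; L_{iw} \le \dot P_{iw} \le U_{iw},\quad \underline{P}_{iw} \le P_{iw} \le \overline{P}_{iw},\quad w=1,\dots,W_i,\; i=1,\dots,m.
\end{aligned}
```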
3. IQ(λ)-HDQC Regulation Strategy
The IQ(λ)-HDQC regulation strategy handles both the control and the allocation of AGC. For control, the IQ(λ) algorithm improves the convergence speed and control performance of Q-learning in a strongly stochastic environment; for allocation, the HDQC algorithm uses a novel hierarchical Q-learning strong-consistency algorithm to overcome the curse of dimensionality caused by the proliferation of large-scale units. The method achieves fast convergence in the two-layer power allocation.
3.1. IQ(λ) Control Strategy
In traditional reinforcement learning, the maximum-expectation estimation represented by Q-learning excessively pursues the maximum long-term discounted reward. It tends to choose the action corresponding to the maximum Q value, leading to overestimation of action values during strategy exploration. The double-estimation method represented by DQ learning adopts a more conservative strategy, which gives rise to underestimation of action values. Both effects hinder the agents' search for the optimal strategy. For this reason, this paper incorporates eligibility traces into the IQ algorithm, which uses a coupled estimation method, and proposes a new IQ(λ) algorithm that converges quickly by reducing the difference between the Q values.
3.1.1. Maximum Expectation Estimation
Q-learning always picks the action with the highest Q value, which is known as the greedy strategy:

a_g = arg max_a Qk(s, a),

where s is the current state, Qk is the kth iterate of the Q-value function, and Q(s, a) is the Q-value function under state s and action a. Based on the greedy strategy, the Q-learning algorithm finds the optimal Q-value function by iterative computation, and the Q value is updated as follows:

Qk+1(sk, ak) = Qk(sk, ak) + α[R(sk, ak) + γ max_a Qk(sk+1, a) − Qk(sk, ak)],

where α is the learning rate, R(sk, ak) is the reward obtained under state sk and action ak, and γ is the discount factor. Always choosing the action with the highest Q value causes the agents to follow the same path repeatedly without adequately searching the rest of the action space, so they often converge to a local optimum.
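As an illustration only (not the paper's AGC implementation), the greedy selection and the tabular Q-value iteration above can be sketched as follows, using the learning-rate and discount-factor values adopted later in the parameter settings:

```python
import numpy as np

def greedy_action(Q, s):
    """Greedy strategy: pick the action with the largest Q value in state s."""
    return int(np.argmax(Q[s]))

def q_learning_step(Q, s, a, r, s_next, alpha=0.9, gamma=0.8):
    """One tabular Q-learning update with a greedy (max) bootstrap on the next state.
    Q is a 2-D array indexed as Q[state][action]."""
    td_target = r + gamma * np.max(Q[s_next])   # maximum-expectation estimate
    Q[s][a] += alpha * (td_target - Q[s][a])    # move toward the TD target
    return Q
```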
3.1.2. Double Estimation
DQ learning uses two disjoint value functions, QA and QB, instead of a single value function Q. The behavioral actions a* and b* for QA and QB are chosen greedily with respect to each function:

a* = arg max_a QA(s, a),  b* = arg max_a QB(s, a).

DQ learning splits the action-selection and action-evaluation processes to avoid overestimation of Q, and its iteration is updated as follows:

QA(sk, ak) ← QA(sk, ak) + α[R(sk, ak) + γ QB(sk+1, a*) − QA(sk, ak)],
QB(sk, ak) ← QB(sk, ak) + α[R(sk, ak) + γ QA(sk+1, b*) − QB(sk, ak)].
By completely decoupling the selection and estimation processes, DQ learning avoids overestimating the true value but introduces an underestimation problem that slows down the convergence of the algorithm.
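For comparison, the decoupled double-Q update can be sketched as below (illustrative only); which table is updated at each step is chosen at random, as in standard double Q-learning:

```python
import random
import numpy as np

def double_q_step(QA, QB, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One double Q-learning update: one table selects the greedy action,
    the other table evaluates it, which removes the overestimation bias."""
    if random.random() < 0.5:
        a_star = int(np.argmax(QA[s_next]))                                # QA selects
        QA[s][a] += alpha * (r + gamma * QB[s_next][a_star] - QA[s][a])    # QB evaluates
    else:
        b_star = int(np.argmax(QB[s_next]))                                # QB selects
        QB[s][a] += alpha * (r + gamma * QA[s_next][b_star] - QB[s][a])    # QA evaluates
    return QA, QB
```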
3.1.3. IQ(λ) Learning
IQ learning avoids the overestimation and underestimation problems of reinforcement learning described above by coupling the sample set. The Q-value-function error estimate of IQ learning is given by the following equation, where the two error terms are the evaluation errors of the QA and QB functions, computed for the actions a* and b* selected by equations (4) and (5), respectively, and σ (0 < σ < 0.5) is the coupling ratio, which reflects the proportion of states shared by QA and QB. The closer σ is to 0, the closer IQ learning is to the underestimated state; when σ = 0.5, IQ learning degenerates to the fully overestimated state. A comparative simulation study shows that σ = 0.25 gives the better result. Eligibility traces are incorporated into IQ learning to trace back past information; the SARSA eligibility trace is adopted in this paper as follows, where ek(s, a) is the eligibility trace at the kth iteration for state s and action a. The IQ(λ) algorithm is updated as follows:
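The precise IQ(λ) update is given by the display equations referenced above, which are not reproduced here. Purely as a rough illustration of the coupled-estimation idea combined with a SARSA-style eligibility trace, one possible sketch is the following; the way the coupling ratio σ mixes the errors of the two estimators is an assumption on our part:

```python
import numpy as np

def iq_lambda_step(QA, QB, E, s, a, r, s_next, a_next,
                   alpha=0.9, gamma=0.8, lam=0.9, sigma=0.25):
    """Rough sketch of a coupled (interleaved) two-estimator update with a
    SARSA(lambda) eligibility trace E. The mixing of the two TD errors by
    sigma is an assumed form, not the paper's exact IQ(lambda) equations."""
    # SARSA-style TD errors of the two coupled estimators
    delta_A = r + gamma * QB[s_next][a_next] - QA[s][a]
    delta_B = r + gamma * QA[s_next][a_next] - QB[s][a]
    delta = sigma * delta_A + (1.0 - sigma) * delta_B   # coupled error (assumed)

    E[s][a] += 1.0                                      # accumulate the trace
    for st in range(len(E)):
        for ac in range(len(E[st])):
            QA[st][ac] += alpha * delta * E[st][ac]     # propagate the error backward
            QB[st][ac] += alpha * delta * E[st][ac]
            E[st][ac] *= gamma * lam                    # decay all traces
    return QA, QB, E
```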
3.2. HDQC Algorithm
The hierarchical Q-learning (HQL) algorithm [18] introduced here enables interactive learning and self-learning among the PGGs. Since the algorithm is based on the Q(λ) algorithm, the time-tunneling method is iteratively updated as in equation (7), and DQ learning is incorporated to form the HDQC algorithm.
Suppose there are N agents in PGGi, denoted by p = {1, …, N}. Let G = (V, E) denote the undirected multiagent communication graph, where V is the set of nodes and E is the set of edges. The Laplacian matrix L = [lij] reflects the topology of the multiagent network [19] and is defined by lij = −bij for i ≠ j and lii = Σj≠i bij, where bij is the probability of communication between agents i and j (i ≠ j; i, j = 1, …, N).
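For illustration, this construction can be sketched numerically as follows; the 4-agent communication matrix B is a hypothetical example:

```python
import numpy as np

def laplacian(B):
    """Build the graph Laplacian L = D - B from a symmetric matrix B of
    communication probabilities b_ij (zero diagonal):
    l_ii = sum_j b_ij and l_ij = -b_ij for i != j."""
    B = np.asarray(B, dtype=float)
    return np.diag(B.sum(axis=1)) - B

# Hypothetical 4-agent undirected ring topology for illustration
B = np.array([[0.0, 1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0, 0.0]])
L = laplacian(B)   # every row of L sums to zero, as required for consensus
```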
The ramp time is chosen as the consistency variable of the PGG. The ramp time of the wth unit of the ith PGG in the regional grid is the ratio of ∆Piw, the power generation command of that unit, to its ramp rate; the ramp rate is expressed as follows:
The ramp time of the wth agent within the PGG is updated as follows:
Meanwhile, the power must be corrected according to whether power balance is maintained within the microgrid, as judged by ∆Perror−i, the power correction command of the ith PGG, which is expressed as follows:
Under the condition of frequent information interaction between agents and a constant gain bij, collaborative consistency of the agents can be achieved if and only if the directed graph is strongly connected [20].
When the boundary condition is reached, the generated power command ∆Piw of the unit with the maximum ramp time is given as follows, where the upper and lower bounds in the expression are the maximum and minimum power reserve capacities of the wth unit of PGGi, respectively.
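Since the corresponding display equations are not reproduced above, the following is only a schematic sketch of the allocation step under assumed update and correction rules: ramp times are driven to consensus, unit commands are clipped to their reserve-capacity limits, and the residual imbalance ∆Perror is fed back:

```python
import numpy as np

def consensus_allocation(T, rates, D, dP_total, p_min, p_max, eps=0.05):
    """Schematic ramp-time consensus for one PGG (assumed update rules).
    T: initial unit ramp times; rates: unit ramp rates; D: a row-stochastic
    matrix derived from the communication topology; dP_total: PGG power command."""
    for _ in range(200):                      # consensus iterations
        T = D @ T                             # ramp times move toward a common value
        dP = T * rates                        # implied unit power commands
        dP = np.clip(dP, p_min, p_max)        # respect reserve-capacity limits
        error = dP_total - dP.sum()           # residual power imbalance (dP_error)
        if abs(error) < eps:
            break
        T = T + error / rates.sum()           # feed the imbalance back (assumed form)
    return dP
```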
4. Simulation Design
Different regional grids play a multiagent dynamic game through the IQ(λ) control strategy to obtain the total power of each region. Within each regional grid, according to spatial distance and generator type, the microgrid system is virtually partitioned into multiple PGGs using the graph-theoretic cut-set method. Each PGG is regarded as a multiagent system that dynamically allocates the total regulation power command to each unit through the HDQC strategy and implements regional boundary power exchange control, jointly maintaining global dynamic interactive regulation of the microgrid SNLSC system.
4.1. Reward Function Design
To judge the control performance of the regional grid system, the three main AGC performance evaluation criteria (area control error (ACE), interconnected-grid frequency deviation (Δf), and control performance standard (CPS) [21]) together with the energy efficiency are used as inputs of the reward function, which evaluates whether the current decision yields long-term benefits and avoids large power fluctuations. The agent calculates and updates the system state quantities and the reward function in real time and outputs the optimal control signal ΔPord−i (the power regulation command of the ith unit).
4.1.1. IQ (λ) Reward Function Design
After dimensionless processing, A(i) (the instantaneous value of ACE) and Δf(i) (the instantaneous value of Δf) are normalized and linearly weighted to obtain the target reward function as follows, where μ is the weighting factor and is taken as 0.5.
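The exact expression is given by the display equation referenced above. As a hedged sketch of a normalized, linearly weighted reward of this form (the negative sign, the absolute values, and the normalization constants are assumptions):

```python
def reward_iq(A_i, df_i, A_max=1.0, df_max=1.0, mu=0.5):
    """Sketch of the IQ(lambda) reward: penalize the weighted, normalized
    magnitudes of ACE and frequency deviation (assumed form)."""
    return -(mu * abs(A_i) / A_max + (1.0 - mu) * abs(df_i) / df_max)
```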
4.1.2. HDQC Reward Function Design
The linearly weighted combination of the dimensionless A(i) and the energy efficiency is selected as the reward function, which is shown as follows, where ω1 is the weighting factor and is taken as 0.7.
The IQ(λ) control strategy outputs the total regulation power command, which HDQC uses as a state quantity, discretized into (−∞, −850], (−850, −400], (−400, −20], (−20, 20), [20, 400), [400, 850), and [850, +∞). The set of action strategies is Ai = [η1, η2, …, ηj] = [(η11, η12, …, η1j), (η21, η22, …, η2j), …, (ηn1, ηn2, …, ηnj)], where ηnj is the allocation factor of PGGj within regional grid n.
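As a concrete illustration of this discretization and of how an allocation-factor action might be indexed (the handling of interval boundaries is simplified and the candidate allocation factors are hypothetical placeholders):

```python
import numpy as np

# Breakpoints of the total-command state space (MW), from the discretization above
EDGES = [-850.0, -400.0, -20.0, 20.0, 400.0, 850.0]

def discretize_command(p_total):
    """Map the total regulation power command onto one of the 7 intervals
    (boundary handling simplified for this sketch)."""
    return int(np.searchsorted(EDGES, p_total, side="right"))

# Hypothetical action set: each action is a vector of allocation factors eta
# over the PGGs of one regional grid, summing to 1.
ACTIONS = np.array([[0.5, 0.3, 0.2],
                    [0.4, 0.4, 0.2],
                    [0.3, 0.3, 0.4]])

state = discretize_command(-120.0)   # falls in (-400, -20], i.e., state index 2
eta = ACTIONS[0]                     # allocation factors chosen for this step
```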
4.2. Parameter Settings
In the IQ(λ)-HDQC regulation strategy, five system parameters are set as follows (collected in the sketch after this list).
(1) The learning factors α1 and α2 (0 < α1, α2 < 1) balance convergence speed against stability: a larger α accelerates convergence, while a smaller α improves system stability. α1 is taken as 0.9 for faster learning convergence; considering the strong randomness of load perturbation after high-proportion, large-capacity new energy access, α2 is taken as 0.1.
(2) The discount factors γ1 and γ2 (0 < γ1, γ2 < 1) weigh the importance of current versus future reward; the closer the value is to 1, the more emphasis is placed on long-term reward. γ1 is taken as 0.8 and γ2 as 0.9.
(3) The attenuation factor of the eligibility trace, λ (0 < λ < 1), governs the convergence rate and the handling of non-Markov effects: the larger λ is, the more slowly the eligibility traces of past state-action pairs decay and the more credit is assigned to them. λ is taken as 0.9.
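Collected in one place, and with the coupling ratio σ = 0.25 from Section 3.1.3 included, these settings might be written as a simple configuration; the grouping below is illustrative only and does not assign the factors to a specific layer:

```python
# Parameter settings of the IQ(lambda)-HDQC regulation strategy, as listed above
PARAMS = {
    "alpha_1": 0.9,   # learning factor 1 (chosen for fast convergence)
    "alpha_2": 0.1,   # learning factor 2 (chosen for stability under strong randomness)
    "gamma_1": 0.8,   # discount factor 1
    "gamma_2": 0.9,   # discount factor 2
    "lambda_": 0.9,   # eligibility-trace attenuation factor
    "sigma":   0.25,  # IQ coupling ratio (Section 3.1.3)
}
```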
4.3. Strategy Process
The eligibility trace is introduced into the IQ algorithm to form the control strategy, and double Q-learning is introduced into the HQL algorithm to form the allocation strategy; together they constitute the IQ(λ)-HDQC regulation strategy. The IQ(λ)-HDQC process, combined with the parameter settings described above, is shown in Figure 2.

5. Simulation Studies
5.1. Microgrid System Model Simulation
To realize global dynamic interactive regulation of the microgrid SNLSC system, a microgrid model is built in this paper, comprising micro gas turbines, small hydropower, electric vehicles [22], a solar energy storage power plant [23], a wind farm, and a combined cooling, heating, and power storage model [24], as shown in Figure 3; the model parameters [25] are listed in Table 1. Among them, the wind farm and the electric vehicles participate only in frequency regulation, the PV is simulated with a 24-hour light-intensity profile [26] plus a small perturbation, and the unit parameters are listed in Table 2. The AGC work period is 4 s. In the figure, the controller of each region shares data through the interconnections between regions, obtains dynamic information on the AGC performance indices, realizes coordinated control of the system through continuous trial-and-error optimization, effectively obtains the optimal AGC total power command of each region, and optimizes the active power output of the frequency regulation units.

Electric vehicles have replaced a portion of fuel vehicles. Plug-in electric vehicles (PEVs) are equipped with energy storage batteries that can be charged and discharged, and when a large number of PEVs are connected to the grid as a cluster, they can participate in grid frequency regulation in place of traditional thermal units. The transfer-function block diagram of a single PEV is shown in Figure 4, where Ichj is the constant charging current; SOC and SOC0 are the EV battery state of charge and its initial value, respectively; KC is the droop control gain; TC is the droop control time constant; Er is the rated capacity of the energy storage battery; Rs and Rt are the series and parallel resistances of the battery, respectively; and Ct is the shunt capacitance.
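As an illustration only, a heavily simplified discrete-time sketch of the droop path of such a PEV cluster (a single first-order lag with gain KC and time constant TC; the battery RC branch, SOC dynamics, and charging current are ignored, and the numerical values are placeholders) might look like this:

```python
def pev_droop_response(df_series, Kc=1.0, Tc=0.1, dt=0.01):
    """Assumed first-order droop model of a PEV cluster: the power deviation
    tracks -Kc * df through a lag with time constant Tc (sketch only)."""
    dP, out = 0.0, []
    for df in df_series:
        target = -Kc * df                  # droop: oppose the frequency deviation
        dP += dt / Tc * (target - dP)      # first-order lag (forward-Euler step)
        out.append(dP)
    return out
```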

The combined cooling, heating, and power (CCHP) system [24] incorporates solar energy and includes solar collectors, solar PV generation equipment, gas boilers, heat-exchange equipment, and centrifugal chillers. The system realizes complementary and collaboratively optimized operation of multiple energy sources, using the waste heat of solar power generation and the gas boilers to produce electricity and to meet heating and cooling demands, thereby improving energy utilization efficiency and reducing emissions of carbon and harmful gases.
5.2. Prelearning Simulation
Before online operation, extensive prelearning is required for the IQ(λ)-HDQC regulation strategy to optimize the state-action set and the action-selection set. A continuous sinusoidal load disturbance with a period of 1000 s, an amplitude of 1000 MW, and a duration of 10000 s is applied to the microgrid system for full learning; the prelearning and online operation results of the microgrid are shown in Figure 5. The figure shows that after about 1000 s of trial-and-error search during prelearning, the Q value of the optimal state-action pair is found, and CPS1 stabilizes above 188% in region A during the prelearning phase and above 200% in online operation, both within the qualified CPS1 range. The prelearning simulation verifies that the IQ(λ)-HDQC strategy converges quickly in complex stochastic environments and can control the generators under more complex load conditions.
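The prelearning disturbance described above can be reproduced with a simple signal generator; the 4 s sampling step follows the AGC work period stated in Section 5.1:

```python
import numpy as np

def prelearning_disturbance(amplitude=1000.0, period=1000.0, duration=10000.0, dt=4.0):
    """Continuous sinusoidal load disturbance used for prelearning:
    1000 MW amplitude, 1000 s period, 10000 s duration, sampled every 4 s."""
    t = np.arange(0.0, duration, dt)
    return t, amplitude * np.sin(2.0 * np.pi * t / period)

t, dP_load = prelearning_disturbance()
```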

5.3. Random Square Wave Load Disturbance
After adequate prelearning, a random square-wave load disturbance is introduced into the microgrid model to simulate random load disturbances (i.e., irregular sudden increases and decreases of load and new energy output) in the stochastic environment of the power system, so as to analyze the performance of the proposed strategy. A load disturbance lasting 10,000 s is used for the assessment, and the proposed strategy is compared with three control strategies: HQL [18], ML-AGC [27], and VWPC-HDC [28]. Figure 6 shows the online control effect under the random square-wave load disturbance; it can be seen that IQ(λ)-HDQC issues more precise commands and converges faster. The three AGC performance evaluation standards are used to evaluate the different control strategies, and Table 3 compares the four performance indicators of the intelligent strategies. Compared with the other strategies, IQ(λ)-HDQC reduces |ACE| by 37.6%–70.2% and |Δf| by 67.2%–85.5%, while CPS1 is increased by 1.16%–6.33% and energy efficiency by 11.3%–34.9%.

5.4. White Noise Load Disturbance
A 24-hour white-noise load disturbance is applied to the integrated energy system model to simulate the complex condition in which the power system load changes randomly at every moment under large-scale grid connection of unknown new energy. Figure 7 shows that the IQ(λ)-HDQC control strategy accurately tracks the strong stochastic perturbation, and the system remains stable at 12:00 noon when a large amount of new energy, mainly PV, is connected to the grid. The statistical results of the simulation are shown in Figure 8: the IQ(λ)-HDQC strategy reduces |ACE| by 42.8%–89.2%, reduces |Δf| by 43.7%–81.1%, improves CPS1 by 0.03%–1.22%, and improves energy use efficiency by 30.5%–51.4%. In addition, over 40 simulations, the standard deviations of the three indices for the IQ(λ)-HDQC strategy are 0.000144, 1.4277, and 0.32705, respectively. The IQ(λ)-HDQC strategy therefore has stronger antidisturbance capability and significantly improves energy efficiency compared with the other strategies.


6. Conclusion
In this paper, we propose the IQ(λ)-HDQC regulation strategy, a control strategy applicable to microgrids, to achieve source-load-storage-charging collaborative control and an optimal energy benefit for the microgrid model, thereby addressing the strong stochastic disturbances and low utilization of new energy caused by the grid connection of high-proportion, large-capacity new energy.
In the first layer, the IQ(λ)-HDQC regulation strategy adopts the IQ(λ) control strategy, which avoids overestimation and underestimation simultaneously and achieves long-term dynamic stability. Compared with ML-AGC and VWPC-HDC, the proposed algorithm can effectively solve the multisolution problem when the number of agents increases sharply. In the second layer, a consistency-based dynamic optimal power allocation strategy, HDQC, achieves the optimal allocation of new energy sources. Sine-wave, square-wave, and random white-noise load disturbances are respectively introduced into the microgrid model for simulation. Compared with the other control strategies, the results show that IQ(λ)-HDQC has better learning ability and reaches stability quickly in the prelearning stage. Even under strong random disturbances, it performs better and improves the system's energy use efficiency with various new energy sources.
Data Availability
No data were used to support the findings of the study.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
The authors acknowledge the support of the science and technology project managed by the head office of State Grid Co., Ltd. (5400-202118485A-0-5-ZN).