Abstract
Magnetorheological (MR) dampers are intelligent vibration damping devices whose damping can be changed within milliseconds. Traditional semiactive control strategies cannot fully exploit the energy dissipation and vibration reduction capacity of MR dampers under different currents, and it is difficult to control MR dampers accurately. In this paper, a semiactive control strategy based on reinforcement learning (RL) is proposed, which learns through "exploration" the optimal action value of the MR damper at each step of the operation, namely, the applied current value. During damping control, the learned optimal action value for each step is input into the MR damper so that it provides the optimal damping force to the structure. Applying this strategy to a two-layer frame structure shows that it controls the MR damper more accurately and significantly improves its damping effect.
1. Introduction
The MR damper is currently one of the most actively researched and developed intelligent damping devices [1]. It is a new type of device that exploits the MR effect, can be readily combined with a control system, and is reliable. It has become a new generation of vibration damping device for civil engineering structures and has been initially applied to structural vibration control. Ji et al. [2] proposed a scheme using MR damping technology to control pipeline vibration. The test results show that all three pipeline vibration control methods based on MR dampers can effectively control pipeline vibration: the pipeline amplitude and acceleration were reduced by 22.31 dB and 16.34 dB, respectively, while the amplitude attenuation rate and acceleration attenuation rate reached 92.34% and 84.77%, respectively. Ghasemikaram et al. [3] verified that the performance of the MR damper is suitable for controlling extreme cyclic oscillations of wings and external storage devices under flutter conditions. Based on on-site monitoring of wind and rain excitation events, Ni et al. [4] used two MR dampers to control the vibration of the Dongting Lake cable-stayed bridge. Abdeddaim et al. [5] used an MR damper to link two adjacent buildings and effectively reduced the vibration response of the structure.
Because the parameterized model of the MR damper is not as structurally complicated as the nonparametric model, its research and application are extensive. At present, the commonly used dynamic models of the MR damper are the Bingham model [6], the Bouc–Wen model [7], and the modified Bouc–Wen model [8]. Research [9] shows that, compared with the Bingham model, the Bouc–Wen model can accurately reflect the nonlinear behavior of the MR damper at low speeds, reproduces the hysteresis characteristics of the MR damper, and has strong versatility; compared with the modified Bouc–Wen model, it has fewer parameters and is easy to model numerically. It has therefore been widely used in modeling the dynamic characteristics of MR dampers, and this article uses the Bouc–Wen model to calculate the damping force.
The MR damper controls the damping force by adjusting the magnitude of the current or voltage. MR damper vibration control methods mainly include active control, passive control, and semiactive control. Research [5, 10–13] shows that active control and semiactive control have better damping effects than passive control. Compared with active control, semiactive control methods can change the stiffness and damping of the structure with a small amount of energy to reduce its vibration response [14, 15]. Semiactive control combines the excellent control effect of active control with the simplicity of passive control, while overcoming the large energy demand of active control and the narrow tuning range of passive control. Therefore, semiactive control has great prospects for research and application. In the semiactive control of the MR damper, in order to apply the optimal control force to the structure, the control current or voltage of the MR damper must be calculated by the control system through a semiactive control strategy, a problem that has attracted a large number of scholars. Bathaei et al. [16] used two different fuzzy controllers to study the seismic vibration of an adaptive MR damper, which can further reduce the maximum displacement, acceleration, and base shear force of the structure. Hazaveh et al. [17] used the discrete wavelet transform (DWT), the linear quadratic regulator (LQR), and a limiting optimal control algorithm to determine the optimal control force; the semiactive control performance is evaluated by comparing the maximum displacement, total base shear force, and control energy. Kim [18] proposed two semiactive control methods for seismic protection of structures using MR dampers: the first is a simple adaptive control method, and the second is a fuzzy control method based on genetic algorithms. The results show that they can effectively control the displacement and acceleration responses of the structure. The control strategies above improve the vibration reduction effect to varying degrees. However, due to the nonlinearity of the MR damper, they cannot control the MR damper accurately.
RL plays an important role in solving the optimal control of complex linear systems with unknown models. Combined with control theory, it forms adaptive dynamic programming, a data-driven intelligent control framework with learning and optimization capabilities, and it has produced rich theoretical results in the fields of robust control, optimal control, and adaptive control. Value-based RL algorithms obtain the optimal value function, select the action corresponding to the maximum value function, and thereby implicitly construct the optimal strategy. Representative algorithms include Q-learning [19] and SARSA. Q-learning involves selecting the action of maximum value, which makes it more suitable for optimal control than SARSA. In 2014, Brodley and Health [20] proposed a deterministic policy gradient algorithm. In 2015, Littman published a review of RL in Nature [21]. The authors of [22] used the Q-learning algorithm to realize autonomous motion control of robots. Hara et al. [23] used machine learning control algorithms to control robots and improve learning efficiency.
In this paper, we propose a semiactive control strategy based on RL and apply this strategy to the two-layer framework. The results show that it has a significant improvement in vibration reduction effect compared to the semiactive control strategy based on simple Bang-Bang.
The remainder of this paper is structured as follows. Section 2 describes the principle and model of the MR damper. Section 3 introduces the principles of RL. In Sections 4 and 5, the semiactive control strategy based on RL is proposed and applied to a two-layer frame structure; compared with the simple Bang-Bang strategy, it is clearly better. Based on the above results, conclusions are drawn in Section 6.
2. MR Damper Model
2.1. Mechanical Model of MR Damper
The MR damper is filled with MR fluid and has the features of a simple device, low energy consumption, fast response, large damping force, and wide dynamic range. Its structure includes the electrical control line, piston rod, piston, orifice, and buffer accumulator. So far, mechanical models of the MR damper can be roughly divided into two types: parametric models and nonparametric models. Since the nonparametric model has a very complex structure, scholars at home and abroad have focused on parametric models, which fully consider the characteristics of the different stages of MR fluid yielding and the structural characteristics of the MR damper. Parametric models mainly include the Bingham viscoplastic model, the modified Bingham viscoplastic model, the Bouc–Wen model, the modified Bouc–Wen model, and the phenomenological model. The Bouc–Wen model produces a smooth transition curve, fits test results well, is easy to evaluate numerically, has strong versatility, can reflect various hysteresis loops, and has been widely used in modeling hysteretic systems. In this paper, the RD-8041-1 MR damper produced by the American Lord company is used for the vibration reduction control research, and its Bouc–Wen model is

$$F = c_0 \dot{x} + k_0 (x - x_0) + \alpha z.$$

The expression of the hysteretic variable $z$ is

$$\dot{z} = -\gamma |\dot{x}|\, z\, |z|^{n-1} - \beta \dot{x}\, |z|^{n} + A \dot{x}.$$
And its schematic diagram is shown in Figure 1.

Here, $x$ represents the displacement of the piston rod of the MR damper, $c_0$ is the damping coefficient, and $n$ is a constant. $A$ is a coefficient determined by the control system and the magnetic field of the MR fluid. By adjusting the parameters $\gamma$, $\beta$, and $A$, the linearity of the force–velocity curve during unloading and the smoothness of the transition from the pre-yield to the post-yield region can be controlled. $k_0$ is the linear spring stiffness, and $x_0$ is the initial deformation of the spring. Tests of the dynamic characteristics of the MR damper show that the model describes the force–displacement relationship of the MR damper well, and its force–velocity curve is close to the test results.
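For illustration, a minimal Python sketch of one evaluation step of this Bouc–Wen model is given below. The function and parameter names are generic, the explicit Euler update of $z$ is only one possible discretization, and the default $n = 2$ is a placeholder; the numerical parameter values of the RD-8041-1 damper are not reproduced here.

```python
def bouc_wen_force(x, x_dot, z, dt, c0, k0, x0, alpha, gamma, beta, A, n=2):
    """One evaluation step of the Bouc-Wen MR damper model.

    x, x_dot : piston displacement and velocity at the current increment
    z        : hysteretic variable carried over from the previous increment
    Returns the damper force F and the updated hysteretic variable."""
    # Evolution equation of the hysteretic variable z (explicit Euler step)
    z_dot = (-gamma * abs(x_dot) * z * abs(z) ** (n - 1)
             - beta * x_dot * abs(z) ** n + A * x_dot)
    z_new = z + z_dot * dt

    # Total damper force: viscous term + spring term + hysteretic term
    F = c0 * x_dot + k0 * (x - x0) + alpha * z_new
    return F, z_new
```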
2.2. Semiactive Control Strategy of MR Damper
With the MR damper control devices installed, the motion equation of a controlled system with $n$ degrees of freedom subjected to vibration is

$$M\ddot{x} + C\dot{x} + Kx = -MI\ddot{x}_g + EF(t),$$

where $M$, $C$, and $K$, respectively, represent the mass, damping, and stiffness matrices of the structure; $x$ is the displacement vector of the structure relative to the ground; $\ddot{x}_g$ represents the one-dimensional ground acceleration; $F(t)$ is the control force vector generated by the $n$ MR dampers; $I$ is the unit column vector; and $E$ is the position matrix of the MR dampers.
The corresponding state equation is

$$\dot{z} = Az + BF + W\ddot{x}_g,$$

where $z$ is the state vector and $A$, $B$, and $W$ are the state matrix, the control device position matrix, and the earthquake action matrix, respectively.
Semiactive control strives to achieve the optimal control force, so a semiactive control algorithm is needed to control the magnetorheological damper so that it applies the optimal control force $F$. The specific steps are as follows (a schematic sketch of this loop is given after the list):
(1) Obtain the displacement and velocity of each incremental step at the control point through the URDFIL subroutine, and store the data in the global COMMON block.
(2) According to the displacement and velocity, use the semiactive control algorithm to calculate the optimal control force $F$ of the magnetorheological damper based on the Bouc–Wen model, and store the data in the global COMMON block.
(3) Pass the control force $F$ into the DLOAD subroutine through the global COMMON block, thereby applying the control force to the corresponding control region.
(4) Repeat the above process for each incremental step until the end of the analysis.
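A minimal Python sketch of the structure of this loop follows. The paper realizes the data exchange through the URDFIL and DLOAD user subroutines and a COMMON block inside Abaqus; the helper functions `read_response` and `apply_force` below are hypothetical stand-ins used only to illustrate steps (1)–(4).

```python
import numpy as np

def read_response(step, history):
    """Stand-in for URDFIL: return the displacement and velocity of the control
    point at this increment (here read from a precomputed history array)."""
    return history[step, 0], history[step, 1]

def apply_force(step, force, applied):
    """Stand-in for DLOAD: record the control force passed back to the model."""
    applied[step] = force

def run_control_loop(history, controller, dt):
    """Illustrative structure of the per-increment semiactive control loop.
    `controller(x, x_dot, state, dt)` returns the control force and the updated
    internal state of the damper model (e.g., the Bouc-Wen variable z)."""
    n_steps = len(history)
    applied = np.zeros(n_steps)
    state = 0.0
    for step in range(n_steps):                         # (4) repeat every increment
        x, x_dot = read_response(step, history)         # (1) read the response
        force, state = controller(x, x_dot, state, dt)  # (2) optimal control force
        apply_force(step, force, applied)               # (3) apply it to the model
    return applied
```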
The current semiactive control algorithms mainly include the simple Bang-Bang control algorithm, the optimal Bang-Bang control algorithm, and the limit Hrovat optimal control algorithm. This paper uses the commonly adopted simple Bang-Bang control algorithm to realize the semiactive control of the MR damper. The simple Bang-Bang control algorithm can be expressed as

$$c = \begin{cases} c_{\max}, & x\dot{x} \ge 0, \\ c_{\min}, & x\dot{x} < 0, \end{cases}$$

where $x$ and $\dot{x}$ are the displacement and velocity of the damper piston rod relative to the cylinder, $c_{\max}$ is the maximum damping coefficient, and $c_{\min}$ is the minimum damping coefficient.
According to formula (5), the main operation of the simple Bang-Bang algorithm is as follows: when the structure vibrates away from the equilibrium position, the magnetorheological damper applies the maximum damping coefficient to the structure, that is, the maximum current is used; when the structure vibrates toward the equilibrium position, the magnetorheological damper applies the minimum damping coefficient, that is, the current is zero. Therefore, the simple Bang-Bang algorithm is equivalent to switching between Passive-off and Passive-on control, and its actual damping lies between these two.
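A minimal sketch of this switching rule, with an equivalent current command assuming the maximum current corresponds to the maximum damping level:

```python
def simple_bang_bang(x, x_dot, c_max, c_min):
    """Simple Bang-Bang rule: maximum damping coefficient while the structure
    moves away from the equilibrium position (x and x_dot have the same sign),
    minimum damping coefficient while it moves back toward equilibrium."""
    return c_max if x * x_dot >= 0.0 else c_min


def bang_bang_current(x, x_dot, i_max=1.0):
    """Equivalent current command for the MR damper: maximum current away from
    equilibrium, zero current toward it (i_max is the damper's rated current)."""
    return i_max if x * x_dot >= 0.0 else 0.0
```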
3. Principles of Reinforcement Learning
3.1. Basic Concepts of Reinforcement Learning
RL [24] is a branch of machine learning used to solve sequential decision-making problems. RL can learn how to achieve the goals set by a task in a complex and uncertain environment [25].
Figure 2 shows the basic framework of RL. When the agent completes a task, it first interacts with the surrounding environment through actions. Under the joint effect of the action and the environment, the agent enters a new state, and the environment gives an immediate reward.

In this cycle, the agent continuously interacts with the environment and generates a large amount of data. RL uses the generated data to modify its own action strategy, interacts with the environment again to generate new data, and then uses the new data to further improve its own behavior. After many iterations of learning, the agent finally learns the optimal actions for completing the task.
3.2. Markov Decision Processes
The Markov decision process [26] is a framework within which most RL problems can be formulated, and it has been widely used in various RL fields.
3.2.1. Markov Property
The Markov property means that, in a stochastic process, the next state of the system is related only to the current state and has nothing to do with the historical states and actions. Let $\{S_t\}$ be a stochastic process. For any moment $t$ and any states, if the conditional distribution of $S_{t+1}$ given all previous states depends only on the current state $S_t$ and not on the earlier states, that is,

$$P[S_{t+1} \mid S_t, S_{t-1}, \ldots, S_1] = P[S_{t+1} \mid S_t],$$

then the state $S_t$ satisfies the Markov property. The Markov property is also called memorylessness; that is, the current state contains all relevant historical information, and once the current state is known, the history can be discarded.
3.2.2. Markov Decision Processes
The Markov property describes the nature of each state. If every state in a stochastic process satisfies the Markov property, the stochastic process is called a Markov process. A Markov process is a two-tuple $(S, P)$, where $S$ is the finite set of states and $P$ is the state transition probability with elements

$$P_{ss'} = P[S_{t+1} = s' \mid S_t = s],$$

so that the state transition probability matrix is defined as

$$P = \begin{bmatrix} P_{11} & \cdots & P_{1n} \\ \vdots & \ddots & \vdots \\ P_{n1} & \cdots & P_{nn} \end{bmatrix}.$$
A Markov process mainly describes the transition relationships between states. When a reward is assigned to each transition, the process becomes a Markov Decision Process. A Markov Decision Process can be represented by the tuple $(S, A, P, R, \gamma)$, in which $S$ is a finite set of states, $A$ is a finite set of actions, $P$ is the state transition probability, $R$ is the reward function, and $\gamma$ is the attenuation (discount) coefficient. The specific Markov Decision Process is illustrated in Figure 3.

The agent, starting from an initial state $s_1$, selects an action $a_1$ from the action set $A$. After executing $a_1$, the agent transfers to the next state $s_2$ according to the state transition probability $P$. Then the next action $a_2$ is executed and the agent moves to $s_3$, and so on until the final action is completed.
The goal of RL is to find the optimal strategy that obtains the largest cumulative reward from a given Markov Decision Process, which provides a way to evaluate the pros and cons of a strategy. It can be described as

$$\pi^{*} = \arg\max_{\pi}\, \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t} r_{t+1}\right].$$
A strategy $\pi$ specifies the probability $\pi(a \mid s)$ of taking action $a$ in each state $s$; a deterministic strategy specifies a single action in each state. For a given strategy $\pi$, the cumulative return can be calculated, which is defined as

$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots = \sum_{k=0}^{\infty}\gamma^{k} r_{t+k+1}.$$
In order to evaluate the value of each state in RL, a definite quantity is needed to describe the value of state $s$. The cumulative return $G_t$ is a random variable and cannot itself serve as such a quantity, but its expected value is deterministic; therefore, the expected cumulative return is used in RL to quantify the value of each state $s$.
3.2.3. State-Value Function
When the agent adopts strategy $\pi$, the cumulative return follows a distribution, and the expected value of the cumulative return at state $s$ is defined as the state-value function:

$$v_{\pi}(s) = \mathbb{E}_{\pi}\!\left[G_t \mid S_t = s\right] = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+1} \,\middle|\, S_t = s\right].$$
The corresponding state-action value function can be expressed as

$$q_{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+1} \,\middle|\, S_t = s,\ A_t = a\right].$$
The purpose of calculating the state-value function is to build an RL algorithm that obtains the optimal strategy from data. Each strategy corresponds to a state-value function, and the optimal strategy corresponds to the optimal state-value function, which is

$$v^{*}(s) = \max_{\pi} v_{\pi}(s).$$
The optimal state-action value function is the largest state-action value function over all strategies, which can be expressed as

$$q^{*}(s, a) = \max_{\pi} q_{\pi}(s, a).$$
If the optimal state-action value function is known, the optimal strategy can be obtained directly by maximizing $q^{*}(s, a)$, namely,

$$\pi^{*}(a \mid s) = \begin{cases} 1, & a = \arg\max_{a \in A} q^{*}(s, a), \\ 0, & \text{otherwise}. \end{cases}$$
Therefore, RL seeks a strategy that maximizes the value function for any initial state $s_0$.
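In tabular form, extracting this optimal strategy reduces to a row-wise argmax over the Q value table. A minimal sketch, assuming the table is stored as a NumPy array of shape (number of states, number of actions):

```python
import numpy as np

def greedy_policy(Q):
    """Optimal (deterministic) strategy extracted from an optimal state-action
    value table Q of shape (n_states, n_actions): in every state, choose the
    action with the largest Q value."""
    return np.argmax(Q, axis=1)
```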
3.3. Semiactive Control RL Framework of MR Damper
The basic framework of reinforcement learning includes the agent, environment, action, state, and reward. Based on this framework, this paper establishes an RL framework for the semiactive control of the MR damper, as shown in Figure 4. In this framework, the MR damper is the learning subject, that is, the agent; the structure that needs vibration reduction control is the learning environment; the action corresponds to the MR damper applying a damping force to the structure to perform vibration reduction control; the reward is an evaluation, through the evaluation function, of the control effect on the structure; and the state describes the situation of the agent within the environment and is related to both. In the damping control of the MR damper, the state of RL corresponds to the response of the structure.

The agent exists in the environment and takes actions on it; these actions earn the agent corresponding rewards. The purpose of RL is to obtain, through learning, a strategy under which the agent takes appropriate actions at the right time and obtains the greatest reward. In the RL framework of MR damper semiactive control, the MR damper is installed on the structure that needs vibration damping control, and the damping force applied by the damper controls the vibration of the structure. For every action applied by the MR damper, the structure produces a corresponding response. The response is used to determine whether the action of the MR damper reduces the structural response, and a reward, which may be positive or negative, is given according to the evaluation function. Through repeated damping control of the structure, the MR damper continuously explores and learns during the control process and finally learns an optimal control strategy.
3.4. RL Algorithm: Q-Learning Algorithm
RL algorithms mainly include model-based learning algorithms and model-free learning algorithms. Model-based algorithms require the elements of the entire Markov decision process to be known, but when MR dampers are used to reduce structural vibration, it is difficult to know the elements of the corresponding Markov decision process in advance. Therefore, when the model is unknown, that is, when the transition probability and reward function of the Markov decision process are unknown, RL adopts a model-free algorithm. Model-free RL algorithms mainly include two types: Monte Carlo RL algorithms and temporal-difference algorithms. A model-free algorithm obtains the optimal strategy through learning when the Markov Decision Process is unknown and does not rely on a transition probability or reward model built from previous experience [27].
Figure 5 shows how different RL algorithms update the value function. As seen in Figure 5, the Monte Carlo algorithm [28] needs a complete trajectory to calculate and update a state-value function, resulting in low efficiency. The Q-learning algorithm and the SARSA algorithm update the value function after each step of the strategy, so their efficiency is higher. In addition, among the three algorithms, the Q-learning algorithm [29] is superior in convergence and stability, so this paper uses the Q-learning algorithm to study the semiactive control of the MR damper.

Q-learning is an off-policy temporal-difference method first proposed by Watkins in his doctoral thesis in 1989. Q-learning uses the state-action value $Q(s, a)$ as the estimation function. During the interaction, the value of a correct action keeps increasing; otherwise, the value decreases. By comparing the Q values, the agent tends to choose the best action. The update formula is

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a' \in A} Q(s', a') - Q(s, a)\right],$$

where $\gamma$ is the attenuation coefficient, which reflects the influence of future strategies on the present under the current strategy, with value range $[0, 1]$; $\alpha$ is the learning rate, which is the ratio at which the newly learned value replaces the old value: $\alpha = 0$ means the Q value table is not updated, and $\alpha = 1$ means the old value is completely replaced. $A$ is the action set and $S$ is the state set.
The calculation process of the Q-learning algorithm is shown in Algorithm 1. The main steps of Q-learning are as follows: (1) use the ε-greedy strategy to select an action; (2) the agent takes the action and obtains the reward and the new state; (3) update the Q value table using formula (16).
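Algorithm 1 is not reproduced here; the following is a minimal, generic tabular Q-learning sketch consistent with the three steps above. The environment interface `step(state, action) -> (reward, next_state)` and the run layout are assumptions for illustration, not the paper's Abaqus implementation; the default hyperparameters mirror those used later in the paper.

```python
import numpy as np

def q_learning(n_states, n_actions, step, n_runs=1000,
               alpha=0.8, gamma=0.4, epsilon=0.1, run_length=100):
    """Generic tabular Q-learning following the three steps listed above."""
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng()
    for _ in range(n_runs):
        s = 0                                        # start of the load history
        for _ in range(run_length):
            # (1) epsilon-greedy action selection
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))     # explore
            else:
                a = int(np.argmax(Q[s]))             # exploit learned experience
            # (2) take the action, observe the reward and the new state
            r, s_next = step(s, a)
            # (3) update the Q value table (formula (16))
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```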
4. Study on the Effect of Different Vibration Reduction Control Methods
This section adopts three passive control methods, Passive-off, Passive-on (I = 0.4 A), and Passive-on (I = 0.6 A), and the semiactive control method (simple Bang-Bang) to control the vibration of the structure. Figures 6–8 show the vibration reduction effect of the four control methods with 4 dampers installed on the upper part of the first-story columns. It can be seen from the figures that the semiactive control method has the best damping effect and Passive-off has the worst, while Passive-on also shows a good control effect; for Passive-on, increasing the input current improves the vibration reduction effect and reduces the structural response. In addition, it can be seen from Figure 8 that although the acceleration response under simple Bang-Bang semiactive control is the smallest, the acceleration shows sudden local changes. This is because simple Bang-Bang assumes that when the structure vibrates away from the equilibrium position, the MR damper adopts the maximum damping coefficient, that is, the maximum current, and when the structure vibrates toward the equilibrium position, the MR damper adopts the minimum damping coefficient, that is, zero current. It is equivalent to switching between Passive-off and Passive-on control, and the actual damping force of the simple Bang-Bang control algorithm jumps between these two. Therefore, the abrupt change of damping force causes a sudden change in acceleration.



5. Semiactive Control Strategy Based on RL Algorithm
5.1. Task Description
In the semiactive control of the MR damper, the goal of the semiactive control algorithm is to calculate the optimal damping force to be applied to the structure, so as to achieve the optimal control effect [30]. Therefore, the goal of RL is to learn the optimal damping force of the MR damper at each step. Based on the RL framework for semiactive control and the Q-learning algorithm, this paper proposes a semiactive control strategy based on the RL Q-learning algorithm; the specific principle is shown in Figure 9. The method includes two modules: the learning module and the semiactive control module. In the learning module, the system applies a current to the MR damper through the Q-learning algorithm so that the damper controls the vibration of the structure. Based on the structure's response (displacement, velocity, and acceleration) and the corresponding reward evaluation function, the system calculates the reward value of each action, and in each state the action with the largest value is selected as the optimal action for that state. After a certain period of learning, a Q value table is formed; the Q value table maps states to actions for the semiactive control strategy, and from it the optimal control strategy, that is, the optimal control current, can be obtained. In the semiactive control module, the semiactive control of the structure by the MR damper is realized by calling the learned control strategy, that is, the current value learned for each step.

5.2. Establishment of Reward Evaluation Function
The reward evaluation function is the feedback of the environment to the agent's decision in RL, reflecting the effect of the agent's actions on the environment. The MR damper selected in this paper adjusts the damping force by controlling the applied current, which changes the strength of the magnetic field. Therefore, for this type of MR damper, applying a current to the MR damper is taken as the action of the agent in RL. In the finite element analysis, the ratio $\lambda$ of the response value of each incremental step when the structure is controlled by the MR damper to the corresponding uncontrolled response value is used to reflect the reduction achieved by each action. Therefore, the reward evaluation function can be expressed as

$$R = C\,(1 - \lambda),$$

where $C$ is the reward magnification factor.
It can be seen from (17) that when $\lambda = 1$, the response of the structure under the action does not change and the action has no damping effect, so the reward of the action is 0. When $\lambda > 1$, the response of the structure under the action increases; the action not only fails to reduce the vibration but amplifies the response, so the reward (penalty) of the action is negative. When $\lambda < 1$, the response of the structure under the action is reduced and the action has a damping effect, so the reward is positive. Therefore, the more an action reduces the structural response, the higher its reward; conversely, the more an action increases the structural response, the higher its penalty. Through the reward evaluation function, the RL algorithm assigns a reward score to each action. Finally, the optimal action at each step is determined, forming the optimal control strategy and realizing the optimal control of the MR damper.
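As an illustration, here is a sketch of such a reward evaluation, assuming the form $R = C(1-\lambda)$ with $\lambda$ the controlled-to-uncontrolled response ratio; the exact expression of equation (17) and the value of $C$ may differ:

```python
def reward(controlled_response, uncontrolled_response, C=10.0):
    """Reward evaluation (assumed form R = C * (1 - lambda)):
    zero when the action leaves the response unchanged (lambda = 1),
    negative (a penalty) when it amplifies the response (lambda > 1),
    positive when it reduces the response (lambda < 1).
    C is the reward magnification factor; the value 10.0 is illustrative."""
    lam = controlled_response / uncontrolled_response
    return C * (1.0 - lam)
```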
5.3. Greedy Strategy
In the RL Q-learning algorithm, an appropriate action is selected based on the current state and the Q value table. Two methods are generally used to select an action. The first is based on past "experience"; that is, the action with the highest score is selected in every learning step. The second is the "exploratory" method; that is, an action is chosen at random in every learning step. If only "experience" is used and the action with the highest score is always selected, the learning is likely to be confined to existing experience, and it becomes difficult to find more valuable actions. However, if only the "exploratory" method is used and actions are always selected at random, most of the selected actions may have little or no value, and the Q value table converges more slowly.
Therefore, in order to balance "experience" and "exploration" in the algorithm, the ε-greedy strategy is used. The mathematical expression of the ε-greedy strategy is

$$\pi(a \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|A(s)|}, & a = \arg\max_{a} Q(s, a), \\[2mm] \dfrac{\varepsilon}{|A(s)|}, & \text{otherwise}. \end{cases}$$
Under this strategy, the probability of the action that maximizes the value function being selected is $1 - \varepsilon + \varepsilon/|A(s)|$, and the probability of each nonoptimal action being selected is $\varepsilon/|A(s)|$. Therefore, under the ε-greedy strategy, every action may be selected, and different learning paths are generated over multiple learning runs. In RL, a small value of ε is generally set first: according to the above formula, the agent explores by selecting a random action with probability ε and takes the action with the largest learned Q value with probability $1 - \varepsilon$.
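A common implementation draws a uniformly random action with probability ε and the greedy action otherwise, which reproduces the selection probabilities above; a minimal sketch:

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng=None):
    """Select an action from one row of the Q value table:
    with probability epsilon pick a uniformly random action ("exploration"),
    otherwise pick the action with the largest learned Q value ("experience")."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))
```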
5.4. Algorithm Implementation
This section implements the semiactive control strategy of the MR damper based on the RL Q-learning algorithm through the secondary development of Abaqus; the algorithm flowchart is shown in Figure 10. According to the Bouc–Wen model and the MR damper selected in this paper, only the current needs to be controlled in order to adjust the damping applied to the structure. In this section, applying a current to the MR damper is taken as the action $a$ in RL, and the response of each incremental step in Abaqus is taken as the state $s$. Because the MR damper selected in this article has a maximum operating current of 1.0 A, there are 11 optional actions in each state $s$, namely, applying 11 different current intensities: I = 0 A, 0.1 A, 0.2 A, 0.3 A, 0.4 A, 0.5 A, 0.6 A, 0.7 A, 0.8 A, 0.9 A, and 1.0 A.
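A minimal sketch of this discrete action set (variable names are illustrative):

```python
import numpy as np

# Discrete action set: 11 current levels from 0 A up to the damper's
# maximum operating current of 1.0 A
CURRENTS = np.linspace(0.0, 1.0, 11)   # [0.0, 0.1, ..., 1.0]

def action_to_current(action_index):
    """Map a Q-table action index (0-10) to the current, in amperes,
    applied to the MR damper."""
    return float(CURRENTS[action_index])
```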

By inputting different currents (actions), the MR damper controls the vibration of the structure, and the response value of each incremental step is output. The reward value of each action is calculated through the reward evaluation function to form a reward value table, the $R$ value table. The reward values calculated by the reward evaluation function are stored in the $R$ value table, which is the $m \times n$ matrix

$$R = \begin{bmatrix} r_{11} & \cdots & r_{1n} \\ \vdots & \ddots & \vdots \\ r_{m1} & \cdots & r_{mn} \end{bmatrix}.$$
Here, $r_{ij}$ represents the reward value of action $j$ in state $i$, where $m$ is the number of states and $n$ is the number of actions. After the reward value table is obtained, the RL Q-learning algorithm can learn according to the $R$ value table. In reinforcement learning, the algorithm records the value obtained in each learning step in the Q value table; that is, the Q value table stores the learned experience values. The Q value table is a matrix of the same order as the $R$ value table:

$$Q = \begin{bmatrix} q_{11} & \cdots & q_{1n} \\ \vdots & \ddots & \vdots \\ q_{m1} & \cdots & q_{mn} \end{bmatrix}.$$
Here, $q_{ij}$ represents the experience value learned for action $j$ in state $i$, where $m$ is the number of states and $n$ is the number of actions. The Q value table is updated according to the Q-learning algorithm:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a' \in A} Q(s', a') - Q(s, a)\right],$$

where $\gamma$ is the attenuation (discounting) factor, $\alpha$ is the learning rate, $A$ is the action set, and $S$ is the state set.
Finally, through the above steps, the final reward value of each action (current) in each state, that is, in each incremental step, is obtained. The optimal action (current) of each step can therefore be selected to form the optimal control strategy. In subsequent damping control with the MR damper, this strategy can be used to perform the optimal semiactive control of the MR damper.
6. Simulation Results and Analysis
6.1. Calculation Model and Boundary Conditions
This section takes a two-layer frame structure (including plates, beams, and columns) as an example and uses the secondarily developed Abaqus program to study the semiactive control strategy based on the RL Q-learning algorithm. The calculation model is shown in Figure 11, and the structural and material parameters are listed in Tables 1 and 2. The element type is the spatial second-order ten-node tetrahedral element (C3D10). A tangential dynamic load of 2000 N is applied to the center of the first floor of the structure at a frequency of 8.333 Hz; the time history of the external load is shown in Figure 12, and the bottom of the structure is set as a fixed constraint. In studies of MR damper vibration reduction, in order to apply the model to large-scale structures, the model parameters are generally scaled up so that the damper can be used in large-scale civil engineering [31]. Therefore, in order to obtain a good damping effect, this section enlarges the performance parameters of the RD-8041-1 MR damper by a factor of 5 and then conducts the vibration damping study on the structure.


6.2. Selection of Reward Evaluation Function
In order to construct the reward evaluation function in RL, it is necessary to determine the evaluation index of the state. This section uses three state variables, displacement, velocity, and acceleration, to establish reward evaluation functions and determines which index gives the best learning effect. The RL parameters are listed in Table 3: the learning rate α = 0.8, the attenuation coefficient γ = 0.4, the greedy strategy ε = 0.1, and the number of learning runs is 1000.
Each of the three reward evaluation functions was learned 1000 times, and the corresponding optimal actions were obtained. The results are shown in Figure 13: Figure 13(a) shows the average reward value for the different reward evaluation functions, and Figures 13(b)–13(d) show the action values after the 1000th learning run with displacement, velocity, and acceleration as the reward evaluation function, respectively.

It can be seen from Figure 13(a) that the reward value converges after about 600 learning runs for the three reward evaluation functions. Figures 13(b)–13(d) show the learning results with displacement, velocity, and acceleration as the reward evaluation function, that is, the corresponding action value (the applied current) in each state. Table 4 and Figures 14–16 show the vibration reduction effect of the MR damper after learning with the different reward evaluation functions. The results show that the vibration reduction effect after learning with the velocity reward evaluation function is the best, and the learning effect with the acceleration reward evaluation function is the worst. With the velocity reward evaluation function, the maximum displacement response is reduced by 45.63%, the maximum velocity response by 47.73%, and the maximum acceleration response by 48.17%. The learning effect of the displacement reward evaluation function is close to that of the velocity reward evaluation function.



6.3. Selection of Reinforcement Learning Parameters
The main parameters of RL Q-learning include the learning rate α, the attenuation coefficient γ, and the greedy strategy ε value. Different parameter values have a considerable impact on the correct rate and success rate of learning, so a reasonable selection of the RL parameters yields the best learning effect. Here, the correct rate of learning represents the probability that the learned action is the action with the largest reward value, and the success rate represents the completion ratio of the learned reward value table.
6.3.1. Selection of Learning Rate α
The learning rate determines how much of the reward brought by the current strategy is incorporated during the update of the Q value table. In order to study the influence of the learning rate α on the learning effect, this section sets up 5 working conditions, with α = 0.2, 0.4, 0.6, 0.8, and 1.0; the attenuation coefficient γ is 0.4, the greedy strategy ε value is 0.1, and the number of learning runs is 1000.
Table 5 shows the effect of the learning rate α on the learning effect. For working condition 1 the learning rate is 0.2, the correct rate of the action is 40.73%, the success rate is 80.83%, and the learning effect is poor. As the learning rate α increases, the learning effect gradually improves. When the learning rate α is increased to 0.8 and 1.0, the correct rate of the action reaches 100.00%, the success rate reaches 100.00%, and the calculation converges. Therefore, as the learning rate increases, the correct rate of RL also increases. However, in RL Q-learning, when the learning rate α = 1.0 the Q value table always adopts the current strategy and discards the values already in the table, so the Q value table is continuously updated and occupies excessive computing resources. Figure 17 shows the average reward value for the different learning rates α. It can be seen from the figure that as the learning rate increases, the convergence speed also increases.

6.3.2. Selection of Attenuation Coefficient γ
The attenuation coefficient γ represents the influence of future strategies on the present under the current strategy, and its value range is $[0, 1]$. In order to study the influence of the attenuation coefficient γ on the learning effect, four working conditions are set, with γ = 0.2, 0.4, 0.6, and 0.8, and 1000 learning runs. Table 6 shows the influence of the attenuation coefficient γ on the learning effect. The attenuation coefficient γ in working condition 6 is 0.2; the correct rate of the action is 99.19% and the success rate is 99.89%. In working conditions 7 and 8 the attenuation coefficient γ is 0.4 and 0.6, and both the correct rate and the success rate reach 100.00%. When the attenuation coefficient γ increases to 0.8, the correct rate is 99.60% and the success rate is 99.78%. Figure 18 shows the average rewards under the different attenuation coefficients γ. It can be seen that the attenuation coefficient γ has a considerable impact on the average reward value but only a small impact on the convergence speed of the calculation.

6.3.3. Selection Greedy Strategy ε Value
The greedy strategy ε value is used to balance "experience" and "exploration" in RL Q-learning. In order to study the influence of the ε value on the learning effect, this section sets 4 working conditions for the greedy strategy ε. Table 7 shows the influence of the ε value on the learning effect. The ε value of working condition 10 is 0.1; the correct rate is 81.85%, the success rate is 96.02%, and the learning effect is relatively poor, because there is a 90% probability that the agent acts according to the Q values of the learned experience, that is, it directly selects the action with the largest Q value. At this setting, learning relies mainly on "experience," and the probability of exploring new actions is small. As the value of ε increases, the probability of the agent exploring new actions increases, so the learning effect of working condition 11 improves and the success rate reaches 100.00%, but the correct rate does not reach 100.00%. The ε value of working condition 12 is 0.9, and both the success rate and the correct rate reach 100.00%. Figure 19 shows the average reward for the different ε values. It can be seen from the figure that the ε value mainly affects the convergence speed of RL: when the value is increased, learning explores new actions with a higher probability, thereby improving the convergence speed of the calculation. However, when the ε value of working condition 13 is 1.0, its convergence speed is lower than that of working condition 12, because if the agent focuses only on exploring new actions, that is, if it takes entirely random actions, most of the learned actions are of no value, which degrades the learning effect.

In summary, in RL Q-learning, a higher learning rate α can improve the correct rate and convergence speed of learning, but a learning rate that is too high causes the Q value table to be updated constantly and occupies computing resources. Therefore, a larger value can be set at the beginning of learning and reduced as learning proceeds to lower the occupation of computing resources. The attenuation coefficient γ represents the impact of future strategies on the present under the current strategy; it has a considerable impact on the average reward value but a small impact on convergence. The ε value mainly affects the convergence speed of RL: when the ε value is increased, learning explores new actions with a higher probability, thereby increasing the convergence speed. However, if the agent focuses only on exploring new actions, that is, takes entirely random actions, most of the learned actions are of no value, which slows the convergence of learning.
6.4. Vibration Reduction Control Effect of Semiactive Control Strategy Based on RL
In this section, simple Bang-Bang control and the semiactive control strategy based on the RL Q-learning algorithm are used to study the vibration reduction control of the two-layer frame structure. During the vibration damping control, 4 MR dampers are installed on the upper parts of the first-story columns. It can be seen from Table 8 that, of the two semiactive control strategies, the RL strategy has the better effect: the maximum displacement, velocity, and acceleration responses are reduced by 83.50%, 83.83%, and 83.62%, respectively. Compared with simple Bang-Bang control, the reductions in the maximum displacement, velocity, and acceleration responses are improved by 3.50%, 6.00%, and 9.36%, respectively.
Figure 20 shows the response time history curves of the central point of the second floor of the structure under the two semiactive damping strategies. It can be seen from Figure 20 that the damping effect of the RL strategy is better than that of simple Bang-Bang for the displacement, velocity, and acceleration responses. In addition, Figure 20(c) shows that the acceleration time history curve of the RL strategy changes more smoothly, without the local acceleration jumps seen under simple Bang-Bang control.

7. Conclusions
This paper proposes a semiactive control strategy for the MR damper based on the RL Q-learning algorithm. According to the structural response, the corresponding reward evaluation functions are established, and the strategy is realized through the secondary development of Abaqus. Taking a two-layer frame structure as an example, vibration damping control is implemented through the semiactive control strategy based on RL Q-learning and compared with simple Bang-Bang control, and the following conclusions are drawn.
The proposed method continuously learns through "exploration" to obtain the optimal action value of the MR damper at each step, that is, the applied current. During the vibration damping control, the optimal action value learned for each step is input into the MR damper so that it provides the optimal damping force to control the structural vibration. The results show that the semiactive control strategy based on RL Q-learning is simple, easy to implement, and robust. Applying the RL semiactive control strategy to the vibration reduction of the two-layer frame structure shows that RL performs better than simple Bang-Bang control, and the acceleration time history curve of the RL strategy changes more smoothly.
For the RL algorithm, this paper establishes three reward evaluation functions based on displacement, velocity, and acceleration. The RL and vibration damping results show that the action learned with the velocity reward evaluation function gives the best vibration damping effect.
This paper also discusses the impact of the three main RL parameters, the learning rate α, the attenuation coefficient γ, and the greedy strategy ε, on the learning effect. A higher learning rate can improve the correct rate and the convergence speed of learning, but a learning rate that is too high keeps updating the Q value table and occupies too many computing resources; therefore, a larger value can be set at the beginning of learning and reduced as learning proceeds. The attenuation coefficient has a considerable impact on the average reward value but a small impact on convergence. The greedy strategy ε value mainly affects the convergence speed of RL: when ε is increased, learning explores new actions with a higher probability, thereby increasing the convergence speed. However, if the agent focuses only on exploring new actions, that is, takes entirely random actions, most of the learned actions are of no value, which slows the learning.
Data Availability
The codes used in this paper are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this study.
Acknowledgments
This study was funded by the National Natural Science Foundation of China (grant no. 51579089).