Abstract
Existing game-theoretic cyber deception decision-making models focus primarily on the selection of spatial strategies and ignore the optimal defense timing, which can compromise the execution of a defense strategy. Consequently, this paper presents a deception strategy selection method based on a multi-stage Flipit game. First, based on an analysis of cyber deception attack and defense, we propose the concept of a moving deception attack surface and analyze the characteristics of deception attack-defense interaction behaviors under the Flipit game model. The Flipit game model is then used to build a single-stage spatial-temporal deception decision-making model. Additionally, we introduce a discount factor and stage transition probabilities to extend the single-stage game into a multi-stage cyber deception model, provide the utility function of the multi-stage game model, and design a Proximal Policy Optimization algorithm based on deep reinforcement learning to compute the defender’s optimal spatial-temporal strategies. Finally, we use an application example to validate the effectiveness of the model and the advantages of the proposed algorithm in generating multi-stage cyber deception strategies.
1. Introduction
With the development of attack technologies and tools, cyber attacks have become increasingly complex, covert, and persistent, as exemplified by Advanced Persistent Threats (APTs) [1]. However, information systems with static attributes, such as cloud computing environments, mainly rely on passive defense mechanisms (e.g., access control [2] and intrusion detection [3]), which are insufficient to resist such attackers. As a result, adaptive and sophisticated attackers often hold asymmetric resource advantages (e.g., time and prior knowledge of vulnerabilities) over the defender. To change this situation, cyber deception [4], as an active defense technology, misleads the attacker’s decision-making process by manipulating the attacker’s cognition, leading the attacker to choose suboptimal attack behaviors. Deception can therefore effectively delay and disrupt the continuity of the attack process. Cyber deception has been widely applied to enterprise networks [5–7], cyber-physical systems [8, 9], cloud environments [10–13], the Internet of Things [14, 15], and software-defined networks [16–18].
Despite the broad application prospects of cyber deception techniques, two challenges remain in implementing optimal deception strategies. On the one hand, when deploying deception assets, the defender needs to consider diverse and heterogeneous deception configurations to maximize the deception effect and disrupt the attacker’s perception of the system attack surface. On the other hand, the defender needs to consider the migration cost of deception assets to prevent excessive resource consumption caused by overly frequent migrations. Therefore, to balance deception effectiveness and deception cost, the defender must consider both the best deception configuration and the best migration period simultaneously. To this end, a deception model must precisely characterize the defender’s spatial-temporal decision-making.
However, most deception decision-making models focus on spatial decision-making, such as spatial deception decision-making models based on attack graphs [19–21] and spatial deception decision-making models based on game theory [22–31]. The attack-graph-based spatial deception decision-making model represents the attacker’s intrusion process and paths as an attack graph: after deploying deception assets in the system, the defender can trap the attacker into intruding along the attack graph and then detect and block the attack process. Attack-graph-based deception decision-making can be traced back to the general attack graph model proposed by Cohen and Koike [19], in which the defender guides the attacker onto a designed path by deploying a set of deception resources. Since then, various attack graph models have been proposed to model deception, such as multi-layer attack graphs and Bayesian attack graphs. For example, Milani et al. [20] proposed using a state-based Bayesian attack graph to represent the attacker’s paths; the defender adopts two types of spatial strategies, deception and protection, to manipulate the attack graph, which changes the attack paths found by the attacker through reconnaissance and prevents the attacker from locating critical assets. Sayari et al. [21] proposed using a multi-layer attack graph to model deception; under the assumption that the defender takes deceptive actions such as setting up false network services, creating breadcrumbs or honeytokens, creating false local or domain accounts, and creating bait files or documents, the defender’s optimal deceptive resource placement strategy is obtained. The game-theory-based deception decision-making model treats the attacker and defender as game players, constructs their strategy spaces and utility functions under reasonable assumptions, and then computes the Nash equilibrium to obtain the optimal deception strategy. Currently, the main game models are dynamic games and Stackelberg games. Carroll and Grosu [22] proposed using a signaling game to model the interaction between the attacker and defender and obtained the optimal deception strategy by computing the perfect Bayesian equilibrium. Huang and Zhu [23] proposed a dynamic Bayesian game with two-sided incomplete information to model the attack process of advanced persistent threats; the optimal deception strategy was obtained using the perfect Bayesian Nash equilibrium. Ahmadi et al. [24] proposed using a two-player partially observable stochastic game to model the defender and attacker and used a mixed-integer linear program and the infiltration strategy to calculate a robust deception strategy. Ye et al. [25] proposed a differentially private dynamic game approach to model the behavior of deceiving defenders and attackers: defenders strategically change the number of systems and obfuscate system configurations through a differential privacy mechanism, while attackers use Bayesian inference to infer the true configurations. With this method, the impact of changes in the number of systems on network security can be addressed while attackers with different attack capabilities can be effectively defended against. Schlenker et al. [26] introduced a novel deception game model.
This game model uses a zero-sum Stackelberg game to model the interaction between the cyber deception defender and a rational attacker. The authors proved that computing the optimal deception strategy is NP-hard and provided a mixed-integer linear program to compute it. Yin et al. [27] proposed using a Stackelberg game to model the behavior of the attacker and defender, analyzed their strategies in terms of pure and mixed strategies, and proposed an ORIGAMI-based algorithm to obtain the optimal defense strategy. Anjum et al. [28] introduced a method that uses deception traffic to confuse the network information obtained by attackers through network reconnaissance; to obtain the optimal deception traffic placement strategy, the authors used a two-player non-zero-sum Stackelberg game to model the actions of the attacker and defender and verified the effectiveness of the method in a Mininet experimental environment. Ngo et al. [29] proposed modeling the behavior of attackers and defenders with a Stackelberg game on a large Active Directory attack graph, in which the defender employs a set of honeypots to prevent the attacker from reaching high-value targets.
The above research on the deception decision-making problem considers only spatial decision-making and ignores the impact of temporal decision-making on the deception defense process. With the development of AI, some deception decision-making studies that consider both temporal and spatial decisions have therefore emerged. These methods use AI technologies to perceive the occurrence of attacks and then change the deployment of deception strategies. For example, Dowling et al. [32] proposed an adaptive honeypot deployment strategy based on SARSA, Wang et al. [33] proposed a dynamic deployment strategy for intelligent honeypots based on Q-learning, and Abay et al. [34] proposed a honey-data generation method based on deep learning.
All three classes of models above can guide the selection of deception strategies. However, the attack-graph-based deception decision-making method does not consider the interaction between attacker and defender when modeling the attacker’s behavior, and it considers only spatial decision-making, which is inconsistent with real attack and defense interactions. Although machine-learning-based deception decision-making considers the optimal timing, it requires large data sets, and the availability of these data sets directly affects detection effectiveness. Game-theory-based deception decision-making can compensate for the shortcomings of the above two methods. However, at present, this method only considers optimal strategies based on spatial decision-making, which oversimplifies the defender’s strategy space. In addition, most game-based deception decision-making methods only consider the interaction between attacker and defender in a single stage and do not consider the effect of changes in the attacker’s behavior on the defender’s strategy.
To address the above shortcomings, we construct a multi-stage deception game model based on the Flipit game that considers both temporal and spatial decision-making. The Flipit game model provides a continuous strategy space for both attacker and defender, which well describes the timing characteristics of deception asset migration. We construct the utility function by introducing a discount factor and stage transition probabilities. Based on the analysis of the multi-stage network deception model, an optimal deception strategy selection algorithm based on deep reinforcement learning is designed, and the effectiveness of the model is verified by simulation experiments.
The main contributions of this paper are as follows:
(1) Different from the moving attack surface and the deception attack surface, this paper constructs a deception defense model based on the moving deception attack surface, which can accurately represent the deception attack and defense process.
(2) We analyze the impact of spatial-temporal deception on attackers by modeling deception decision-making as a Flipit game and introducing temporal decision-making into the deception model.
(3) By introducing a discount factor and stage transition probabilities, the single-stage Flipit game model is extended to a multi-stage Flipit game model, and we analyze how the defender’s deception strategy adapts to changes in the attacker’s strategy.
(4) A deep reinforcement learning algorithm is proposed to obtain the multi-stage deception defense strategy. The experimental results verify that the multi-stage Flipit game model can effectively describe and characterize the security evolution of the deception system.
2. Analysis of Attack and Defense
2.1. Moving Deception Attack Surface
The attack surface, as an evaluation method that describes the system security status and potential security risks, can effectively represent the set of vulnerable system resources, so it is often used to model network attack and defense. In [35], Manadhata first defined the attack surface. The attack surface is the subset of system resources, mainly comprising methods, channels, and data, that an attacker can use: the attacker connects through the system’s channels, invokes the system’s methods, sends data to or receives data from the system, and thereby launches an attack. The author then introduced the application of attack surface evaluation to Moving Target Defense (MTD) and put forward the concept of attack surface shuffling: by shuffling the system attack surface, the defender can effectively reduce the exposure time of system resources and the probability of a successful attack. However, the attack surface shuffling proposed by Manadhata et al. assumes that the system attack surface is time-invariant, which is not in line with the characteristics of MTD. Therefore, Huang and Ghosh [36] proposed the concept of a moving attack surface with time-varying characteristics, which makes it impossible for the attacker to determine whether the resource vulnerabilities it exploits will remain reachable over a fixed period. Later, Albanese et al. [37] and Ma et al. [38] introduced the attack surface into cyber deception and proposed the concept of a virtual attack surface to represent the attacker’s view of system resources and the concept of a deception attack surface to represent the attack surface observed and perceived by the attacker. Unlike the moving attack surface used in MTD, the attack surface used in deception can be changed by adding deception assets and building deception topologies, which accurately represents the attacker’s perceived view of system resources. However, the existing attack surface concepts cannot represent the scenario of deception asset migration, so, based on the concepts of the deception attack surface and the moving attack surface, we propose the concept of the moving deception attack surface for the scenario considered in this paper.
As shown in Figure 1, unlike the moving attack surface and the deception attack surface, the moving deception attack surface proposed in this paper shuffles the deception attack surface of the system instead of its real attack surface, that is, it shuffles the deception assets to build a dynamic, multi-dimensional deception topology. This method increases the space the attacker must explore by building heterogeneous deception assets in the spatial dimension. Moreover, shuffling deception assets in the temporal dimension renders the information collected by the attacker obsolete, which increases the uncertainty of the attacker’s perception of the system resource vulnerabilities. Following the definition of Lei et al. [39], this paper defines the moving deception attack surface as follows:

Definition 1. The Moving Deception Attack Surface (MDAS) is composed of a triple consisting of the Deception Attack Surface Dimension, the Deception Attack Surface Value, and the shuffling period, and represents the vulnerability set of network resources exposed by the system to the attacker in any given shuffling period. The vulnerability set is the union of real assets and deception assets. The Deception Attack Surface Dimension (DASD) refers to the types of network resources in the target system, including both deceptive and real resources, which usually take the form of network addresses, ports, services, protocols, etc. The Deception Attack Surface Value (DASV) refers to the value assigned to each deception resource type. Because the MDAS changes with time, the attack surface of the target system at a given time is described by the DASD at that time, i.e., the types of network resources exposed by the target system, and the DASV at that time, i.e., the values of the network resources in the different dimensions, where a value of 1 indicates that the corresponding dimension is a deceptive attack surface at that time.
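A minimal data-structure sketch of Definition 1 is given below. The class and field names, the binary DASV encoding, and the example resource values are assumptions made for illustration only, not notation taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class MovingDeceptionAttackSurface:
    """Snapshot of the MDAS exposed to the attacker during one shuffling period."""
    shuffle_period: float        # length of the shuffling period (seconds)
    dimensions: Dict[str, str]   # DASD: resource exposed in each dimension (address, port, service, ...)
    values: Dict[str, int]       # DASV: 1 if the dimension is a deceptive attack surface, else 0

# Hypothetical snapshot: a decoy Nginx service on a decoy port, hosted at a real address.
mdas_t = MovingDeceptionAttackSurface(
    shuffle_period=20.0,
    dimensions={"service": "Nginx honeypot", "port": "8080", "address": "10.0.0.12"},
    values={"service": 1, "port": 1, "address": 0},
)
```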
In the moving deception attack surface, the defender can shuffle the deception attack surface in two ways: shuffling the decoy and changing the camouflage type. At the same time, the defender can obtain information about the attacker through the deception topology.
2.1.1. Shuffling a Decoy
For cyber deception, the defender usually deploys honeypots, honeynets, and other deception resources to lure the attacker toward deception assets and thereby delay the time to a successful attack. In order to prevent attackers from discovering the deception assets, MDAS increases the diversity of system resource vulnerabilities by shuffling the type of honeypot (such as switching between high-interaction and low-interaction honeypots), the services and vulnerabilities deployed in the honeypot (such as switching between Nginx and Apache services), and the number of honey files deployed in the honeynets.
2.1.2. Changing Camouflage Type
Disguising honest nodes confuses attackers through active responses, such as having the system simulate different versions of services to respond to attackers or opening all services to respond to attackers’ requests. Therefore, the moving deception attack surface can enhance the system’s obfuscation by changing the type of actively responding service, making the attacker unable to distinguish between deception assets and real assets.
2.2. Analysis of Attack and Defense Behavior
In order to evaluate the effectiveness of cyber deception, this paper uses the Flipit game model, as shown in Figure 2, to evaluate the effectiveness of the moving deception attack surface. In the Flipit game model, the defender’s policy is divided into spatial decision-making and temporal decision-making. Temporal decision-making refers to the interval at which the defender chooses to take control of the target system, as shown in Figure 2. Spatial decision-making refers to the defender choosing different deception assets during the interval in which it controls the target system. According to the attack and defense strategies defined in the Flipit game model, the assumptions on the attacker and defender are as follows, and the resulting interaction is illustrated by the simulation sketch after this list:
(1) The time interval at which the attacker takes control of the target system follows an exponential distribution. Within its control interval, the attacker can probe the attack surface exposed by the target system and then find resource vulnerabilities to launch further attacks.
(2) Because the deployed deception topology can capture part of the attacker’s information, this paper assumes that the defender is the Last Move (LM) defender of the Flipit game. As shown in Figure 1, the defender can observe the time interval from the attacker’s last takeover of the target system to the present, as well as the duration of its own current control. Therefore, the LM defender has a certain information advantage over the attacker.
(3) The time interval at which the defender takes control of the target system is periodic; the defender selects a periodic strategy as its own strategy. The deception attack surface can be shuffled within this interval to build a dynamic multidimensional deception topology and enhance the deceptive ability of the target system. Different defender strategies yield different utilities. With a longer period, the defender controls the target system longer and can spend more time deploying more complex deception assets, so the probability of the attacker discovering the deception assets is smaller and the defender’s utility is greater. However, this also increases the cost of deploying the deception assets. Therefore, the defender must select the optimal strategy that increases its utility while limiting the cost of deploying highly complex deception assets.
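The assumed interaction can be illustrated with a short Monte Carlo sketch: the attacker flips at exponentially distributed intervals, the defender flips periodically, and we estimate the defender's fraction of control time. The parameter values, the convention that the defender deploys first, and the function name are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_control(delta, lam, horizon=1e6):
    """Estimate the defender's fraction of control time when it shuffles the
    deception attack surface every `delta` time units and the attacker takes
    over at exponentially distributed intervals with rate `lam`."""
    n = int(lam * horizon * 1.5) + 100
    attack_times = np.cumsum(rng.exponential(1.0 / lam, size=n))
    attack_times = attack_times[attack_times < horizon]
    defend_times = np.arange(delta, horizon, delta)

    events = sorted([(t, "attacker") for t in attack_times] +
                    [(t, "defender") for t in defend_times])
    owner, last_change, defender_time = "defender", 0.0, 0.0  # defender deploys first
    for t, player in events:
        if player != owner:                  # control actually changes hands
            if owner == "defender":
                defender_time += t - last_change
            owner, last_change = player, t
    if owner == "defender":
        defender_time += horizon - last_change
    return defender_time / horizon

print(simulate_control(delta=20.0, lam=1 / 25.0))  # roughly 0.69 for these made-up values
```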

3. Multi-Stage Cyber Deception Game Model
In a network attack and defense scenario, when the attacker cannot obtain the critical asset information of the target system for a long time, it will actively change its strategy to maximize its utility. Therefore, a single-stage cyber deception attack and defense model does not conform to the actual attack and defense scenario, and it is necessary to build a multi-stage cyber deception attack and defense model. In this section, we first give a single-stage Flipit game model based on the literature [40]. Then, a multi-stage Flipit deception attack and defense game model based on discounted rewards is given.
3.1. Single-Stage Flipit Deception Game Model
According to the Flipit game model shown in Figure 2 and based on the assumptions in Section 2.2, we first construct the single-stage deception game model.
Definition 2. The Flipit-based Single-stage Attack-Defense Deceptive Game Model (FS-ADD) can be represented by a triple consisting of the players, their strategy sets, and their utility functions:
(1) The players of the single-stage deception game. We only consider one attacker and one defender.
(2) The optional strategy sets of the attacker and defender in the single-stage deception game. The attacker is the player who adopts the exponential-distribution strategy in the Flipit game; its strategy is characterized by the average time interval between its takeovers of the target system, so the attacker’s strategy set is a set of different average time intervals. The defender is the LM player who adopts the periodic strategy in the Flipit game; the defender’s strategy set is a set of different fixed time intervals.
(3) The utility functions of the attacker and defender in the single-stage deception game. According to the Flipit game, each player’s utility is its expected fraction of time in control of the target system minus its move cost rate, where the attacker’s move cost is the cost of seizing control of the target system and the defender’s move cost is the cost of changing the deception assets.
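To make the single-stage utilities concrete, the sketch below evaluates benefits of the form "expected control fraction minus move-cost rate" for a periodic LM defender facing a non-adaptive exponential attacker. The closed-form control fraction and the names delta, lam, c_A, and c_D are assumptions taken from the standard Flipit analysis rather than the paper's exact formula.

```python
import numpy as np

def flipit_utilities(delta, lam, c_A, c_D):
    """Assumed single-stage Flipit benefits.

    delta : defender's period between deception-asset shuffles
    lam   : attacker's exponential rate (mean takeover interval = 1/lam)
    c_A   : attacker's cost per takeover attempt
    c_D   : defender's cost per deception-asset shuffle

    Assumption: after each defender shuffle, the memoryless attacker regains
    control after an Exp(lam) delay, so the defender's expected control
    fraction over one period is (1 - exp(-lam*delta)) / (lam*delta).
    """
    defender_control = (1.0 - np.exp(-lam * delta)) / (lam * delta)
    attacker_control = 1.0 - defender_control
    u_D = defender_control - c_D / delta   # control fraction minus move-cost rate
    u_A = attacker_control - c_A * lam     # attacker moves at rate lam
    return u_A, u_D

# Hypothetical values: attacker takes over every 25 s on average, defender shuffles every 20 s.
print(flipit_utilities(delta=20.0, lam=1 / 25.0, c_A=0.3, c_D=0.2))
```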
3.2. Multi-Stage Flipit Deception Game Model
According to the single-stage Flipit game model, the defender is the LM player, so the deception game exhibits information asymmetry between attacker and defender, no mutually dominant strategy pair exists, and hence there is no Nash equilibrium in the deception game. However, the defender can obtain a strongly dominant strategy: after the system has run for some time, the defender obtains the optimal defense strategy and the deception defense system becomes stable. However, when the attacker cannot obtain the key information of the target system for a long time, it will change its attack strategy, and the defender’s strategy is then no longer optimal. Therefore, the relatively stable deception system needs to re-select the defender’s best defense strategy, which means the system moves from one stage to the next.
Figure 3 shows the multi-stage Flipit deception game model. In each stage, we consider that the attacker plays an exponential strategy with a stage-specific parameter. In order to deceive the attacker and enhance the confusion of the system attack surface, the defender needs to move the deception attack surface periodically to generate a new virtual network topology. However, the choice of deception strategy affects the defender’s utility, so the defender must choose the optimal defense strategy to reach the system’s steady state at each stage. When the attacker cannot obtain the key asset information of the target system for a period of time, it will change its strategy, causing the system to transition to another stage. The defender then needs to re-select the deception strategy to prevent the attacker from obtaining the necessary information from the target system.

Definition 3. The Flipit-based Multi-stage Attack-Defense Deceptive Game Model (FM-ADD) is represented by the following elements:
(1) The players of the multi-stage deception game. We only consider one attacker and one defender.
(2) The total number of stages in the multi-stage Flipit game and the index of the current stage.
(3) The optional strategy sets of the attacker and defender in the k-th stage of the deception game.
(4) The system’s initial state at each stage, that is, the security state of the system when the attacker’s policy changes.
(5) The states of the system at each stage, including the states before the system reaches a stable state and the steady state reached in each stage.
(6) The stage transition probabilities, i.e., the probability of transitioning from the steady state of one stage to the steady state of another.
(7) The utility functions of the attacker and defender in the multi-stage deception game, i.e., the attacker’s and defender’s utility functions in the k-th stage.
In the multi-stage deception game model, the optimal strategy of the defender depends not only on the utility of the current stage but also on the utilities of future stages. Following the literature [41], we design the objective function that evaluates the strategies of the defender and the attacker by introducing the discount factor and the stage transition probabilities.
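A minimal sketch of how a discounted, transition-weighted multi-stage objective of this kind can be evaluated is given below. The linear fixed-point form V = u + gamma * P V, the symbol names, and the numerical values are illustrative assumptions rather than the paper's objective function.

```python
import numpy as np

def discounted_value(u_D, P, gamma, k0=0):
    """Expected discounted defender utility across game stages.

    u_D   : per-stage defender utility at each stage's steady state, shape (K,)
    P     : stage transition probabilities, shape (K, K), rows sum to 1
    gamma : discount factor in (0, 1)
    k0    : index of the initial stage

    Solves the linear fixed point V = u_D + gamma * P @ V and returns V[k0].
    """
    K = len(u_D)
    V = np.linalg.solve(np.eye(K) - gamma * P, np.asarray(u_D, dtype=float))
    return V[k0]

# Hypothetical 3-stage example (the paper's experiment uses eight stages).
u_D = [1.0, 0.8, 0.6]
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
print(discounted_value(u_D, P, gamma=0.9))
```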
4. Cyber Deception Strategy
4.1. Single-Stage Flipit Deception Strategy
According to the single-stage Flipit deception game model shown in Figure 2, since the attacker plays an exponential strategy and the defender plays a periodic strategy, the information of the attacker and defender is asymmetric: the defender who deploys the deception strategy can observe part of the attacker’s strategy, while the attacker cannot obtain information about the defender. Therefore, the single-stage Flipit deception game model admits only a strongly dominant strategy for the defender, and there is no Nash equilibrium between attacker and defender. According to the literature [40], when the attacker follows an exponential distribution and the defender is an LM player adopting a periodic strategy, the defender’s strongly dominant strategy can be obtained by setting the derivative of the defender’s utility with respect to its period to zero.
According to formula (3), when the defender’s move cost is below a threshold determined by the attacker’s strategy, the defender has a unique periodic strategy that maximizes its utility, and the defender’s optimal period and the attacker’s strategy satisfy the corresponding first-order condition. When the cost reaches or exceeds this threshold, the defender’s utility function is monotonically increasing in its period, so the defender’s optimal strategy is an arbitrarily long period, which means the defender takes no action to seize control.
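The following sketch locates the strongly dominant period numerically under the single-stage utility assumed earlier, scanning a grid of candidate periods instead of solving the derivative condition in closed form; all parameter values are made up.

```python
import numpy as np

def defender_utility(delta, lam, c_D):
    # Assumed single-stage defender utility (see the earlier sketch).
    return (1.0 - np.exp(-lam * delta)) / (lam * delta) - c_D / delta

def best_period(lam, c_D, grid=np.linspace(1.0, 100.0, 10000)):
    """Scan the defender's strategy space for the utility-maximizing period.
    If the utility keeps growing toward the grid edge, this corresponds to the
    'take no action' regime described in Section 4.1."""
    utilities = defender_utility(grid, lam, c_D)
    i = int(np.argmax(utilities))
    return grid[i], utilities[i]

print(best_period(lam=1 / 25.0, c_D=0.2))   # interior optimum: defend periodically
print(best_period(lam=1 / 25.0, c_D=30.0))  # cost too high: argmax drifts to the boundary
```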
4.2. Multi-Stage Flipit Deception Strategy
Based on the single-stage Flipit deception game model, we construct a multi-stage Flipit deception game model by introducing the discount factor and the stage transition probabilities. We then calculate the optimal defense strategy when the attacker adopts different strategies, that is, at different stages; the optimal strategy is obtained from the objective in formula (4).
When the attacker’s strategy is such that the defender’s best response is to take no action, studying the defense strategy is meaningless. Therefore, this paper only considers the optimal deception strategy for attacker strategies under which the defender’s best response is a finite periodic strategy.
By analyzing formula (4), it can be found that it defines a dynamic programming problem. Because the defender’s strategy space is continuous, to avoid the curse of dimensionality faced by traditional algorithms, we propose using deep reinforcement learning to compute the optimal deception strategy of the multi-stage deception game model. In this section, we first describe the state, action, and reward function of the reinforcement learning formulation and then design a multi-stage Flipit game deception strategy generation algorithm based on the Proximal Policy Optimization (PPO) algorithm, called MFD-PPO.
4.2.1. Reinforcement Learning Algorithm
In the classical reinforcement learning model, the agent interacts with the environment continuously and gradually learns the optimal objective through training. This process mainly involves the dynamic change of the environment, the behavior of the agent, and the reward received after acting, which are called the state, action, and reward, respectively. When the agent selects an action from the strategy space according to the state observed at a given time step, it receives a reward for taking that action in that state, and the environment then transitions to the next state. Over the whole training process, the interaction between the agent and the environment can be expressed as a trajectory of states, actions, and rewards, and the effect of training can be quantified by the accumulated discounted reward of the learning process given in formula (5), where the summation runs over the iteration steps and the discount factor weights future rewards. Formula (5) indicates that the agent’s return depends not only on the current reward but also on future rewards. Based on this, in the multi-stage Flipit deception game model, the states, actions, and rewards are defined as follows:
(1) State. In the multi-stage deception game model, the state consists of the strategy taken by the attacker at every stage and the corresponding cost of the defender. At any time step, the state of the system is the collection of the per-stage states, where the state of the k-th stage of the multi-stage game consists of the attacker’s strategy at the k-th stage and the defender’s cost at the k-th stage.
(2) Action. In the multi-stage deception game model, the defender’s strategy at each stage serves as the action space. According to the multi-stage deception game model, the defender’s deception strategy is a fixed time interval chosen from a continuous range bounded by the minimum and maximum control intervals the defender can choose. When the defender selects an action at a given time step, the system changes the configuration of the deception nodes according to the selected strategy. If the defender chooses a particular deception strategy, it combines the configurations of the deception nodes accordingly: the longer the period chosen by the defender, the longer it takes to change the deception assets and the more complex the deception assets that can be deployed on the deception nodes.
(3) Reward. The defender’s behavior affects the security situation of the deception system, and the defender obtains rewards from the environment accordingly. Here, the objective function in formula (7) is used as the reward function; an illustrative environment sketch is given after this list.
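The state, action, and reward definitions above can be wired into a conventional reinforcement-learning environment. The sketch below is a hypothetical Gym-style environment, not the authors' implementation: the class name, the per-stage attacker rates and defender costs, and the use of the earlier assumed single-stage utility as the per-step reward are all illustrative assumptions.

```python
import numpy as np

class MultiStageFlipitEnv:
    """Hypothetical environment for the multi-stage Flipit deception game.

    State : attacker strategy (rate) and defender cost for every stage.
    Action: the defender's shuffle period in [delta_min, delta_max].
    Reward: the assumed single-stage defender utility for the current stage.
    """

    def __init__(self, n_stages=8, delta_min=1.0, delta_max=100.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.n_stages = n_stages
        self.delta_min, self.delta_max = delta_min, delta_max
        # Hypothetical per-stage attacker rates and defender move costs.
        self.lam = self.rng.uniform(1 / 50.0, 1 / 10.0, size=n_stages)
        self.c_D = self.rng.uniform(0.1, 0.5, size=n_stages)
        self.stage = 0

    def _obs(self):
        return np.concatenate([self.lam, self.c_D]).astype(np.float32)

    def reset(self):
        self.stage = 0
        return self._obs()

    def step(self, action):
        delta = float(np.clip(action, self.delta_min, self.delta_max))
        lam, c_D = self.lam[self.stage], self.c_D[self.stage]
        # Assumed single-stage defender utility: control fraction minus cost rate.
        reward = (1.0 - np.exp(-lam * delta)) / (lam * delta) - c_D / delta
        self.stage += 1                       # attacker changes strategy -> next stage
        done = self.stage >= self.n_stages
        return self._obs(), reward, done, {}

# Random-policy rollout, just to show the interface.
env = MultiStageFlipitEnv()
obs, done = env.reset(), False
while not done:
    obs, r, done, _ = env.step(env.rng.uniform(1.0, 100.0))
```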
4.2.2. Algorithm of Multi-Stage Deception Defense Strategy
Based on the analysis in Section 4.2.1, the action space of the agent is continuous. Using a classical reinforcement learning algorithm such as Q-learning to compute the deception strategy would lead to an infinitely large Q table; such algorithms cannot handle dynamic programming problems with high-dimensional state and action spaces. Therefore, in this paper, we design a multi-stage Flipit deception strategy solution algorithm based on the Proximal Policy Optimization (PPO) algorithm (MFD-PPO). PPO is a deep reinforcement learning algorithm based on the policy gradient. By combining reinforcement learning with neural networks, PPO has clear advantages in solving optimization problems with high-dimensional state and action spaces.
Figure 4 shows the framework of the multi-stage Flipit deception game strategy based on the PPO algorithm (MFD-PPO). PPO builds on the Actor-Critic deep reinforcement learning architecture and mainly includes Actor and Critic neural networks, where the Actor network consists of a new Actor network and an old Actor network. In the training process, PPO first inputs the state of the environment into the Actor network to obtain the mean and variance of the normal distribution over actions. PPO then samples an action from this normal distribution, applies it to the environment, and obtains the reward and the next state. The resulting transition is stored in the experience pool, and the next state is fed back into the Actor network. Repeating these steps, PPO collects a trajectory; the Actor network does not update its parameters during collection. After the number of state-reward pairs in the experience pool reaches a certain number, PPO applies the last sampled action to the environment, obtains the resulting state, and inputs this state together with the states stored in the experience pool into the Critic network, which outputs the corresponding state values. The discounted reward is then computed from the stored rewards, bootstrapped with the value of the last sampled state, and the estimate of the advantage function is obtained as the difference between the discounted reward and the state value.

The loss function of the Critic network is obtained from the advantage function in equation (9), and the system updates the parameters of the Critic network according to this loss.
For the parameter update of the Actor network, the PPO algorithm first feeds the collected states into the new Actor network and the old Actor network, which output two normal distributions, Normal1 and Normal2, respectively. MFD-PPO then evaluates the stored actions under Normal1 and Normal2 to obtain the corresponding probabilities, and dividing the probability under the new network by the probability under the old network yields the importance weight, from which the loss function of the Actor network is obtained.
However, in the training process, if the policy update step is too large, it degrades the optimum, and if it is too small, it slows down the convergence speed. Therefore, the PPO algorithm introduces a clip hyper-parameter to bound the policy update step, and the loss function of the Actor network is optimized accordingly.
The clip function prevents the Actor network from updating too quickly. PPO updates the new Actor network according to formula (12) and repeats the above steps until the iteration ends, then updates the old Actor network. Based on the PPO algorithm framework shown in Figure 4, this paper designs a multi-stage Flipit deception game strategy selection algorithm based on PPO (MFD-PPO). The detailed algorithm is shown in Algorithm 1.
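To make the clipped objective concrete, the sketch below computes the importance weights, the clipped Actor loss, and a squared-error Critic loss on a toy batch. It follows the standard PPO formulation rather than reproducing the paper's equations exactly, and all array values are hypothetical.

```python
import numpy as np

def ppo_losses(old_log_prob, new_log_prob, advantages, returns, values, eps=0.2):
    """Clipped PPO surrogate loss and Critic loss on one sampled batch.

    old_log_prob : log pi_old(a_t | s_t) from the old Actor network
    new_log_prob : log pi_new(a_t | s_t) from the new Actor network
    advantages   : advantage estimates (discounted return minus state value)
    returns      : discounted returns used as the Critic's regression target
    values       : state values predicted by the Critic
    eps          : clip hyper-parameter (0.2 is the value selected in Section 5.3.1)
    """
    ratio = np.exp(new_log_prob - old_log_prob)              # importance weight
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    actor_loss = -np.mean(np.minimum(unclipped, clipped))    # maximize the clipped surrogate
    critic_loss = np.mean((returns - values) ** 2)           # squared error on returns
    return actor_loss, critic_loss

# Toy batch with hypothetical numbers.
print(ppo_losses(old_log_prob=np.array([-1.2, -0.8, -1.0]),
                 new_log_prob=np.array([-1.0, -0.9, -1.1]),
                 advantages=np.array([0.5, -0.3, 0.1]),
                 returns=np.array([1.0, 0.4, 0.7]),
                 values=np.array([0.8, 0.5, 0.6])))
```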
According to the analysis and description of the above algorithm, the MFD-PPO algorithm has strong scalability and efficiency. As for the algorithm’s scalability, MFD-PPO extends PPO, as described in Section 4.2.2. According to the literature [42], PPO can be applied to scenarios with high-dimensional state and action spaces, so MFD-PPO can also handle high-dimensional state and action spaces. To verify the scalability of MFD-PPO, we set the defender’s action space to the continuous range of shuffle periods used in the experiments and the state space to consist of the eight stages’ attack policies and costs described in Section 4.2.1. The experimental results thus effectively show that MFD-PPO can handle high-dimensional, continuous state and action spaces. Moreover, when applying the algorithm to other scenarios, users only need to redefine the state and action spaces, and the algorithm can easily produce results.
As for the algorithm’s efficiency, the experiments in Sections 5.3.1 and 5.3.2 demonstrate that MFD-PPO correctly obtains the optimal defense strategy. Moreover, MFD-PPO only needs to be trained once: after training is complete, the trained model can be deployed to the target system and used to obtain the optimal deception strategy with little effort. Therefore, the computational overhead is mainly incurred during training, and the resources required are limited.
5. Simulation Experiment and Analysis
The rapid development of cloud computing provides users with convenient ways to access computing resources. However, the architecture of cloud computing also exposes many attack surfaces, enabling attackers to gain control of the cloud in various ways and then launch further attacks on the cloud computing environment. Therefore, it is urgent to study attack and defense models for the cloud environment. In this section, we take the cloud computing environment as an example to analyze the proposed multi-stage deception model based on the Flipit game and to calculate the multi-stage deception strategy.
5.1. Experimental Environment
In order to build a realistic attack and defense interaction scenario in the cloud environment, this paper uses Kubernetes to build the experimental environment. As an automated container management tool, Kubernetes makes deploying, scheduling, and deleting applications more convenient. As shown in Figure 5, we build a Kubernetes cluster to manage application services. The cluster mainly includes a master node (16-core processor, 16 GB RAM) and three worker nodes (16-core processor, 16 GB RAM). The master node manages the creation and deletion of Pods, and the worker nodes run the Pods. Applications are deployed in Pods and mainly include a Web server, an FTP server, a database server, an LDAP server, and a honeypot server. The clusters of these application servers are the main targets of deception defense. The main defense method is deploying honeypot clusters to prevent attackers from discovering critical assets in the cloud computing environment.

Based on the above experimental environment and references [43, 44], this paper designs a multi-stage deception game model based on the Flipit game that includes eight stages, as shown in Figure 6.

Based on the literature [41, 45], the transition probabilities between stages can be obtained from prior knowledge. The transition probabilities of the multi-stage deception game based on the Flipit game are therefore shown in Table 1. Although the transition probabilities are fixed in this experiment, they are used only as an example, and the particular choice of transition probabilities does not affect the stability of the algorithm.
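Since the entries of Table 1 are not reproduced here, the sketch below uses a hypothetical row-stochastic matrix to show how stage transition probabilities of this kind can be validated and sampled.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-stage transition matrix (Table 1 in the paper covers eight stages).
P = np.array([[0.50, 0.20, 0.20, 0.10],
              [0.10, 0.60, 0.20, 0.10],
              [0.20, 0.20, 0.50, 0.10],
              [0.25, 0.25, 0.25, 0.25]])
assert np.allclose(P.sum(axis=1), 1.0), "each row must be a probability distribution"

def next_stage(current, P, rng):
    """Sample the next stage entered when the attacker changes its strategy."""
    return rng.choice(len(P), p=P[current])

print(next_stage(0, P, rng))
```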
5.2. Parameters Setting
In order to obtain the multi-stage deception defense strategy, we implement the MFD-PPO algorithm using Python 3.6.7, TensorFlow 1.8.0, and the Stable Baselines library. We ran our experiments on a HUAWEI machine with an Intel(R) Core(TM) i7-6700 @ 3.40 GHz CPU and 32 GB of RAM. According to the experimental environment built in Section 5.1, to obtain the 8-stage deception strategy, it is necessary to first determine the attack and defense strategy spaces. Therefore, this section simulates and analyzes the relationships among the defender’s reward, the defender’s strategy, the defender’s cost, and the attacker’s strategy in the single-stage Flipit game model. As shown in Figure 7, when the cost of the defender remains unchanged, the larger the attacker’s strategy, the smaller the optimal deception strategy that maximizes the defender’s reward. This means that the larger the attacker’s strategy, the shorter the average time the attacker needs to seize control of the cloud; to maximize its reward, the defender’s interval between seizing control should also be shorter in order to defend against an attacker with a shorter average shuffle time. At the same time, when the attacker’s strategy is unchanged, the smaller the defender’s cost, the smaller the interval at which the defender seizes control of the cloud when obtaining the maximum reward.

As described in Section 3.2, because the single-stage Flipit game has no Nash equilibrium and only a strongly dominant strategy for the defender, the multi-stage Flipit game model likewise has no Nash equilibrium. To provide robust defense strategies, we let the attacker’s strategy vary within a certain range at each stage during the experiment. This avoids the influence of a fixed attack strategy on the optimal defense strategy and ensures the effectiveness of the defense policy. According to the above simulation results and the analysis in Section 5.1, the attacker’s strategy must be below a certain threshold, and the defender’s strategy space must lie within 10–50 s. Therefore, when designing the multi-stage deception strategy space, to ensure the accuracy of the defense strategy, the defender’s strategy space is set to 1–100 s, and the attacker’s strategy is restricted to values below that threshold. In addition, to ensure the stability and convergence speed of the MFD-PPO algorithm, its parameters are set as shown in Table 2.
In order to demonstrate the advantages of the MFD-PPO algorithm in calculating the multi-stage deception strategy, this paper selects the following algorithms for comparison:
(1) Multi-stage Flipit deception game strategy selection algorithm based on Particle Swarm Optimization (MFD-PSO). PSO is a typical heuristic algorithm that finds the optimal solution by modeling the optimization problem as the foraging and flight behavior of a flock of birds: the flight space of the birds is the solution space, and each bird’s position in that space is one candidate solution to the problem.
(2) Multi-stage Flipit deception game strategy selection algorithm based on Advantage Actor-Critic (MFD-A2C). A2C is also a deep reinforcement learning algorithm based on the Actor-Critic architecture. Multiple parallel threads are built during training; each thread contains an Actor-Critic network and interacts with its own environment to obtain independent experience.
5.3. Analysis of Experiment Results
5.3.1. Convergence Analysis
In order to analyze the convergence of the MFD-PPO algorithm and the impact of the clip-function hyper-parameter on it, this simulation performs training over a large number of steps. Since the hyper-parameter in the clip function generally ranges from 0.1 to 0.3, the values selected in the experiment are 0.1, 0.15, 0.2, 0.22, 0.26, and 0.3. The training results are shown in Figure 8, where the horizontal axis represents the number of training steps and the vertical axis represents the discounted reward. From the results in Figure 8, MFD-PPO starts to converge when training reaches 600,000 steps, and the training time is only three hours. Moreover, for hyper-parameter values of 0.1, 0.15, 0.2, 0.22, 0.26, and 0.3, the discounted reward after convergence is essentially the same, which demonstrates the effectiveness of MFD-PPO. In addition, the experimental results show that the smaller the hyper-parameter, the slower the convergence: a smaller hyper-parameter makes the strategy update more cautious, so convergence is slower. However, the convergence speed is similar when the hyper-parameter lies in the range of 0.2 to 0.3. To avoid excessive differences between the new and old policies caused by an overly large hyper-parameter, this paper selects a hyper-parameter of 0.2.

5.3.2. Multi-Stage Deception Strategy Solution
Based on the hyper-parameter selected in Section 5.3.1, this section calculates the reward and deception strategy of each stage in the multi-stage Flipit deception model. The experimental results are shown in Figures 9 and 10. Figure 9 shows the defender’s reward at each stage; the rewards of all eight stages converge during training. Given this convergence, Figure 10 shows the defender’s strategy when the attacker takes different attack strategies. As shown in Figures 9 and 10, once the system reaches a steady state, the defender’s reward and strategy fluctuate around the following values in each stage: 1.05 and 20 in the 1st stage, 0.96 and 20 in the 2nd stage, 1.02 and 22 in the 3rd stage, 0.58 and 24 in the 4th stage, 0.62 and 20 in the 5th stage, 0.99 and 20 in the 6th stage, 0.53 and 20 in the 7th stage, and 1.06 and 20 in the 8th stage.


5.3.3. Delay Comparison
In order to evaluate the advantages of the MFD-PPO algorithm, this section first compares MFD-A2C with MFD-PPO. Figure 11 shows the training results of MFD-PPO and MFD-A2C. The results show that although MFD-A2C converges faster than MFD-PPO, its reward is not monotonically increasing during training, and after convergence its reward fluctuates far more than that of MFD-PPO. Therefore, the MFD-PPO algorithm is more stable when solving the multi-stage deception strategy. In addition, this paper compares the time delay of the MFD-A2C, MFD-PSO, and MFD-PPO algorithms in obtaining a multi-stage deception strategy. Since deep reinforcement learning involves separate training and decision-making phases, it is meaningless to compare its training delay with the delay of PSO. Therefore, we compare the decision-making delay of MFD-PPO and MFD-A2C with the delay of MFD-PSO; the experimental results are shown in Figure 12. The results show that the deep reinforcement learning algorithms MFD-PPO and MFD-A2C have obvious advantages over MFD-PSO in decision-making: MFD-PSO must recalculate the multi-stage deception strategy whenever the environment changes, while MFD-PPO and MFD-A2C can quickly produce the multi-stage deception defense strategy from the trained model. Furthermore, the results show that MFD-PPO also produces the multi-stage deception strategy more quickly than MFD-A2C, further illustrating its advantages.


6. Conclusion
In view of the existing problems in game-based deception decision-making models, this paper presents a deception decision-making model based on a multi-stage Flipit game. First, based on the moving attack surface and the deception attack surface, this paper proposes a deception model based on the moving deception attack surface. Then, based on the analysis of cyber deception attack and defense behavior, a single-stage spatial-temporal decision-making deception model is constructed on the Flipit game. On this basis, we present a multi-stage spatial-temporal decision-making deception game model by introducing a discounted reward and stage transition probabilities, and give the utility function of the multi-stage deception model. Finally, we design a deep reinforcement learning algorithm, MFD-PPO, to obtain the optimal deception defense strategy. The experimental results show that the MFD-PPO algorithm is highly stable when computing the multi-stage deception strategy and has lower delay than the heuristic algorithm.
Our work opens up new avenues for future research. In this paper, we assume the existence of a strongly dominant strategy for the defender and let the attacker’s strategy vary within a certain range. It would be interesting to study the case in which the attacker selects random policies in every stage. For future work, we may study a model in which both attackers and defenders have optimal strategies and propose a deception strategy selection method based on multi-agent deep reinforcement learning, in which the policies of the attacker and defender are random and neither side knows the policy of the other.
Data Availability
The data supporting this study are available at https://github.com/hwz9612/multi-stage-Flipit-game or from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Authors’ Contributions
Weizhen He and Jinglei Tan contributed equally to this work.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (No. 62072467), the National Key R&D Program of China (Nos. 2021YFB1006200 and 2021YFB1006201), and the National Natural Science Foundation of China (No. 62002384).