Abstract

Making effective use of scarce spectrum resources, together with efficient computation, is one of the key challenges for future wireless networks. To tackle this issue, this paper focuses on intelligent dynamic spectrum allocation (DSA) in a mobile edge computing (MEC) enabled cognitive network, with the objective of optimizing spectrum utilization and load balance among idle channels. Since users can only acquire part of the environment information in a decentralized way, we model the problem as a decentralized partially observable Markov decision process (Dec-POMDP) and design a corresponding evaluation metric that encourages users to sense and access the spectrum properly. We then propose a QMIX-based DSA method with a centralized training decentralized execution (CTDE) structure to solve it. In the training phase, the users offload the computational tasks to the MEC server to obtain the optimal distributed DSA strategies, through which the users select the optimal channel locally in the execution phase. Simulation results show that, using the proposed algorithm, users can independently capture spectrum holes and hence improve the spectrum utilization while balancing the load on the available channels.

1. Introduction

With the rapid development of future beyond-5G/6G wireless communication, more and more emerging applications such as virtual reality (VR), augmented reality (AR), and interactive gaming are springing up, requiring low latency, high reliability, and powerful computational capability [1]. Due to size, weight, and power (SWaP) constraints, devices have limited computational capability and cannot fully support these applications. Mobile edge computing (MEC) has great potential to overcome this issue by allowing users to offload computational tasks to MEC servers [2–4]. In this way, many services can be provided, such as communication, caching, and computing [5]. Meanwhile, MEC can also boost users' intelligence to achieve efficient spectrum access with low coordination overhead. Confronting the time-varying and complex wireless environment, it is challenging for users to access the spectrum adaptively with robustness guarantees. MEC-enabled intelligent dynamic spectrum allocation (DSA) can play a significant role in this case.

Many efforts have been devoted to DSA in the traditional way, e.g., blind rendezvous in D2D [6] and cognitive radio networks (CRNs) [7], cross-layer approaches [8, 9] and randomized rounding algorithms [10] in CRNs, and bipartite graph theory in wireless LANs [11]. Since these works require a lot of statistical knowledge, which is difficult to obtain in a dynamic network, intelligent approaches have been adopted for DSA. In [12], a genetic algorithm (GA) is adopted by the central node to complete DSA for each secondary user (SU) in a CRN. Benefiting from the fitting properties of deep neural networks and the interaction with the environment, several works adopt the deep reinforcement learning (DRL) framework [13]. Specifically, in [14, 15], a central controller is employed to evaluate and allocate channels to multiple users through DRL, while the users report the channel state after accessing, so as to maximize channel utilization and minimize collisions without prior information. However, there is still much room for improvement compared with the optimal scheme, due to the heavy signalling overhead and overdependence on the central node. To tackle this problem, an increasing number of works focus on intelligent DSA in a distributed way. In [16], a bioinspired solution is employed for users to adjust and optimize the cluster size distributedly in the cognitive Internet of Things (IoT), which aims to achieve efficient spectrum allocation, flexible connection, and minimum network access delay. In [17], a heuristic method is applied for users to sense an unoccupied spectrum and build an optimal route, so as to complete dynamic spectrum allocation and power control. The authors in [18] model the multiuser multichannel allocation as an undirected graph, and a greedy algorithm is designed to form load-balanced clusters. The authors in [19, 20] introduce game theory for SUs to allocate spectrum, where the SUs learn access strategies competitively by maximizing their respective revenue. The authors in [21] investigate multiuser frequency division multiple access in radio networks, and a bargaining approach is used to allocate subcarriers and power by reaching a Nash equilibrium, so as to achieve a tradeoff between throughput and power consumption. Even though these works make good progress on intelligent DSA, there are still shortcomings. For one thing, these methods are only feasible for simple cases with a small strategy space; otherwise, long decision times occur, which is impractical for low-latency scenarios. For another, the interaction among users in a dynamic environment needs to be addressed further.

To tackle the dilemmas above, a powerful method, multiagent reinforcement learning (MARL), has gained increasing interest recently. MARL is an extension of reinforcement learning (RL) [22] and is suitable for distributed learning and processing. With the aid of this method, users can interact with the environment and obtain their DSA strategies as agents. In particular, in [23, 24], the double deep Q-network (DQN) algorithm and multiagent Q-learning are applied to active users, who compete to access multiple channels independently to minimize collisions. In [25], the SUs in a CRN aim to learn proper spectrum access strategies autonomously, and a DRL method combined with an echo state network (ESN) is adopted, through which interference can be alleviated. In [26], a competitive spectrum access scheme for multiple users is proposed, and the performance of DSA and collision avoidance is analyzed with a Q-learning method. In [27], multiuser DSA is modeled as a multiarmed bandit problem, in which the users are supposed to access proper spectrum distributedly. However, since there is no negotiation in these works, i.e., users analyze information locally and allocate the spectrum resources by way of competition, the spectrum access can be regarded as an ALOHA-like process, which still faces great challenges in highly dynamic environments. Moreover, owing to constraints on computing and battery capacity, there is still a huge gap to fill before real-world deployment.

Inspired by MEC technology, which can extend the computational capacity and process the DSA task at the edge cloud platform [28–30], the centralized training decentralized execution (CTDE) scheme of MARL has been adopted to achieve efficient DSA [31–33]. In this way, the distributed users offload the DSA task to the cloud platform in the training phase, so as to learn the action-taking strategies, and then work in a fully distributed manner in the practical implementation phase. Specifically, the authors in [31] investigate noncooperative DSA in a CRN. In [32], users are expected to transmit on idle channels and cooperate to maximize the sum rate, and the double DQN algorithm is utilized, leaving a certain gap with the optimal performance. Considering the linear relationship between the global and individual utility, the authors in [33] propose a QMIX-based DSA algorithm, which yields excellent strategies for users and achieves maximum successful transmissions with minimum collisions. However, in order to avoid collisions, the users in [32, 33] are allowed to remain silent. That is, it is assumed that only part of the users can participate in the DSA at the same time, which is unfair to the users distributed in the network. Moreover, the works in [18, 24, 27] assume perfect spectrum detection, where the selection of an idle spectrum is always guaranteed. In reality, influenced by the dynamic environment and limited hardware, the detection ability is usually partial [34] and imperfect [35].

In this paper, we investigate intelligent DSA for multiuser MEC networks. We consider that the users can only sense part of the information, and their detection ability is assumed to be imperfect. Motivated by the monotonicity of the considered problem, as well as the powerful computation ability of the MEC server, a QMIX-based DSA algorithm with the CTDE structure is employed. To the best of our knowledge, this problem has not been studied yet. The main contributions of our work are highlighted as follows:

(1) We focus on a MEC-enabled cognitive network deploying multiple SUs who attempt to access the dynamic spectrum without perfect sensing capabilities. In this scenario, the SUs are supposed to be intelligent enough to achieve their common task autonomously. Meanwhile, in order to make up for the users' limited computational capability and energy reserve, a MEC server, which is computationally powerful and long-lived, is employed at the BS. This is practically significant, since the traditional dependence on a central controller is removed, and the users can adapt to the environment independently and timely with lower overhead, so that the scheme can be further extended to latency-sensitive applications.

(2) We formulate a distributed DSA problem to improve both the idle channel utilization and the load balance, and model it as a decentralized partially observable Markov decision process (Dec-POMDP). We then propose a CTDE-enabled DSA algorithm whose characteristics are consistent with those of the modeled problem. This algorithm can handle the environment dynamics and the users' partial observations with low complexity for practical implementation. In particular, different from online searching methods such as POMCP [36], DESPOT [37], and HyP-DESPOT [38], there are two phases in our proposed algorithm, i.e., offline training and online execution. In the offline phase, the DSA task is offloaded to the MEC server; with its aid, the SUs adaptively adjust their DSA strategies so that their network models are well trained. In the online phase, each SU executes actions locally based on the trained model, without a central controller or coordination among SUs.

(3) We present simulations to demonstrate the effectiveness and feasibility of the proposed DSA algorithm under different settings in a dynamic environment. We observe that the optimal network utility is always achieved after limited training, while the sensing accuracy is also improved. The SUs can effectively overcome their imperfect sensing and capture the idle channels, based on which the expected optimal DSA task can be completed in a fully distributed manner.

The rest of this paper is structured as follows: the system model is provided in Section 2. In Section 3, the problem is formulated as a Dec-POMDP to maximize the global utility. Then, in Section 4, a QMIX-based algorithm is proposed to obtain the optimal DSA policy. Numerical results are provided in Section 5, and the conclusion is presented in Section 6.

2. System Model

As shown in Figure 1, we consider a MEC-enabled cognitive network consisting of one primary user (PU), $N$ SUs with the set denoted by $\mathcal{N}=\{1,2,\ldots,N\}$, and one cognitive BS equipped with a MEC server. There are $M$ orthogonal authorized channels, denoted as $\mathcal{M}=\{1,2,\ldots,M\}$. The channels' states switch between idle and occupied according to the communication behavior of the PU; however, the channel states and the switching pattern are unknown to the SUs. We assume that some of the channels are idle, which the SUs distributed in the network can utilize opportunistically. In this paper, the SUs are supposed to capture the PU's occupation pattern and learn to sense and access channels autonomously, so as to achieve efficient DSA. Due to their limited computational ability and battery life, the SUs offload their DSA tasks to the MEC server for computation and analysis. Thereafter, the MEC server distributes the DSA strategies to each SU for online learning to realize the DSA. In the whole process, no information interaction is required among SUs.

We assume that all the SUs are slot-synchronized and that each of them can sense only a subset of the primary channels, one of which it then accesses. Here, the energy detection mechanism is employed [39]. In practice, detection is imperfect, which inevitably may cause wrong judgements. Moreover, since the environment is unstable and the channel states are time-varying, as mentioned above, the SUs interact with the environment to learn how to sense and access a proper channel.
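To illustrate the imperfect sensing assumed here, the following toy sketch simulates energy detection on a single channel with fixed miss-detection and false-alarm probabilities; the probability values and the function name are illustrative assumptions, not parameters from the paper.

```python
import random

def sense_channel(true_state: int, p_miss: float = 0.1, p_false_alarm: float = 0.05) -> int:
    """Toy model of imperfect sensing of one channel.
    true_state: 1 if the channel is occupied by the PU, 0 if it is idle.
    Returns the sensed state, which may differ from the true one."""
    if true_state == 1:
        # with probability p_miss the busy channel is wrongly sensed as idle
        return 0 if random.random() < p_miss else 1
    # with probability p_false_alarm the idle channel is wrongly sensed as busy
    return 1 if random.random() < p_false_alarm else 0
```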

Specifically, as depicted in Figure 2, the whole DSA procedure for all SUs can be described as follows: the PU occupies one or more channels at each time slot, and the states of the primary channels may change from slot to slot. First, each SU senses channels independently to judge whether they are occupied by the PU. Then, each SU attempts to access one of the sensed channels and sends a request signal to the BS, where the MEC server performs the computation and analysis to produce the distributed DSA strategy for each user. In this way, the SUs can learn the switching patterns of the channel states and decide which channel should be sensed at the next time slot, by analyzing their current DSA scheme and the corresponding feedback received from the BS.

3. Problem Definition with Dec-POMDP

Note that all SUs aim to achieve DSA with the objective of fully utilizing the idle channels while balancing the load among them. Each SU can only sense part of the primary channels and then judges the occupation state of the channel that it accesses, without prior coordination among SUs. That is, the multiple SUs distributed in the network can only obtain partial environment information. Therefore, the problem can be modeled as a Dec-POMDP, formulated as a tuple $\langle N, \mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$. The definitions of the tuple elements are listed as follows, and some of the key symbols are summarized in Table 1.

$N$ is the number of SUs, who are regarded as multiple agents in the interactive environment.

$\mathcal{S}$ is the global channel state space, which reflects the true states of the $M$ orthogonal authorized channels in the communication environment. At time slot $t$, the global channel state is defined as $\mathbf{s}(t)=[s_1(t),s_2(t),\ldots,s_M(t)]$, where the state of channel $m$, $s_m(t)$, is given by
$$s_m(t)=\begin{cases}0, & \text{channel } m \text{ is idle at slot } t,\\ 1, & \text{channel } m \text{ is occupied by the PU at slot } t.\end{cases}$$

$\mathcal{O}$ is the partial observation space, representing the sensed channels of all agents. At each time slot $t$, the agents observe some representation of the environment from the state $\mathbf{s}(t)$. In particular, an agent may not obtain full and perfect knowledge of the channel states, i.e., the observation of agent $n$ covers only the channels it senses and may be erroneous. We define the observation of agent $n$ for channel $m$ as $o_{n,m}(t)$, where
$$o_{n,m}(t)=\begin{cases}0, & \text{channel } m \text{ is sensed as idle by agent } n,\\ 1, & \text{channel } m \text{ is sensed as occupied by agent } n.\end{cases}$$
Due to imperfect detection, $o_{n,m}(t)$ may differ from the true state $s_m(t)$.

$\mathcal{A}$ is the action space of all agents. The action profile of all agents at slot $t$ is formulated as $\mathbf{a}(t)=[a_1(t),a_2(t),\ldots,a_N(t)]$. For agent $n$ who chooses channel $m$, $m\in\mathcal{M}$, we define the access indicator $a_{n,m}(t)$, where
$$a_{n,m}(t)=\begin{cases}1, & \text{agent } n \text{ accesses channel } m \text{ at slot } t,\\ 0, & \text{otherwise}.\end{cases}$$

$\mathcal{P}$ is the state transition matrix, reflecting the transition of the channel occupation from state $\mathbf{s}(t)$ to $\mathbf{s}(t+1)$.

$\mathcal{R}$ is the set of immediate rewards of all agents after accessing the sensed channels, which encourages the agents to learn an optimal DSA strategy. Here, the agents are supposed to independently sense and access a truly vacant and proper channel and obtain a reward according to the feedback from the BS.

The key point of the reward of each agent is to make full use of the idle channels as fairly as possible. Since the MEC server at the BS collects the channel state and the number of agents requesting the same channel, we design the immediate reward of the $n$-th agent, which accesses channel $m$ at slot $t$, as
$$r_n(t)=\begin{cases} f\!\left(\dfrac{1+\sum_{j\in\mathcal{N}_{-n}} a_{j,m}(t)}{N}\right), & s_m(t)=0,\\[2mm] r^{-}, & s_m(t)=1,\end{cases}$$
where the function $f(\cdot)$ applies when an available channel is chosen. Its argument, the ratio of the number of agents allocated on the same channel $m$ to the total number of agents $N$, guides the agent to access an idle channel properly. Here, $\mathcal{N}_{-n}$ denotes the set of agents except agent $n$, so $\sum_{j\in\mathcal{N}_{-n}} a_{j,m}(t)$ measures the number of other agents on channel $m$. On the contrary, when a busy channel is wrongly chosen by agent $n$, a negative reward $r^{-}<0$ is incurred and an error identification is reported.

Specifically, $f(\cdot)$ is a piecewise function bounded on $[0,1]$, which is designed to guarantee that all the available channels are utilized fairly.

As its argument increases, the reward first increases, reaches its maximum at the boundary point, and then decreases rapidly. This signifies that an agent obtains a small reward whether there are too many or too few agents on the selected channel, so that a balanced allocation over the limited available channels is encouraged for all agents in the cognitive network.
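To make the reward computation concrete, the sketch below implements one plausible instantiation of the per-agent reward, assuming a triangular shaping function that peaks at the fair-share ratio and a fixed penalty of -1.0 for accessing a busy channel; the exact shaping function and penalty value are not given in the text, so these are illustrative assumptions.

```python
import numpy as np

def shaping_f(x: float, x_star: float) -> float:
    """Illustrative piecewise shaping function on [0, 1]: increases up to the
    fair-share ratio x_star, then decreases rapidly (assumed triangular form)."""
    if x <= x_star:
        return x / x_star                              # increasing branch
    return max(0.0, 1.0 - 4.0 * (x - x_star))          # rapidly decreasing branch

def agent_reward(n: int, actions: list, channel_state: list, busy_penalty: float = -1.0) -> float:
    """Immediate reward of agent n.
    actions[j]       -- index of the channel chosen by agent j
    channel_state[m] -- 1 if channel m is occupied by the PU, 0 if idle
    """
    m = actions[n]
    if channel_state[m] == 1:                          # busy channel wrongly accessed
        return busy_penalty
    num_agents = len(actions)
    load_ratio = sum(1 for a in actions if a == m) / num_agents
    num_idle = int(np.sum(np.asarray(channel_state) == 0))
    fair_share = 1.0 / max(num_idle, 1)                # ideal fraction of agents per idle channel
    return shaping_f(load_ratio, fair_share)
```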

Based on the immediate rewards of all agents, the total reward in one slot can be written as
$$r_{\mathrm{tot}}\bigl(\boldsymbol{\tau}(t),\mathbf{a}(t)\bigr)=\sum_{n=1}^{N} r_n\bigl(\tau_n(t),a_n(t)\bigr),$$
where $\tau_n$ denotes the observation-action history of agent $n$. Actually, each agent in the network is supposed to act toward the overall optimal DSA. We call this a cooperative game, which is a special type of exact potential game (EPG). Based on this theory, the monotonicity of $r_{\mathrm{tot}}$ conforms to that of each $r_n$ [40]. Thus, we have
$$\frac{\partial r_{\mathrm{tot}}\bigl(\boldsymbol{\tau}(t),\mathbf{a}(t)\bigr)}{\partial r_n\bigl(\tau_n(t),a_n(t)\bigr)}\ge 0,\quad \forall n\in\mathcal{N}.$$

Over a finite horizon of $T$ time slots, the global reward of all agents can be obtained as
$$R=\sum_{t=0}^{T}\gamma^{t}\,r_{\mathrm{tot}}\bigl(\boldsymbol{\tau}(t),\mathbf{a}(t)\bigr),$$
where $\gamma\in[0,1)$ is the discount factor, reflecting the influence of the agents' actions at the current time slot on the long-term return. Substituting the per-slot total reward, this can be rewritten as
$$R=\sum_{t=0}^{T}\gamma^{t}\sum_{n=1}^{N} r_n\bigl(\tau_n(t),a_n(t)\bigr)=\sum_{n=1}^{N}\sum_{t=0}^{T}\gamma^{t}\,r_n\bigl(\tau_n(t),a_n(t)\bigr).$$

The ultimate goal of each agent is to obtain its own optimal spectrum sensing and DSA strategy $\pi_n$, so as to maximize the expected cumulative reward of the whole network. The corresponding problem can be formulated as
$$\boldsymbol{\pi}^{*}=\arg\max_{\boldsymbol{\pi}}\ \mathbb{E}\!\left[\sum_{t=0}^{T}\gamma^{t}\,r_{\mathrm{tot}}\bigl(\boldsymbol{\tau}(t),\mathbf{a}(t)\bigr)\right],$$
where $\mathbb{E}[\cdot]$ denotes the expectation and $\boldsymbol{\pi}=(\pi_1,\ldots,\pi_N)$ is the joint policy of all agents.

From the problem defined above, the agents are expected to possess strong abilities of independent perception and decision-making, so as to maximize the global cumulative reward in a cooperative manner. This is challenging, since in practical scenarios there is no central node to control the whole allocation and no direct information exchange among agents beforehand.

4. QMIX Algorithm for DSA

4.1. Algorithm Description

We consider the QMIX algorithm [41] with the CTDE structure to solve the DSA problem. There are two phases in the DSA of all the agents, i.e., offline training and online execution. On the one hand, in the offline phase, the agents perceive environmental information and offload the DSA task to the MEC server, which is responsible for training and issuing the distributed DSA strategies by computing and analyzing the received data. On the other hand, in the online phase, the MEC server keeps silent, and each agent executes actions autonomously using the learned strategy.

As shown in Figure 3, there are $N$ local agent networks, one per SU, and one mixing network deployed at the MEC server. The agent networks are constructed as deep recurrent Q-networks (DRQNs), catering to the agents' partial observations. Since the agents considered in the network are homogeneous, and also for the sake of system stability, all the DRQNs share the same network structure and parameters. For any agent $n$, given the current observation $o_n(t)$ and the previous action $a_n(t-1)$, the local action value function $Q_n(\tau_n, a_n)$ is obtained, which enables the agent to choose action $a_n(t)$. Then all the agents' value functions are fed into the mixing network. Note that a hypernetwork is embedded in the mixing network, which makes full use of the global channel state to improve the convergence speed and outputs the parameters, e.g., the biases and nonnegative weights, of the mixing network. Finally, through the nonlinear mapping of the mixing network, the joint action value function $Q_{\mathrm{tot}}$ is produced.
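As a concrete illustration, a minimal PyTorch sketch of the per-agent DRQN described above is given below, with the observation concatenated with the previous one-hot action and a 64-dimensional GRU hidden state as mentioned in Section 5; the class name, the fully connected layer size, and the input composition are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AgentDRQN(nn.Module):
    """Per-agent recurrent Q-network: (observation, previous action) -> local Q-values.
    A minimal sketch; the GRU hidden size follows the 64-dim setting in Section 5."""
    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim + n_actions, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.q_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, prev_action_onehot, hidden):
        x = torch.relu(self.fc1(torch.cat([obs, prev_action_onehot], dim=-1)))
        h = self.rnn(x, hidden)            # carries the observation-action history
        return self.q_head(h), h           # local Q_n(tau_n, .), new hidden state
```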

The advantage of this method is that the monotonicity between $Q_{\mathrm{tot}}$ and each $Q_n$ is preserved, i.e.,
$$\frac{\partial Q_{\mathrm{tot}}(\boldsymbol{\tau},\mathbf{a})}{\partial Q_n(\tau_n,a_n)}\ge 0,\quad \forall n\in\mathcal{N},$$
which coincides well with the property of the considered problem. Therefore, the relationship between $Q_{\mathrm{tot}}$ and the local value functions can be further written as
$$\arg\max_{\mathbf{a}} Q_{\mathrm{tot}}(\boldsymbol{\tau},\mathbf{a})=\begin{pmatrix}\arg\max_{a_1} Q_1(\tau_1,a_1)\\ \vdots\\ \arg\max_{a_N} Q_N(\tau_N,a_N)\end{pmatrix}.$$

By learning the optimal joint value function $Q_{\mathrm{tot}}$, we can obtain the agents' local distributed strategies indirectly. The update criterion for the network parameters $\theta$ is to minimize the loss function $L(\theta)$, which is given by
$$L(\theta)=\sum_{i=1}^{B}\Bigl(y_i^{\mathrm{tot}}-Q_{\mathrm{tot}}(\boldsymbol{\tau}_i,\mathbf{a}_i,\mathbf{s}_i;\theta)\Bigr)^{2},$$
where $B$ is the batch size and the target value $y^{\mathrm{tot}}$ is expressed as
$$y^{\mathrm{tot}}=r_{\mathrm{tot}}+\gamma\max_{\mathbf{a}'}Q_{\mathrm{tot}}(\boldsymbol{\tau}',\mathbf{a}',\mathbf{s}';\theta^{-}),$$
where $\theta^{-}$ denotes the parameters of the target network, which provides stable training.
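The monotonicity constraint above is commonly enforced by generating nonnegative mixing weights from the global state through hypernetworks, as in the original QMIX design. The sketch below follows that pattern with the layer sizes reported in Section 5 (a 32-unit mixing layer with ELU and a 64-unit hypernetwork with ReLU); the class name and the remaining details are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QMixer(nn.Module):
    """Monotonic mixing network: per-agent Q-values -> Q_tot, conditioned on the
    global channel state via hypernetworks. Sizes follow Section 5; a sketch only."""
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32, hyper_dim: int = 64):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Sequential(nn.Linear(state_dim, hyper_dim), nn.ReLU(),
                                      nn.Linear(hyper_dim, n_agents * embed_dim))
        self.hyper_w2 = nn.Sequential(nn.Linear(state_dim, hyper_dim), nn.ReLU(),
                                      nn.Linear(hyper_dim, embed_dim))
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(-1, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(-1, 1, 1)
        # nonnegative weights (via abs) enforce the monotonicity of Q_tot in each Q_n
        return (torch.bmm(hidden, w2) + b2).squeeze(-1).squeeze(-1)  # Q_tot: (batch,)
```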

Specifically, the procedure of the proposed QMIX-based DSA algorithm is summarized in Algorithm 1.

1: Initialization:
  The network environment and the experience replay buffer $\mathcal{D}$; the parameters $\theta$ of the hypernetwork and all of the agent networks;
2: Setting:
  The target-network parameters $\theta^{-}=\theta$, the learning rate $\alpha$, the discount factor $\gamma$, the batch size $B$, the maximum training epoch, episode, and slot $E_{\max}$, $I_{\max}$, $T$, the maximum train step $K_{\max}$;
3: [Centralized Training Phase]:
4: while epoch $\le E_{\max}$ do
5:  for episode $=1,\ldots,I_{\max}$ do
6:   for slot $t=1,\ldots,T$ do
7:    for each agent $n\in\mathcal{N}$ do
8:     Get observation $o_n(t)$, action $a_n(t)$, reward $r_n(t)$;
9:    end for
10:    Get the next observation $o_n(t+1)$;
11:    Store $(o_n(t), a_n(t))$ to the observation-action history;
12:   end for
13:   Store the episode data to the replay buffer $\mathcal{D}$;
14:  end for
15:  for each train step $k\le K_{\max}$ in the epoch do
16:   Sample a batch of $B$ episodes' experience from $\mathcal{D}$;
17:   for each slot in each sampled episode do
18:    Get $Q_{\mathrm{tot}}$ and $y^{\mathrm{tot}}$ from the evaluate-network and the target-network, respectively;
19:   end for
20:   Calculate the loss function $L(\theta)$, and update the evaluate-network parameters $\theta$;
21:   Update the target-network parameters $\theta^{-}$;
22:  end for
23:  Save the DRQN and QMIX network models;
24: end while
25: [Decentralized Executing Phase]:
26: Setting: $\epsilon=0$;
27: Input: The channel state;
28: Output: The agents' observations and actions.
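A compact sketch of one centralized update (lines 15-21 of Algorithm 1) is given below, assuming an episode batch stored as padded tensors and the AgentDRQN/QMixer sketches above; the tensor layout, dictionary keys, and the omission of padding masks are simplifications, not the authors' code.

```python
import torch
import torch.nn.functional as F

def train_step(batch, agent_net, target_agent_net, mixer, target_mixer,
               optimizer, gamma=0.99):
    """One centralized QMIX update.
    `batch` is assumed to hold padded episode tensors:
      obs (B,T,N,obs_dim), prev_act_onehot (B,T,N,A), act (B,T,N), act_onehot (B,T,N,A),
      reward (B,T), state (B,T,state_dim), next_obs and next_state shifted by one slot."""
    B, T, N = batch["act"].shape
    q_evals, q_targets = [], []
    h_eval = h_tgt = None
    for t in range(T):
        q_e, h_eval = agent_net(batch["obs"][:, t].reshape(B * N, -1),
                                batch["prev_act_onehot"][:, t].reshape(B * N, -1), h_eval)
        q_t, h_tgt = target_agent_net(batch["next_obs"][:, t].reshape(B * N, -1),
                                      batch["act_onehot"][:, t].reshape(B * N, -1), h_tgt)
        q_evals.append(q_e.view(B, N, -1))
        q_targets.append(q_t.view(B, N, -1))
    q_evals = torch.stack(q_evals, dim=1)        # (B,T,N,A)
    q_targets = torch.stack(q_targets, dim=1)

    # Q_n of the chosen actions, mixed into Q_tot using the global channel state
    chosen_q = q_evals.gather(-1, batch["act"].unsqueeze(-1).long()).squeeze(-1)
    q_tot = mixer(chosen_q.view(B * T, N), batch["state"].view(B * T, -1)).view(B, T)
    max_next_q = q_targets.max(dim=-1).values
    target_tot = target_mixer(max_next_q.view(B * T, N),
                              batch["next_state"].view(B * T, -1)).view(B, T)
    y_tot = batch["reward"] + gamma * target_tot.detach()

    loss = F.mse_loss(q_tot, y_tot)              # TD loss L(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```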
4.2. Computational Complexity Analysis

In the proposed QMIX-based DSA algorithm, a DRQN is adopted for each agent, which can well handle the Dec-POMDP problem. Besides, simple activation functions, e.g., ReLU and ELU, are employed in the algorithm, so the operation mainly involves matrix multiplication and addition. In particular, for the DRQN, assume that there are $L$ layers, the number of neural units in the $l$-th layer is $u_l$, and $u_0$ is the size of the input layer. Then, the number of multiplications through the DRQN can be presented as $\sum_{l=1}^{L} u_{l-1} u_l$, so for each agent, the computational complexity of one sample is $O\!\left(\sum_{l=1}^{L} u_{l-1} u_l\right)$. Note that the offline training is performed in parallel at the edge server. The training complexity of one batch of $B$ episodes over $T$ training slots is $O\!\left(BT\sum_{l=1}^{L} u_{l-1} u_l\right)$, and the whole computational complexity is $O\!\left(IBT\sum_{l=1}^{L} u_{l-1} u_l\right)$ until the algorithm converges over $I$ iterations. Further, the computational complexity of the execution phase is $O\!\left(\sum_{l=1}^{L} u_{l-1} u_l\right)$ per time step, since each SU acts locally. The complexity thus increases only linearly with the input scale, which greatly improves the efficiency of the algorithm; therefore, less computational resource is required in practice.
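As a quick sanity check of the counting above, the helper below computes $\sum_{l=1}^{L} u_{l-1}u_l$ for a given stack of layer widths; the example sizes are illustrative only.

```python
def drqn_mult_count(layer_sizes):
    """Multiplications in one forward pass through fully connected layers with
    widths [u_0 (input), u_1, ..., u_L]: the sum of u_{l-1} * u_l."""
    return sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))

# e.g., an 8-dimensional input, a 64-unit layer, a 64-dim recurrent layer treated
# as one dense layer, and a 4-action Q head (illustrative sizes only):
print(drqn_mult_count([8, 64, 64, 4]))   # -> 4864
```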

5. Simulation Results

In this section, we first provide the simulation parameter settings and then evaluate the performance of the proposed QMIX-based DSA scheme and the rationality of the defined problem via simulation.

Since the proposed algorithm contains a mixing network, a hypernetwork, and the agent networks, the corresponding network parameter settings are as follows: the mixing network, which derives the global action value from the local action values, has one hidden layer of 32 neurons with the nonlinear ELU as the activation function. The hypernetwork consists of one hidden layer of 64 neurons with ReLU as the activation function. Each agent network has one recurrent layer, a GRU with a 64-dimensional hidden state. Unless otherwise specified, the other simulation parameters are summarized in Table 2.

In particular, for the hyperparameters, the replay buffer is capacity-limited and can store 100 sets of data; the oldest data are removed when the buffer is full. The batch size for sampling is 16 episodes. The target networks of the DRQN are updated every 40 training steps. In the whole process, there are 2000 training epochs, each epoch has 100 episodes, and 20 time slots form one episode. In the training phase, to encourage the agents to explore the environment, the “explore and exploit” mechanism is employed to choose actions [42], and the exploration probability decays from 0.4 to 0.02 over 400 steps. Then, in order to evaluate the training performance in a timely manner, the distributed execution is conducted every four training epochs with $\epsilon=0$, where the agents make decisions only with their local models. For the environment setting, we assume that the channel states change in a periodic mode and that each SU can only sense one channel.
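For reference, the exploration schedule described above can be written as a simple annealing function; linear decay is assumed here, since the exact decay shape is not stated.

```python
def exploration_epsilon(step: int, eps_start: float = 0.4, eps_end: float = 0.02,
                        decay_steps: int = 400) -> float:
    """Anneal the exploration probability from eps_start to eps_end over
    decay_steps training steps (linear decay assumed), then hold it constant."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

assert abs(exploration_epsilon(0) - 0.4) < 1e-9
assert abs(exploration_epsilon(400) - 0.02) < 1e-9
```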

To verify the advantages of the proposed scheme, in Figure 4, we compare it with two other schemes: (1) the IQL-based DSA scheme [43] and (2) the VDN-based DSA scheme [44]. We take nine SUs and four channels, one of which is unavailable, to compare the performance of these schemes. The abscissa is the training epoch, and the ordinate is the normalized reward. The simulation results show that the performance curves of the IQL- and VDN-based schemes fluctuate considerably and are far from the best effect. It can be seen that the maximum value under IQL is only 0.17, while VDN is better than IQL but reaches only 0.32. The reasons can be explained as follows: in the IQL-based scheme, each agent operates independently during the whole learning process, which is not conducive to stability and convergence. In the VDN-based scheme, the global state information is not used during centralized training, and a simple weighted summation is used to decompose the joint value function to update the agents' strategies, leading to a poor training effect. In contrast, owing to the powerful fitting ability and the integration of global environment information, an excellent DSA effect is achieved by our proposed QMIX-based scheme, which therefore outperforms the other two schemes.

Figure 5 displays the sum of the rewards obtained by all users in an episode versus the training epoch for different numbers of SUs. Four channels are considered in the network, of which three are always available for the SUs and change periodically over time. It can be seen intuitively that, under the three different settings, the total episode reward increases gradually as the number of training epochs increases and finally reaches the maximum value within limited training. Note that negative rewards occur at the beginning of the training, which can be explained by some SUs selecting non-idle channels, since the agents' network models are still rough at that moment, even though four epochs of training have been done. In particular, when there are 15 SUs, the initial episode reward is about -30, and the longest time is needed for convergence. This is because a larger number of SUs means a larger computation dimension and a slower learning speed, which makes it more difficult for the SUs to learn and analyze the environment. That is, through the proposed method, the SUs first learn to capture the spectrum holes; on this basis, the load balance is realized, so as to obtain the optimal DSA. Besides, we can observe that, when the networks are well trained, a larger episode reward is achieved with more users, which is related to the definition of the global reward in the cooperative MARL environment, as it integrates all SUs' rewards.

In Figure 6, the behavior of the episode rewards under different numbers of idle channels is plotted. To facilitate the comparison, we vary the number of idle channels, and 12 SUs participate in the DSA. Likewise, we can observe that the curves fluctuate but increase overall and then converge to the maximum value as the training epoch increases. For the situation with fewer available channels, a slower convergence speed is observed, which means that fewer optional channels require more time for the multiple SUs to make "trial and error" attempts until the model is well trained, so as to achieve the optimal DSA. We can also observe that, at the beginning of the training, a quite low negative value, about -100, occurs when there are two unavailable channels. This can be explained by the fact that the more unavailable channels there are, the greater the probability that the SUs make wrong decisions and choose unavailable channels. It also reflects the strong influence of the number of unavailable channels on the learning effect. What is more, it is also shown that the more available channels there are, the larger the total episode reward obtained, which reveals that more available channels bring a better balance among these channels.

Figure 7 evaluates the performance with respect to the number of authorized channels, which is set to four, five, and six, while the PU occupies one, two, or three channels correspondingly to ensure the same number of available channels for the SUs. Considering that the SUs' sensing accuracy is also an important indicator of the DSA, the miss detection ratio is evaluated as well. From the presented reward curves, we can observe that in the three different cases, the episode rewards increase gradually. Since the settings differ little in the numbers of available channels and SUs, the curves are entangled with each other, and the values are relatively close throughout the whole process. Finally, they all converge to the same value, which is also the optimal value after the network model is fully trained.

Meanwhile, the miss detection ratio exhibits the opposite trend. This intuitively shows that the agents' detection ability is indeed weak when they have experienced little learning, which makes it easy for them to make erroneous decisions. We also find that, when more channels are occupied by the PU, the initial miss detection ratio is larger, which is consistent with the rewards at the starting stage. Then, as the training time increases, the SUs achieve perfect detection (a miss detection ratio of zero). At this point, compared with the reward curves, it can be seen that the reward has not yet reached the optimal value but converges after more training. This shows that, after overcoming the imperfect detection, it takes some additional time for the SUs to further realize load balance on the available channels.

To be more intuitive, Figures 8 and 9 present the results of the distributed execution in four consecutive channel states under the initial and final training phases, respectively. Here, the initial phase refers to the first four training epochs, while the final phase is the last epoch, when the models are fully trained. As an example, we assume that nine SUs independently make a choice from four authorized channels in different channel states.

We can see from Figure 8 that there are many wrong channel detections and selections in the four tested states. Since the SUs have little knowledge of the channel variation characteristics in the initial training phase, they access channels almost randomly. In particular, in state 2, although there are three idle channels, the SUs gather on the fourth channel. This causes congestion on the chosen channel and leaves the other idle channels underused, reducing the channel utilization. In contrast, we can see from Figure 9 that the DSA works well after the final training: not only does no SU select the occupied channel, but the SUs are also balanced across the available channels. The goal of the DSA in the considered scenario is achieved, and the effectiveness of the proposed algorithm is demonstrated. Besides, we have measured the real execution time under one of the channel states and found that it takes only about 26.59 ms for all SUs to complete the DSA, which further highlights the low computational complexity of the proposed algorithm. Moreover, since all SUs' collected spectrum data contain no errors after training, the availability, accuracy, and timeliness of the acquired spectrum data can be guaranteed.

6. Conclusion

In this paper, we have studied distributed DSA strategies for multiple cognitive users in a MEC-enabled network, where the spectrum environment is time-varying and the users make decisions with imperfect spectrum sensing. The DSA task, including capturing the spectrum holes and achieving load balance on the available channels, is investigated. We modeled the problem as a Dec-POMDP and proposed a QMIX-based DSA algorithm, which allows the users to offload their tasks to the MEC server to train the network models. We evaluated the system reward and the miss detection ratio of the DSA under the proposed algorithm. The results showed the rationality of the model and the effectiveness of the proposed algorithm.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This work was supported by National Natural Science Foundation of China under Grant Nos. 62171449, 61931020 and 62001483.