Abstract

The optimization of the performance of opportunistic spectrum access is considered in this study. A user with limited sensing capacity has opportunistic access to a communication system with multiple channels. In each time slot, the user can choose only several channels to sense and decides whether to access these channels based on the sensing information. The presence of sensing error is also considered. A reward is obtained when the user accesses a channel. The objective is to maximize the expected (discounted or average) reward accrued over an infinite horizon. This problem can be formulated as a partially observable Markov decision process. This study establishes the optimality of the simple and robust myopic policy, which focuses on maximizing the immediate reward, and shows that the myopic policy is optimal in cases of practical interest.

1. Introduction

The demand for radio spectrum has increased significantly in recent years with the emergence of new applications and the compelling need for mobile services. This is partly due to the growing interest of consumers in convenient and ubiquitous wireless services, which has been driving the evolution of wireless networks toward high-speed data networks. However, ever since the 1920s, in order to avoid serious interference among wireless services, wireless providers have been required to apply for an exclusive license from the government. Today, with most of the spectrum already allocated, it is becoming very difficult to find vacant bands either to deploy new services or to enhance existing ones [1]. On the other hand, not every channel in every band is in use all the time, and a large number of vacant spectrum holes can be found across the spectrum [2]. Opportunistic spectrum access techniques can effectively utilize these spectrum holes.

Spectrum sensing for detecting spectrum holes is the precondition for opportunistic spectrum access. However, existing spectrum sensing techniques face one main challenge: wideband sensing, which is hard to implement mainly because of hardware limitations [3]. The user usually employs a tunable narrowband bandpass filter at the radio frequency (RF) front-end to sense one channel at a time, because a wideband RF front-end is costly. Consequently, detecting all channels incurs a large time delay. Meng et al. [4] study this problem with a compressive sensing method based on sparse observations of the sensing information. With frequency-selective filters, the sparse sensing information vectors of multiple channels are linearly combined and compressed, so multiple channels can be sensed simultaneously. However, this approach is of limited practical interest because it requires sparse observations and frequency-selective filters. The basic theory of compressive sensing is given in [5–8]. Some other studies focus on the reliability of the sensing information. Different SNR estimates and channel fading environments are considered in [9, 10] to improve the reliability of the sensing information. Chen [11] studies the optimum number of collaborative users to balance reliability and complexity. Byzantine attacks, in which malicious users report false sensing data, are taken into account in [12].

The studies in [13–16] provide another way to address the wideband sensing problem by estimating the information of all channels from only a small number of sensing results. The sensing procedure is modeled as a partially observable Markov decision process (POMDP). Zhao et al. [13] propose this idea together with a myopic sensing method. Wang et al. [14] study the impact of rateless codes. Lingcen et al. [15] modify the cost function of the POMDP to account for the switching time.

We consider a communication system in which a user has opportunistic access to multiple channels, as in the model of [13], but is limited to sensing and transmitting on only several channels at a time because of hardware limitations. The presence of sensing error is also considered. We study the problem of maximizing the performance of opportunistic access given the past observations and knowledge of the stochastic properties of these channels. This problem can be described as a partially observable Markov decision process, since the user does not have full knowledge of the availability of all channels. We examine the optimality of the myopic policy for this problem because the myopic policy is simple and robust. Specifically, we show that the myopic policy is optimal in cases of practical interest. Ahmad et al. [16] also study the optimality of myopic sensing. However, the analysis in [16] applies only to the special case of single-channel myopic sensing in the absence of sensing error. When multichannel myopic sensing or sensing error is considered, the conclusion of [16] no longer holds. The reason is that the mathematical method in the proof of [16] is tailored to single-channel myopic sensing, and Lemmas 2, 3, 4, and 5 of [16] cannot be extended to prove optimality under the other conditions. We give a proof of the optimality of multichannel myopic sensing in the presence of sensing error. Our mathematical method is rigorous and quite different from that of [16]; we use two auxiliary functions in the proof, and the method is generally effective for such problems.

The rest of this paper is organized as follows. We formulate the problem in Section 2 and give the definition of the myopic policy in Section 3. We prove the optimality of the myopic policy in Section 4 and extend the results from the finite horizon to the infinite horizon. The numerical results of the performance comparison of the myopic policy and the optimal policy are given in Section 5, and the conclusion is drawn in Section 6.

2. System Model

Consider a spectrum consisting of independent and statistically identical wireless channels; each channel has two states, idle (1) and occupied (0), and the state transitions are governed by the two-state discrete-time Markov chain shown in Figure 1. These channels are assumed to evolve according to a synchronous time-slot structure. In particular, the state of the system in a given time slot is the vector of the states of all channels.

We consider a user seeking spectrum holes in these channels for opportunistic access. At the beginning of each time slot, the user selects a set of channels to sense and a set of channels to access. Due to hardware limitations, the numbers of sensed and accessed channels are usually much smaller than the total number of channels. The user can sense only the selected channels and does not know the states of all channels when a decision is made; consequently, the spectrum is not fully observable to the user. The action of the user in a time slot consists of the chosen sensing set together with the sensing information obtained on the channels in that set. Moreover, the sensing information cannot be assumed absolutely reliable in the presence of sensing error, so the detection and false alarm probabilities must be taken into account. Here, the detection probability and the false alarm probability [17] are the conditional probabilities of sensing a channel as occupied when it is actually occupied and when it is actually idle, respectively.
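
To make the model concrete, the following Python sketch simulates one slot of channel evolution and noisy sensing. The state encoding (1 for idle, 0 for occupied), the names p01, p11, pd, and pf, and the standard definitions of the detection and false alarm probabilities are assumptions introduced here for illustration; they are not taken verbatim from the paper's equations.

import random

def evolve(states, p01, p11):
    """Advance every channel one slot according to the two-state Markov chain:
    p01 = Pr(occupied -> idle), p11 = Pr(idle -> idle)."""
    return [1 if random.random() < (p11 if s == 1 else p01) else 0 for s in states]

def sense(states, sensed_set, pd, pf):
    """Return noisy observations (1 = sensed idle, 0 = sensed occupied) for the
    channels in sensed_set; pd is the probability of flagging an occupied channel
    as occupied, pf the probability of falsely flagging an idle one as occupied."""
    obs = {}
    for i in sensed_set:
        if states[i] == 0:                       # occupied channel
            obs[i] = 0 if random.random() < pd else 1
        else:                                    # idle channel
            obs[i] = 0 if random.random() < pf else 1
    return obs

# Example: 20 channels, 2 of them sensed in one slot (illustrative parameters).
states = [random.randint(0, 1) for _ in range(20)]
states = evolve(states, p01=0.2, p11=0.8)
print(sense(states, sensed_set=[0, 1], pd=0.9, pf=0.1))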

To make the optimal decision, a sufficient statistic of this system is maintained: for each channel, the conditional probability that the channel is idle at the beginning of the time slot given all past observations and actions. This conditional probability is called the idle conditional probability. Due to the Markovian property of the channels, the future idle conditional probabilities are functions only of the current idle conditional probabilities and the action.

Proposition 1. The relationship between the current and the updated idle conditional probabilities is given by (3).

Proof. To simplify the proof, we first consider the case in which the channel is sensed and observed as idle; the relevant events are denoted by shorthand symbols. Applying Bayes' rule to the observation and then the one-step transition probabilities, and using the fact that the next state is conditionally independent of the observation once the current state is determined, we obtain (3) for this case.
The proofs of the other cases are similar.
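
The kind of update stated in Proposition 1 can be sketched as follows, assuming a standard Bayesian correction with the detection probability pd and the false alarm probability pf followed by a one-step prediction through the Markov chain; the exact form of (3) is not reproduced here, so this is only an illustration under those assumptions (notation as in the sketch above).

def markov_predict(omega, p01, p11):
    """One-step prediction of the idle probability through the Markov chain."""
    return omega * p11 + (1.0 - omega) * p01

def belief_update(omega, observation, p01, p11, pd, pf):
    """Update the idle conditional probability of a sensed channel.
    observation: 1 if the channel was sensed idle, 0 if sensed occupied."""
    if observation == 1:   # sensed idle
        posterior = omega * (1 - pf) / (omega * (1 - pf) + (1 - omega) * (1 - pd))
    else:                  # sensed occupied
        posterior = omega * pf / (omega * pf + (1 - omega) * pd)
    return markov_predict(posterior, p01, p11)

# Channels that are not sensed are simply propagated one slot:
# markov_predict(omega, p01, p11).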

The objective of the user is to maximize its total (discounted or average) expected reward [18]. Here, the reward depends on the policy of the user, the discount factor, and the initial probability distribution of all channels, and the mathematical expectation is taken with respect to the initial probability distribution and the policy. The discounted reward and the average reward are therefore both defined for a given initial probability distribution and policy.

An optimal policy maximizes the reward of the user, and the optimization problem can be formally stated as maximizing the expected (discounted or average) reward over all policies, where the immediate reward is determined by the action of the user; we assume that each accessed channel brings one unit of reward. There remains the question of which channels to access. Without loss of generality, and following the greedy approach, all channels that are sensed as idle are selected for access.

Then, the reward function admits a recursive expression: the maximum expected reward accrued from a given time slot to the end of the horizon equals the maximum, over actions, of the expected immediate reward in that slot plus the discounted maximum expected reward accrued thereafter. In particular, the same recursion holds for the infinite horizon; this is established in Section 4.3.
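
As an illustration of this recursion, the sketch below evaluates the finite-horizon value of a belief vector by exhaustively enumerating sensing sets and observation outcomes. It reuses belief_update and markov_predict from the sketch above, assumes a discount factor beta, and assumes that one unit of reward is earned for each accessed channel that is actually idle, so the expected immediate reward of sensing channel i is beliefs[i] * (1 - pf); the paper's exact reward expression is not reproduced here.

from itertools import combinations, product

def value(beliefs, t, horizon, k, p01, p11, pd, pf, beta):
    """Maximum expected reward from slot t to the end of the horizon, computed
    by brute force; feasible only for a handful of channels and short horizons."""
    if t > horizon:
        return 0.0
    best = 0.0
    for sensed in combinations(range(len(beliefs)), k):
        # Expected immediate reward of accessing the channels sensed idle.
        immediate = sum(beliefs[i] * (1 - pf) for i in sensed)
        future = 0.0
        for outcome in product((0, 1), repeat=k):        # joint observation
            prob = 1.0
            nxt = [markov_predict(w, p01, p11) for w in beliefs]
            for i, obs in zip(sensed, outcome):
                w = beliefs[i]
                if obs == 1:
                    p_obs = w * (1 - pf) + (1 - w) * (1 - pd)
                else:
                    p_obs = w * pf + (1 - w) * pd
                prob *= p_obs
                nxt[i] = belief_update(w, obs, p01, p11, pd, pf)
            future += prob * value(nxt, t + 1, horizon, k, p01, p11, pd, pf, beta)
        best = max(best, immediate + beta * future)
    return best

# Example: 3 channels, sense 1 per slot, horizon of 3 slots.
print(value([0.5, 0.6, 0.7], t=1, horizon=3, k=1,
            p01=0.2, p11=0.8, pd=0.9, pf=0.1, beta=0.9))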

3. Myopic Policy

The myopic policy is essentially a greedy policy that maximizes the expected immediate reward in each time slot and ignores the future reward; this greedy policy has minimal time and computational complexity. Under the myopic policy, the channels with the largest idle conditional probabilities are selected for sensing. Consequently, the successive updates of the idle conditional probability vector determine the action of the myopic policy in each time slot.
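
In code, the myopic choice reduces to selecting the channels with the largest idle conditional probabilities; a minimal sketch (names are illustrative) follows. After sensing, the greedy access rule of Section 2 accesses every sensed channel whose observation is idle.

def myopic_sense_set(beliefs, k):
    """Indices of the k channels with the largest idle conditional probabilities."""
    order = sorted(range(len(beliefs)), key=lambda i: beliefs[i], reverse=True)
    return order[:k]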

In particular, if the updated probability of a channel observed as idle is larger than all the idle conditional probabilities and the updated probability of a channel observed as occupied is smaller than all of them, then the myopic policy requires only the initial ordering of the channels, not the precise values of the idle conditional probabilities. To explain this, we first simplify the update in (3) by defining an update function. When the state transitions are positively correlated, this function is monotonically increasing, so the ordering of the idle conditional probabilities is preserved when they are updated. If a channel is selected for sensing, its updated idle conditional probability becomes the largest when it is observed as idle and the smallest when it is observed as occupied. The myopic policy can therefore maintain a list that records the ordering of the idle conditional probabilities, initialized according to the initial condition. After each update, the channels that are not sensed keep their relative positions, the channels observed as idle are accessed and moved to the top of the list, and the channels observed as occupied are moved to the bottom of the list. Consequently, the myopic policy does not require the precise values of the updated idle conditional probabilities in this case.

The situation is the opposite when the state transitions are negatively correlated: the update function is then monotonically decreasing, so the ordering of the idle conditional probabilities is reversed when they are updated. The myopic policy again maintains a list that records the ordering of the idle conditional probabilities. After each update, the channels that are not sensed reverse their relative positions, the channels observed as idle are moved to the bottom of the list, and the channels observed as occupied are moved to the top of the list. Consequently, the myopic policy again does not require the precise values of the updated idle conditional probabilities in this case.
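
The order-based bookkeeping described in the two preceding paragraphs can be sketched as follows. Only a ranking of the channels (from the largest to the smallest idle conditional probability) is kept, never the probability values themselves; the flag positively_correlated distinguishes the two cases, and observations maps each sensed channel to 1 (idle) or 0 (occupied). The names and the handling of ties within each group are illustrative assumptions.

def myopic_update_order(order, observations, positively_correlated):
    """Update the channel ranking maintained by the myopic policy.
    order: channel indices sorted from the largest to the smallest idle
    conditional probability before the update.
    observations: {channel: 1 or 0} for the channels that were sensed."""
    sensed_idle = [c for c in order if observations.get(c) == 1]
    sensed_busy = [c for c in order if observations.get(c) == 0]
    unsensed = [c for c in order if c not in observations]
    if positively_correlated:
        # Ordering preserved: channels observed idle go to the top,
        # channels observed occupied go to the bottom.
        return sensed_idle + unsensed + sensed_busy
    # Negatively correlated case: the ordering is reversed, channels observed
    # occupied go to the top and channels observed idle go to the bottom.
    return (list(reversed(sensed_busy)) + list(reversed(unsensed))
            + list(reversed(sensed_idle)))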

4. Optimality of Myopic Policy

To show the optimality of the myopic policy, we first define two functions that represent the expected rewards obtained by the myopic policy and by an arbitrary policy. The first function denotes the expected total reward obtained by the myopic policy from a given time slot onward; its argument is the sequence of the idle conditional probabilities of all channels, reordered according to the sensing and access sets chosen by the myopic policy. The second function denotes the expected total reward obtained when an arbitrary policy is applied in the current time slot and the myopic policy is applied from the next time slot onward; its argument is again the sequence of the idle conditional probabilities of all channels, with the channels corresponding to its last entries being those sensed by the arbitrary policy, and the channels sensed as idle being those accessed.

In particular,

Theorem 2. When the horizon is finite, the optimality of the myopic policy at all times is equivalent to the first function being no smaller than the second for any time slot and any choice made by the arbitrary policy.

Proof. We first prove sufficiency by induction. The myopic policy is optimal at the final time slot because its expected reward is no smaller than that obtained under any arbitrary choice of channels.
Next, suppose that the myopic policy is optimal from some time slot onward. By the induction hypothesis, its expected reward from that slot onward is no smaller than the reward obtained by applying any other policy in that slot and the myopic policy thereafter, and the expected immediate reward obtained by the myopic policy in the preceding slot is the largest. Consequently, the expected reward of the myopic policy from the preceding slot onward is no smaller than that of any other policy, so the myopic policy is optimal at the preceding slot. The proof of sufficiency is complete.
The proof of necessity follows directly from the optimality of the myopic policy.

Lemma 3. Both reward functions are polynomials of order 1 (affine functions) in each idle conditional probability.

Proof. We prove this by induction over time. At the final time slot, both functions are polynomials of order 1 by their definitions.
Suppose that both functions are polynomials of order 1 at a given time slot. Then the corresponding expectations at the preceding slot are also polynomials of order 1, because the update is a linear function. Consequently, both functions are polynomials of order 1 at the preceding slot. The proof is complete.

4.1. The Case of Positively Correlated State Transitions

Assumption 4. The transition probabilities are such that the probability of remaining idle is no smaller than the probability of switching from occupied to idle; that is, the state transitions are positively correlated.
The update function is monotonically increasing under Assumption 4, so the ordering of any two idle conditional probabilities is preserved by the update.
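
As a quick numerical illustration (using the markov_predict sketch from Section 2 and illustrative parameter values), the one-step prediction preserves the ordering of two idle conditional probabilities when the transitions are positively correlated:

# Ordering is preserved when p11 >= p01 (Assumption 4).
w1, w2 = 0.3, 0.7
print(markov_predict(w1, p01=0.2, p11=0.8) < markov_predict(w2, p01=0.2, p11=0.8))  # True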

Assumption 5. We assume that, for any discount factor and all the idle conditional probabilities, the detection probability and the false alarm probability satisfy a bound restricting the sensing error; this condition can also be rewritten in an equivalent form.
Assumption 5 limits the unreliability of the sensing information; the optimal decision cannot be made if the information is very unreliable. In the absence of sensing error, the detection probability equals one and the false alarm probability equals zero, and Assumption 5 always holds.

Theorem 6. The myopic policy is optimal under Assumptions 4 and 5 when the horizon is finite.
To prove this theorem, one must show, according to Theorem 2, that the expected reward of the myopic policy is no smaller than that of any policy deviating from it in the current time slot, at every time slot. One proves this inductively: given that the inequality holds from a given time slot onward, one shows that it also holds at the preceding time slot. This relies on a number of lemmas introduced below.

Lemma 7. For all , , one has

Lemma 8. For any and , one has

Proof. Lemmas 7 and 8 follow directly from the definition of the expected reward function.

Lemma 9. One has

Proof. We use LHS and RHS to denote the left-hand side and the right-hand side of the equation, respectively. By Lemma 3, the expected reward function is a polynomial of order 1 in the two idle conditional probabilities concerned. Consequently, it can be written in an affine form whose coefficients are independent of these two probabilities, and substituting this form into both sides shows that LHS equals RHS. The proof is complete.

Lemma 10. Consider Assumption 4. One has for any .

Proof. We use LHS to denote the left-hand side of the inequality. We use to denote : Consequently, we have for any . Therefore, we have

Lemma 11. Consider Assumptions 4 and 5. For any and , if , one has

Proof. The inequality is true at time . We have the following equation for any time according to Lemma 9: Because , , we use LHS to denote . According to the definition of , we have
According to Lemma 10, we have Consequently, under Assumption 5: The proof is complete.

Now we can give the proof of Theorem 6 using Lemmas 7, 8, and 11.

Proof. The inequality holds at the final time slot by the definitions of the two functions. For any earlier time slot, any idle conditional probabilities, and any arbitrary choice of channels, the required chain of inequalities follows from Lemmas 7, 8, and 11. Consequently, the expected reward of the myopic policy is no smaller than that of any single-slot deviation at every time slot, and the myopic policy is optimal according to Theorem 2. The proof is complete.

4.2. The Case of Negatively Correlated State Transitions

Assumption 12. The transition probabilities are such that the probability of remaining idle is smaller than the probability of switching from occupied to idle; that is, the state transitions are negatively correlated.
The update function is monotonically decreasing under Assumption 12, so the ordering of any two idle conditional probabilities is reversed by the update.
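
The reversal can be checked numerically in the same way (again with the markov_predict sketch and illustrative values):

# Ordering is reversed when p11 < p01 (Assumption 12).
w1, w2 = 0.3, 0.7
print(markov_predict(w1, p01=0.8, p11=0.2) > markov_predict(w2, p01=0.8, p11=0.2))  # True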

Assumption 13. We assume that, for all the idle conditional probabilities, the discount factor satisfies a bound analogous to that in Assumption 5; as with Assumption 5, the condition can be rewritten in an equivalent form.

Theorem 14. The myopic policy is optimal under Assumptions 12 and 13 when the horizon is finite.

Lemma 15. Consider Assumptions 12 and 13. For any and , if , one has

Proof. The inequality is true at time . We have the following equation for any time according to Lemma 9: Because , , we use LHS to denote . According to the definition of , we have
We have the following inequalities due to the definition of :
Then, we have
Consequently, under Assumption 13. The proof is complete.

Now we can give the proof of Theorem 14 using Lemmas 7, 8, and 15. The proof is similar to that of Theorem 6.

4.3. The Infinite-Horizon Case

The above subsections establish the optimality of the myopic policy when the horizon is finite; we now consider the extension of the results to the infinite horizon.

Theorem 16. If the myopic policy is optimal when the horizon is finite, it is optimal when the horizon is infinite.

Proof. Throughout this proof, the finite-horizon reward of a policy denotes the expected reward accrued up to a given horizon under that policy.
Since the discount factor is strictly less than one and the rewards are bounded, the bounded convergence theorem allows the limit and the expectation to be interchanged. We then consider the relationship between the finite-horizon and infinite-horizon rewards by means of two sequences: Sequence 1 is the finite-horizon reward of the myopic policy as the horizon grows, and Sequence 2 is the finite-horizon reward of an arbitrary policy as the horizon grows.
Both sequences are monotonically increasing and bounded, so by the monotone convergence theorem they have finite limits. Because the myopic policy is optimal for every finite horizon, each term of Sequence 1 is no smaller than the corresponding term of Sequence 2 for any policy, and the same inequality therefore holds for the limits. The myopic policy is thus optimal when the horizon is infinite if it is optimal when the horizon is finite. The initial probability distribution has been defined in Section 2.
Then, we consider the uniqueness of the optimal policy. Writing the dynamic programming equation for the infinite-horizon discounted reward problem with the action of the myopic policy, the uniqueness of the optimal policy follows from the uniqueness of the dynamic programming solution.
The proof is complete.

Theorem 17. If the myopic policy is optimal for the discounted reward, it is optimal for the average reward.

Proof. We first consider the Blackwell optimality [19, pp. 336–341] of the optimal policy for the discounted reward. A sequence of discount factors increasing to one is considered, and the corresponding quantities are well defined for each of them due to the boundedness of the rewards and value functions.

Then, we can give the average cost optimality equation (ACOE) [20]. Here, we calculate the reward:

For we have

Consequently, we have

We can thus conclude that the stationary deterministic policy realizing the pointwise maximum on the right-hand side of the ACOE is the average-optimal policy, owing to the boundedness condition, and the corresponding constant is the maximum average expected reward [20, Theorems 4.1–4.3].

For discount factors sufficiently close to one, the optimal policy for the discounted reward also maximizes the right-hand side of the ACOE. Consequently, the myopic policy is optimal for the average reward when it is optimal for the discounted reward. The proof is complete.

5. Numerical Results

We consider twenty independent channels with the same bandwidth and the same transition probabilities. The sensing capacity of the user who transmits data on these channels is limited; that is, the user can sense only a limited number of channels in each sensing procedure. Two settings of the transition probabilities are used, one with positively correlated and one with negatively correlated state transitions, and the detection and false alarm probabilities are fixed. We present numerical results to evaluate the performance of the optimal policy (OP), which is the dynamic programming solution, and of the myopic policy (MP).
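
A simple way to reproduce this kind of comparison for the myopic policy is Monte Carlo simulation. The sketch below reuses the helpers from the earlier sketches (evolve, sense, belief_update, markov_predict, myopic_sense_set) and estimates the per-slot throughput and the collision probability of MP; the parameter values are illustrative placeholders rather than the exact settings behind the figures, and the dynamic programming solution used for OP is not reproduced here.

def run_myopic(slots=10000, n=20, k=2, p01=0.2, p11=0.8, pd=0.9, pf=0.1):
    """Monte Carlo estimate of the myopic policy's throughput (successful
    accesses per slot) and collision probability (fraction of accesses that
    hit an occupied channel)."""
    states = [1] * n
    beliefs = [p11] * n                          # illustrative initial beliefs
    accesses = successes = collisions = 0
    for _ in range(slots):
        states = evolve(states, p01, p11)
        sensed = myopic_sense_set(beliefs, k)
        obs = sense(states, sensed, pd, pf)
        for i in sensed:
            if obs[i] == 1:                      # greedy access rule
                accesses += 1
                if states[i] == 1:
                    successes += 1
                else:
                    collisions += 1
        beliefs = [belief_update(beliefs[i], obs[i], p01, p11, pd, pf)
                   if i in obs else markov_predict(beliefs[i], p01, p11)
                   for i in range(n)]
    return successes / slots, collisions / max(accesses, 1)

print(run_myopic())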

We first use the throughput of the policies to evaluate the performance. The upper subfigure of Figure 2 shows the performance comparison of OP and MP in the positively correlated setting. The myopic policy is the optimal policy in this case because Assumptions 4 and 5 are met, and we observe that the performance of OP is essentially the same as MP's. The lower subfigure shows the performance comparison of OP and MP in the negatively correlated setting. We observe that the difference between OP and MP grows with the sensing capacity because Assumption 13 is not met; for example, the curves of MP and OP separate at the point where the condition in Assumption 13 fails.

Then, we use the collision probability, that is, the probability that the user accesses an occupied channel, to evaluate the performance. The collision probability is the key metric of the interference caused by the user. Figure 3 gives results similar to those of Figure 2. The upper subfigure shows the collision probabilities of OP and MP in the positively correlated setting; we observe that they have the same collision probabilities because the myopic policy is the optimal policy. The lower subfigure shows the collision probabilities of OP and MP in the negatively correlated setting. The myopic policies under different discount factors have the same collision probability because they are the same policy, whereas the optimal policies differ for different discount factors.

Finally, we compare the time complexity of OP and MP in Table 1. Since the other parameters do not affect the time complexity of the policies, we mainly consider the variation of the sensing capacity. The first column of the table is the sensing capacity; the second and third columns show the time overhead of OP and MP, respectively, in seconds. The time overhead of MP is nearly zero because MP does not need to compute any parameter. In particular, the time overhead of OP is also almost zero when the sensing capacity is 20, because OP can then directly choose the channels observed as 1 (idle). From Table 1, we find that the time complexity of MP is much smaller than that of OP.

6. Conclusion

We have shown the optimality of the simple and robust myopic policy for the infinite-horizon discounted and average reward criteria in the case where the stochastic evolution of the channels can be modeled by independent and identically distributed two-state Markov chains. The myopic policy is optimal when the state transitions are positively correlated and the detection and false alarm probabilities are suitably bounded. The myopic policy is also optimal when the state transitions are negatively correlated and the discount factor is suitably bounded.

Acknowledgments

This work was partly supported by the National Natural Science Foundation of China under Grants no. 61074033 and no. 61233003 and by the Doctoral Fund of the Ministry of Education of China under Grant no. 20093402110019.