Abstract

A convolutional self-attention network-based channel state information reconstruction method is presented to address the low reconstruction accuracy of channel state information in Multiple-Input Multiple-Output (MIMO) systems at high compression rates. First, a channel state information reconstruction model based on an encoder-decoder structure is built: features are extracted by the encoder's convolutional network, and the information is compressed by an added attention block. The compressed information is then nonuniformly quantized, converting continuous values into discrete values, so that transmission does not consume too much bandwidth. A dequantization module and an attention block are added to the decoder to reduce the impact of noise on the matrix and to increase reconstruction accuracy, and a long-time cosine annealing training approach is used. According to the simulation results, compared with CsiNet, Lightweight CNN, CRNet, and CLNet, the convergence speed is improved by 17.64%, indoor reconstruction accuracy is improved by an average of 37.4%, and outdoor reconstruction accuracy is improved by an average of 32.5% across all compression ratios.

1. Introduction

With the rapid development of information technology, the massive MIMO system in the 5th generation mobile communication system (5G) can meet the requirements of high reliability and large system capacity for wireless communication [1]. MIMO can boost system capacity and decrease multiuser interference by equipping the base station (BS) with hundreds or even thousands of antennas in a distributed or centralized way. This benefit is realized by obtaining accurate channel state information (CSI). In a massive MIMO system with time-division duplexing (TDD), the downlink CSI can be obtained through channel reciprocity and fed back to the base station. In massive MIMO systems using frequency-division duplexing (FDD), users must first estimate the downlink CSI and then send it back to the BS. As the number of antennas grows, the cost of CSI feedback increases rapidly, and compressing the CSI leads to a loss of feedback precision.

Initially, vectorization or codebook-based methods were used to reduce the overhead, but their feedback overhead is proportional to the number of antennas, which is not feasible in large-scale MIMO. To reduce the cost of CSI feedback while ensuring the accuracy of CSI reconstruction, methods based on compressed sensing (CS) [2] were proposed; examples include the least absolute shrinkage and selection operator (LASSO) [3] and approximate message passing (AMP) [4]. They exploit the spatial and temporal correlation of the CSI to improve reconstruction precision, but they must assume that the CSI matrix is sparse in order to apply CS to feedback or distributed compressed channel estimation [5]; moreover, most such reconstruction methods are iterative algorithms, which have high complexity and do not obtain good results.

Some scholars have introduced deep learning (DL) algorithms into the massive MIMO feedback scheme, which provides a new way to solve the traditional CSI reconstruction problem. Reference [6] applies the convolutional neural network (CNN) to CSI feedback and proposes an autoencoder-based channel state information network (CsiNet), which does not rely on knowledge of the channel distribution and instead learns to exploit the channel structure directly from training data, resulting in faster reconstruction. However, it only attends to sparsity in the angular-delay domain and ignores spatial correlation, which causes a sharp drop in accuracy at low compression ratios (CRs). References [7–9] design time-aware CSI networks that use a CNN and a recurrent neural network (RNN) to extract spatial and temporal features, respectively, achieving significant robustness at low CR. However, the RNN is essentially an iterative algorithm, which increases the computational complexity and running time of the original network. A lightweight convolutional neural network (Lightweight CNN) [10] was therefore proposed to reduce the computational complexity without decreasing the reconstruction accuracy. This method uses a small convolution kernel in the encoder/decoder layers, which can only extract edge information within a very small region, so specific features cannot be extracted from sparse regions of the CSI matrix. In Reference [11], CsiNet+ with a large convolutional kernel is designed to gather more information by enlarging the receptive field; even in the sparse domain of the CSI, the large receptive field can capture enough features, further improving the precision of CSI feedback. However, during CSI feedback the sparsity varies with channel scenario and CR, and CsiNet+ cannot adapt to all situations at once, so its reconstruction accuracy is still insufficient. In Reference [12], the channel reconstruction network with a multiresolution neural network (CRNet) is proposed, which extracts information at multiple scales. CRNet designs two kinds of convolution kernels to handle different CRs, guaranteeing reconstruction accuracy, and the large convolution kernel is optimized to reduce the computational complexity of the network. Reference [13] preserves the real and imaginary parts of the CSI matrix and introduces a convolutional block attention module (CBAM) in the angular-delay domain to suppress unwanted noise, obtaining good accuracy. However, the CSI matrix is a set of interrelated data, and the ability of traditional convolutional reconstruction methods to match the in-phase and quadrature samples still needs to be improved; important information is missed, which reduces reconstruction accuracy at low CR and in outdoor environments.

In the aforementioned methods, the CSI matrix is treated as an image, and the convolutional neural network can extract only part of its features, which decreases reconstruction accuracy; in addition, noise in the actual channel interferes with CSI matrix reconstruction, and transmitting the compressed matrix consumes excessive bandwidth. To address these issues, together with the design of the training method, this paper proposes a channel state information reconstruction method based on a convolutional self-attention network (CSANet). The method removes redundant convolutional layers and adds self-attention blocks that analyze and extract features from the entire CSI matrix to improve reconstruction accuracy. A CSI reconstruction model based on an encoder-decoder structure is constructed, in which convolutional layers assist the self-attention block in obtaining features, and the CSI matrix is then compressed by the self-attention block. The compressed CSI matrix is subsequently quantized to reduce bandwidth consumption and mitigate the effect of noise. In the decoder, a dequantization module restores the quantized data to high-accuracy compressed CSI, and the self-attention decoding block then restores the compressed data to the CSI matrix. To further improve the reconstruction accuracy, a new training scheme with more training epochs and a learning rate that varies with the training epoch is introduced.

2. System Model

In this paper, a single-cell downlink massive MIMO system is considered, with $N_t$ antennas at the BS and a single receive antenna at the UE. The system transmits over $N_c$ subcarriers using Orthogonal Frequency Division Multiplexing (OFDM). The received signal at the $n$th subcarrier is given by
$$y_n = \tilde{\mathbf{h}}_n^H \mathbf{v}_n x_n + z_n,$$
where $\tilde{\mathbf{h}}_n \in \mathbb{C}^{N_t}$, $\mathbf{v}_n \in \mathbb{C}^{N_t}$, $x_n$, and $z_n$ represent the frequency-domain channel vector, precoding vector, data-bearing symbol, and additive white Gaussian noise of the $n$th subcarrier, respectively. Then, $\tilde{\mathbf{H}} = [\tilde{\mathbf{h}}_1, \ldots, \tilde{\mathbf{h}}_{N_c}]^H \in \mathbb{C}^{N_c \times N_t}$ can be regarded as the stacked CSI in the spatial-frequency domain, i.e., the downlink channel matrix. Once the BS obtains this matrix, it can generate the precoding vectors; in the FDD system, the UE returns $\tilde{\mathbf{H}}$ to the BS via the feedback link, so the number of parameters to be fed back reaches $N_c N_t$, which is not affordable for limited feedback links. This article focuses on the reconstruction scenario, so it is assumed that perfect CSI is acquired through training [14].

To reduce the overhead, we increase the sparsity by transforming to the angular-delay domain using the discrete Fourier transform (DFT):
$$\mathbf{H} = \mathbf{F}_d \tilde{\mathbf{H}} \mathbf{F}_a^H,$$
where $\mathbf{F}_d$ and $\mathbf{F}_a$ are $N_c \times N_c$ and $N_t \times N_t$ DFT matrices. Because the delay spread is finite, only the first $N_a$ rows ($N_a < N_c$) of the sparse matrix $\mathbf{H}$ have significant nonzero elements, and the remaining rows are almost zero. We therefore keep the first $N_a$ rows and remove the rest. Still denoting the truncated matrix by $\mathbf{H}$, the total number of real-valued parameters required for feedback is $N = 2 N_a N_t$, which, although partially reduced, is still a large number in large-scale MIMO systems.
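To make the transform concrete, the Python sketch below applies the DFT pair and the delay-domain truncation to one CSI sample; the function name `to_angular_delay` and the unitary normalization are illustrative assumptions, not details from the paper.

```python
import numpy as np

def to_angular_delay(H_sf: np.ndarray, Na: int) -> np.ndarray:
    """Transform a spatial-frequency CSI matrix (Nc x Nt, complex)
    into the angular-delay domain and keep the first Na delay rows."""
    Nc, Nt = H_sf.shape
    Fd = np.fft.fft(np.eye(Nc)) / np.sqrt(Nc)  # Nc x Nc unitary DFT matrix
    Fa = np.fft.fft(np.eye(Nt)) / np.sqrt(Nt)  # Nt x Nt unitary DFT matrix
    H_ad = Fd @ H_sf @ Fa.conj().T             # sparse in the angular-delay domain
    return H_ad[:Na, :]                        # finite delay spread: truncate
```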

The matrix $\mathbf{H}$ is then compressed by the encoder into the codeword $\mathbf{s}$ according to the given compression ratio (CR):
$$\mathbf{s} = f_{\mathrm{en}}(\mathbf{H}),$$
where $f_{\mathrm{en}}(\cdot)$ represents the encoding process. The encoder transforms the channel matrix into an $M$-dimensional vector, where $M < N$, and the data compression ratio is $\eta = M/N$.

Once the codeword is received by the BS, the decoder is used to reconstruct the channel:
$$\hat{\mathbf{H}} = f_{\mathrm{de}}(\mathbf{s}),$$
where $f_{\mathrm{de}}(\cdot)$ stands for the decoder. Therefore, the whole reconstruction process can be expressed as
$$\hat{\mathbf{H}} = f_{\mathrm{de}}\big(f_{\mathrm{en}}(\mathbf{H})\big).$$

The goal of this paper is to minimize the difference between the original $\mathbf{H}$ and the reconstruction $\hat{\mathbf{H}}$, which can be expressed as finding the encoder and decoder parameter sets $(\Theta_{\mathrm{en}}, \Theta_{\mathrm{de}})$ that satisfy
$$(\hat{\Theta}_{\mathrm{en}}, \hat{\Theta}_{\mathrm{de}}) = \underset{\Theta_{\mathrm{en}},\, \Theta_{\mathrm{de}}}{\arg\min} \left\| \mathbf{H} - f_{\mathrm{de}}\big(f_{\mathrm{en}}(\mathbf{H}; \Theta_{\mathrm{en}}); \Theta_{\mathrm{de}}\big) \right\|_2^2.$$
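In code, this objective amounts to end-to-end mean-squared-error training of the encoder-decoder pair. Below is a minimal PyTorch sketch, with `encoder` and `decoder` standing in for the networks described in Section 3:

```python
import torch
import torch.nn.functional as F

def train_step(encoder, decoder, H, optimizer):
    """One gradient step on the reconstruction objective: jointly update
    the encoder/decoder parameters to minimize ||H - H_hat||_2^2."""
    optimizer.zero_grad()
    H_hat = decoder(encoder(H))   # s = f_en(H), H_hat = f_de(s)
    loss = F.mse_loss(H_hat, H)   # squared-error criterion
    loss.backward()
    optimizer.step()
    return loss.item()
```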

3. Network Structure

In this section, the proposed framework is introduced, including the neural network architecture, the quantization and dequantization modules, and the network training method.

3.1. Neural Network Structure

In this paper, a CSI reconstruction method based on a convolutional self-attention neural network (CSANet) is proposed. As shown in Figure 1, CSANet uses a self-attention mechanism to analyze the entire CSI matrix and obtain global characteristics, improving accuracy. In the following, the overall structure of CSANet, the attention encoder block, and the attention decoder block are described.

3.1.1. CSANet Network Structure

As shown in Figure 1, the input dimensions $2 \times N_a \times N_t$ represent the number of channels, height, and width of the CSI matrix, respectively. The reconstruction network consists of an encoder and a decoder, and the performance of the CSI reconstruction scheme depends largely on the compression stage; that is, the less information lost during compression in the encoder, the higher the reconstruction accuracy. At the same time, because the computing capacity of the UE is limited, the network structure should not be designed too deep.

First, a 3-layer convolutional network is used to extract information from the input matrix. Since the traditional linear rectification function (ReLU) discards negative values, this paper places a batch normalization layer and a leaky ReLU activation after each convolutional layer, which also counters the vanishing-gradient problem. Features are extracted by the convolutional network first because convolutional operations were found to improve reconstruction accuracy when applied before the data enter the self-attention block. Traditional convolutional networks split the real and imaginary values of the complex CSI into two matrices for separate reconstruction, severing the intrinsic relationships within the CSI matrix; this paper instead convolves the real and imaginary parts jointly as a $2 \times N_a \times N_t$ tensor. The feature matrix is then passed into two cascaded attention encoder blocks, which, while keeping the network shallow, further extract global features and mark the important information within them. Finally, the CSI matrix is compressed into an $M$-dimensional codeword according to the required compression ratio $\eta$.
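A PyTorch sketch of this encoder is given below. The paper does not list exact layer widths, so the channel counts, embedding dimension, number of attention heads, and the use of `nn.TransformerEncoderLayer` as a stand-in for the attention encoder block are all illustrative assumptions:

```python
import torch
import torch.nn as nn

class CSANetEncoder(nn.Module):
    """Sketch: 3-layer conv stem (BatchNorm + leaky ReLU after each
    convolution), two attention encoder blocks, linear compression."""
    def __init__(self, Na=32, Nt=32, embed_dim=64, M=512):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.BatchNorm2d(16), nn.LeakyReLU(0.3),
            nn.Conv2d(16, 16, 3, padding=1), nn.BatchNorm2d(16), nn.LeakyReLU(0.3),
            nn.Conv2d(16, 2, 3, padding=1), nn.BatchNorm2d(2), nn.LeakyReLU(0.3),
        )
        self.proj = nn.Linear(2 * Nt, embed_dim)                 # one token per delay row
        self.pos = nn.Parameter(torch.zeros(1, Na, embed_dim))   # position coding
        block = nn.TransformerEncoderLayer(embed_dim, nhead=4,
                                           dim_feedforward=128, batch_first=True)
        self.attn = nn.TransformerEncoder(block, num_layers=2)   # 2 encoder blocks
        self.compress = nn.Linear(Na * embed_dim, M)             # codeword of length M

    def forward(self, h):                       # h: (B, 2, Na, Nt), real+imag channels
        x = self.stem(h)                        # convolution-assisted features
        x = x.permute(0, 2, 1, 3).flatten(2)    # (B, Na, 2*Nt) token sequence
        x = self.attn(self.proj(x) + self.pos)  # global features via self-attention
        return self.compress(x.flatten(1))      # (B, M), with M = eta * N
```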

For the BS to analyze channel scattering and fading from the CSI, a decoder at the BS must reconstruct the compressed codeword. A decoder based on self-attention can reconstruct the CSI more quickly and with fewer parameters and operations; therefore, the decoder uses only the fully self-attentive network, as shown in the decoder module in Figure 1.

The decoder consists of a fully connected layer and three self-attention decoder blocks. First, the codeword produced by the encoder is expanded back to the CSI matrix dimensions by the fully connected layer. To improve the precision of CSI reconstruction, this paper adopts three cascaded attention decoder blocks; finally, the CSI matrix values are mapped by the logistic regression function (sigmoid) to obtain the fully recovered CSI matrix.
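Continuing the sketch above, a possible decoder mirrors the encoder without the convolutional stem; again, all dimensions and the reuse of a standard transformer layer for the attention decoder block are assumptions for illustration:

```python
import torch
import torch.nn as nn

class CSANetDecoder(nn.Module):
    """Sketch: fully connected expansion, three attention decoder
    blocks, linear output layer, sigmoid mapping into (0, 1)."""
    def __init__(self, Na=32, Nt=32, embed_dim=64, M=512):
        super().__init__()
        self.Na, self.embed_dim = Na, embed_dim
        self.expand = nn.Linear(M, Na * embed_dim)               # restore matrix shape
        self.pos = nn.Parameter(torch.zeros(1, Na, embed_dim))
        block = nn.TransformerEncoderLayer(embed_dim, nhead=4,
                                           dim_feedforward=128, batch_first=True)
        self.attn = nn.TransformerEncoder(block, num_layers=3)   # 3 decoder blocks
        self.out = nn.Linear(embed_dim, 2 * Nt)

    def forward(self, s):                                  # s: (B, M) codeword
        x = self.expand(s).view(-1, self.Na, self.embed_dim)
        x = self.attn(x + self.pos)                        # refine global features
        x = self.out(x)                                    # (B, Na, 2*Nt)
        x = x.view(-1, self.Na, 2, x.shape[-1] // 2).permute(0, 2, 1, 3)
        return torch.sigmoid(x)                            # (B, 2, Na, Nt) in (0, 1)
```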

3.1.2. Attention Encoder Block and Attention Decoder Block

The attention encoder block, shown in Figure 2, consists of a multiheaded self-attention block, a multilayer perceptron (MLP), and corresponding normalization layers. The encoder block is designed as a residual structure to prevent gradient vanishing when parameter updates become too small.

To make full use of the spatial characteristics of the CSI, positional coding is first applied to capture the spatial relationships among the data. The data then enter the multiheaded self-attention block, where the various features of the CSI are analyzed, and the network weights and biases are further learned through the multilayer perceptron.

The decoder block, shown in Figure 3, is similar to the encoder block: the input information is decompressed and characterized by a multiheaded self-attention mechanism. After the multilayer perceptron, the feature matrix is restored to the size of the CSI matrix by a linear layer, and then the original CSI matrix is reconstructed by a sigmoid that maps the features to the (0, 1) range.

The attention block treats the CSI data as a matrix $X$. First, $X$ is linearly transformed to generate $Q$, $K$, and $V$:
$$Q = X W^Q, \quad K = X W^K, \quad V = X W^V,$$
where $W^Q$, $W^K$, and $W^V$ are learned linear transformations and $d_k$ is the dimension of $K$.

The multiheaded self-attention block consists of several scaled dot-product attention blocks running in parallel, whose outputs are concatenated and linearly projected to form the output of the multiheaded self-attention block, as shown in Figure 4(b).

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key [15]. As shown in Figure 4(a), the attention block first computes the self-attention scores from $Q$ and $K$ and then multiplies the result by $V$, where the self-attention formula is
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V.$$
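A direct PyTorch rendering of this function follows; the tensor shapes are assumed to be (batch, sequence, dimension):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # self-attention scores
    weights = torch.softmax(scores, dim=-1)            # normalize over the keys
    return weights @ V                                 # weighted sum of the values
```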

3.2. Quantization Module

In the traditional convolutional-network CSI feedback process, the compressed matrix is transmitted directly, which occupies a large amount of bandwidth and is difficult to recover under noise; the compressed matrix therefore needs to be quantized. The quantized bit stream has better antijamming ability than the raw continuous values and is convenient for storage. The UE encoder in CSANet converts its output into a bit stream by quantizing the CSI codeword after the encoder. Once the bit stream is received at the BS, the quantized values are dequantized before being fed into the decoder, as shown in Figure 5.

In [16], uniform quantization is used to discretize the measurement vector, but uniform quantization is not suitable for small signals. The CSI signal has a small signal-to-noise ratio (SNR) and cannot meet the assumptions of uniform quantization, which limits the input range, so uniform quantization is not the best choice for compressing and reconstructing the CSI matrix. This paper therefore adopts nonuniform quantization for the CSI signal and uses an A-law nonuniform quantizer, whose compression function $F(x)$ is
$$F(x) = \begin{cases} \dfrac{A|x|}{1+\ln A}\,\mathrm{sgn}(x), & 0 \le |x| \le \dfrac{1}{A}, \\[2mm] \dfrac{1+\ln(A|x|)}{1+\ln A}\,\mathrm{sgn}(x), & \dfrac{1}{A} < |x| \le 1, \end{cases}$$
where $x$ is the input value of the matrix and the region $|x| \le 1/A$ corresponds to a weak signal. $A$ is a constant that determines the degree of companding; the value used here is $A = 87.6$, the current international standard value. In this case, when the signal is small, the formula above shows that the signal is amplified by a factor of $A/(1+\ln A) \approx 16$, so the quantization interval is 16 times smaller than with uniform quantization and the quantization error is greatly reduced.
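The sketch below implements this A-law compressor, its inverse expander for the BS side, and a uniform quantizer applied between them. Inputs are assumed normalized to [-1, 1], and the bit width defaults to the 6 bits evaluated in Section 4; the function names are illustrative:

```python
import numpy as np

A = 87.6  # international standard companding constant

def a_law_compress(x):
    """A-law compression function F(x) from the formula above."""
    ax = np.abs(x)
    small = ax <= 1.0 / A                      # weak-signal region: ~16x gain
    y = np.where(small,
                 A * ax / (1 + np.log(A)),
                 (1 + np.log(np.clip(A * ax, 1, None))) / (1 + np.log(A)))
    return np.sign(x) * y

def a_law_expand(y):
    """Inverse of F(x), applied at the BS before the decoder."""
    ay = np.abs(y)
    small = ay <= 1.0 / (1 + np.log(A))
    x = np.where(small,
                 ay * (1 + np.log(A)) / A,
                 np.exp(ay * (1 + np.log(A)) - 1) / A)
    return np.sign(y) * x

def quantize_dequantize(x, bits=6):
    """Compress, uniformly quantize to 2**bits levels, then expand."""
    y = a_law_compress(x)
    levels = 2 ** bits
    y_q = np.round((y + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1
    return a_law_expand(y_q)
```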

3.3. Training Scheme

Previous convolutional neural network models such as CsiNet and CLNet fix the number of training epochs at 1000, but no ablation experiments have been reported to explain why 1000 is best, and some articles omit the description of the training regimen entirely. In fact, for a neural network to achieve better accuracy, it is essential not only to design a network structure suited to the problem but also to adopt a training scheme suited to that structure. Therefore, finding an appropriate training scheme for CSI reconstruction is of great significance.

On the one hand, if overfitting does not occur, results improve as the number of training epochs increases; whether it occurs depends on the data and on the complexity of the model. For a CSI reconstruction network, the CSI matrix is a random but internally correlated set of data, unlike computer vision tasks where a single image contains multiple targets; moreover, the CSI reconstruction network is very small compared with most neural networks, for example, the common ResNet50 at about 3.9 G FLOPs, roughly 87 times that of CSANet. Overfitting is therefore unlikely for CSANet, so this paper trains for as long as practical. On the other hand, an appropriate learning rate is also key to training. This paper uses the Adam optimizer; although Adam adapts its update steps dynamically, it still operates around a base learning rate throughout training, so scheduling that rate remains important.

The experiments in this paper show that, for CSI reconstruction, better accuracy is obtained with warm-up training followed by a cosine annealing learning rate, as shown in Figure 6, where the horizontal axis is the training epoch (with 100 epochs as an example) and the vertical axis is the learning rate.

In the early stages of training, the network needs a large learning rate to converge quickly and locate promising regions. As training continues, the learning rate must be reduced so that the parameters gradually approach the optimum; under the cosine annealing schedule the learning rate changes smoothly, which makes training more stable and thus improves reconstruction precision.
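A minimal sketch of this schedule is shown below; the warm-up length and the learning-rate bounds are illustrative assumptions, not the paper's reported values:

```python
import math

def lr_at_epoch(epoch, total_epochs=2000, warmup=30,
                lr_max=1e-3, lr_min=1e-5):
    """Linear warm-up to lr_max, then cosine annealing down to lr_min."""
    if epoch < warmup:
        return lr_max * (epoch + 1) / warmup           # warm-up phase
    t = (epoch - warmup) / max(1, total_epochs - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

# applied once per epoch to the Adam optimizer, e.g.:
# for group in optimizer.param_groups:
#     group["lr"] = lr_at_epoch(epoch)
```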

4. Simulation Analysis

4.1. Simulation Parameter Setting

For comparison with CsiNet, the data set in this article was generated using the COST2100 model [17]: (1) the indoor picocellular scenario at the 5.3 GHz band and (2) the outdoor rural scenario at the 300 MHz band. All parameters follow their default settings in [17]. The BS is positioned at the center of a square area with side lengths of 20 m and 400 m for the indoor and outdoor scenarios, respectively, whereas the UEs are randomly positioned in the square area per sample, with specific parameters shown in Table 1. The whole process is implemented in PyTorch. Regarding the training method, too large an initial learning rate is not conducive to convergence, so the initial and final values of the learning rate for the cosine annealing algorithm are set to small values.

4.2. Analysis of Simulation Results

To evaluate the performance of the system, the normalized mean square error (NMSE) is used to measure the difference between the original channel matrix $\mathbf{H}$ and the reconstructed channel matrix $\hat{\mathbf{H}}$:
$$\mathrm{NMSE} = E\left\{ \frac{\|\mathbf{H} - \hat{\mathbf{H}}\|_2^2}{\|\mathbf{H}\|_2^2} \right\}.$$
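Computed in PyTorch over a batch of samples, the metric looks as follows (a sketch; the dB conversion is the usual reporting convention):

```python
import torch

def nmse_db(H, H_hat):
    """NMSE = E{ ||H - H_hat||_2^2 / ||H||_2^2 }, reported in dB."""
    err = (H - H_hat).flatten(1).pow(2).sum(dim=1)   # per-sample squared error
    ref = H.flatten(1).pow(2).sum(dim=1)             # per-sample channel power
    return 10 * torch.log10((err / ref).mean())
```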

4.2.1. Ablation Experiment

To compare the constant learning rate with the adaptive learning rate of the long-time cosine annealing algorithm, a simulation experiment is designed and analyzed. The simulation results are shown in Table 2 and Figure 7.

As shown in Figure 7, the horizontal axis gives the training epoch and the vertical axis gives the loss value for each method. CSANet-train and CsiNet-train denote the training losses of CSANet and CsiNet, and CSANet-test and CsiNet-test denote the corresponding test losses. It can be seen that, as training progresses, the test loss keeps fluctuating around the training loss. Therefore, the CSI reconstruction task is a weakly fitting task and can be trained over more epochs. At the same time, as the training epoch increases, the loss of CSANet trained with long-time cosine annealing decreases faster than that of CsiNet trained with a constant learning rate, and it reaches convergence earlier; the convergence speed increases by 21.22%. This demonstrates that long-time cosine annealing is superior to constant-learning-rate training.

As shown in Table 2, CSANet-const denotes training with a constant learning rate and CSANet-cosine denotes long-time cosine annealing training. The long-time cosine annealing method is superior to constant-learning-rate training: as the training epoch increases, the NMSE obtained with long-time cosine annealing is lower and more accurate. When training reaches 2000 epochs, the reconstruction accuracy is still increasing, but it approaches the optimal value and the benefit of further training becomes very small.

For the CSI reconstruction task, CSI compression and transmission need to operate in real time. When the base station is in different environments, the generated channel state information also differs, so the trade-off between reconstruction accuracy and training length must be weighed. Table 2 shows that 3000 epochs merely prolong training without yielding a commensurate gain in reconstruction accuracy. Therefore, this paper sets the training length to 2000 epochs.

To better compare the gains of the self-attention network (SA), the convolutional neural network (CNN), and the convolutional self-attention network (CNNSA) on the CSI reconstruction task, ablation experiments based on CsiNet were designed, as shown in Table 3 and Figure 8. CNN+SA denotes a convolutional encoder with an attention decoder, SA+SA replaces both encoder and decoder with self-attention networks, and CNNSA+SA uses the convolutional self-attention encoder with an attention decoder. In Figure 8, the horizontal axis is the compression ratio and the vertical axis is the NMSE performance. As convolutional networks are gradually replaced by attention networks, the reconstruction accuracy at low compression rates improves. At high compression rates, however, using the self-attention network alone reduces precision because small-area features cannot be extracted. This paper therefore adds a convolutional network at the encoder input to help the attention network acquire features at high compression rates. The experimental results show that the convolutional self-attention network achieves higher reconstruction accuracy at all compression rates.

To compare the proposed nonuniform quantization module with uniform quantization and no quantization, we designed an ablation experiment, shown in Table 4, where CR is the compression rate, B is the number of quantization bits, UQ denotes uniform quantization, and NUQ denotes the proposed nonuniform quantization. The table shows that, since some precision is lost after quantization, the overall performance is lower than with the original 32-bit precision. Second, as discussed above, nonuniform quantization outperforms uniform quantization because the measurement vector is a weak signal. Moreover, with 6 quantization bits the performance is similar to that of the unquantized original CSANet, which demonstrates that the 32-bit precision of the neural-network-compressed codeword is redundant and can be replaced by 6 bits.

4.2.2. Performance Analysis

The feedback CSI serves as a beamforming vector. Let $\hat{\tilde{\mathbf{h}}}_n$ be the reconstructed channel vector of the $n$th subcarrier. If $\mathbf{v}_n = \hat{\tilde{\mathbf{h}}}_n / \|\hat{\tilde{\mathbf{h}}}_n\|_2$ is used as the beamforming vector, then the equivalent channel achieved at the UE side is $\tilde{\mathbf{h}}_n^H \hat{\tilde{\mathbf{h}}}_n / \|\hat{\tilde{\mathbf{h}}}_n\|_2$.

To measure the quality of the beamforming vector, we also consider the cosine similarity:
$$\rho = E\left\{ \frac{1}{N_c} \sum_{n=1}^{N_c} \frac{\big|\hat{\tilde{\mathbf{h}}}_n^H \tilde{\mathbf{h}}_n\big|}{\|\hat{\tilde{\mathbf{h}}}_n\|_2 \, \|\tilde{\mathbf{h}}_n\|_2} \right\}.$$
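A PyTorch sketch of this metric, assuming complex channel tensors of shape (batch, Nc, Nt):

```python
import torch

def cosine_similarity_rho(H, H_hat):
    """rho = E{ (1/Nc) * sum_n |h_hat_n^H h_n| / (||h_hat_n|| ||h_n||) }."""
    inner = (H_hat.conj() * H).sum(dim=-1).abs()          # |h_hat^H h| per subcarrier
    norms = (torch.linalg.vector_norm(H_hat, dim=-1)
             * torch.linalg.vector_norm(H, dim=-1))
    return (inner / norms).mean()                         # average over n and batch
```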

To sum up, we use the long-time cosine annealing algorithm to train the model for 2000 epochs. CSANet is compared with other methods in terms of NMSE, cosine similarity $\rho$, and floating-point operations (FLOPs), as shown in Table 5 and Figure 9, where the horizontal axis is the compression ratio and the vertical axis is the NMSE performance.

CSANet achieves the lowest NMSE and is significantly superior to the other neural network methods at all compression ratios. The average indoor and outdoor accuracies increase by 37.4% and 32.5%, respectively. The advantage is most obvious at high outdoor compression ratios, with an improvement of 37.8%. At the same time, in all environments, the cosine similarity of the proposed CSANet outperforms the other networks. At high compression rates, more features are first obtained by the convolutional network and then semantically analyzed by the attention network for high-precision reconstruction. Since the attention network contains two fully connected layers responsible for improving accuracy, its computational complexity is noticeably higher than that of the other networks, with average FLOPs reaching 44.48 M. However, compared with traditional models such as ResNet50 at 3.9 G FLOPs, CSANet is still a small model, so the additional computational overhead is acceptable. The core issue of CSI feedback remains the accuracy of the reconstructed CSI, so the FLOPs of this method are far from hindering practical use. Therefore, the convolutional self-attention network CSANet is better suited to the CSI reconstruction task.

In addition, the convergence rate of each model is simulated at an indoor compression ratio of 1/4, and the results are shown in Figure 10, where the horizontal axis is the training epoch and the vertical axis is the test-set loss. As the training epoch increases, the loss of CSANet clearly decreases faster: it adjusts the network parameters quickly and converges before the other methods, with the average convergence rate increased by 17.64%. This shows that the convolutional self-attention network outperforms purely convolutional models and is more suitable for the CSI reconstruction task.

5. Conclusion

Since convolutional networks cannot extract all CSI features, the reconstruction accuracy of CSI in FDD MIMO systems is low; a convolutional self-attention neural network method (CSANet) is therefore proposed in this paper. In this method, redundant convolutional layers are removed, attention blocks and a quantization module are added, and the long-time cosine annealing training method is introduced. The ablation experiments show that the self-attention network is more suitable for CSI reconstruction than the convolutional network, and that placing a convolutional network before the self-attention network helps it recover the CSI matrix. Furthermore, the CSI reconstruction task is a weakly fitting task, so the number of training epochs can be increased to achieve higher reconstruction accuracy. The experimental results show that CSANet converges faster from the beginning of training, with convergence speed increased by 17.64% on average. Compared with CsiNet, Lightweight CNN, CRNet, and CLNet, the average indoor and outdoor reconstruction accuracies are improved by 37.4% and 32.5%, respectively. Therefore, the CSANet proposed in this paper is better suited to CSI reconstruction tasks.

Data Availability

The data that support the findings of this study are available from the author upon reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study is supported by the National Natural Science Foundation of China under Grant 61931004.