Abstract
The proposed decision feedback receiver (DFR) is an end-to-end, data-driven iterative receiver whose performance gain is achieved through iterations. However, a mismatch between the training set and the test set exists in DFR training, which introduces performance degradation, slow convergence, and oscillation. On the other hand, deep unfolding with parameter sharing is a practical method to reduce the number of model parameters and improve training efficiency, but whether parameter sharing causes performance degradation is rarely considered. In this work, we discuss and analyze these two problems in a general way, and then introduce either a solution to each problem or the conditions under which the problem no longer exists. We give improvements that address the mismatch problem in the DFR and thereby propose an improved-DFR via deep unfolding. The improved-DFRs without and with parameter sharing, namely DFR-I and DFR-IS, are both developed with low computational and model complexity and can be executed in parallel. Besides, practical training tricks and a performance analysis covering computational and model complexity are given. In the experiments, the improved-DFRs outperform the DFR in various scenarios in terms of convergence speed and symbol error rate. The simulation results also show that, compared with DFR-I, the DFR-IS is easier to train, and its slight performance loss can be reduced by increasing model complexity.
1. Introduction
Machine learning (ML) or deep learning (DL) [1–5] is promising for wireless communications. The communication system is typically model-driven, and thus the corresponding design is derived from domain knowledge. Meanwhile, the ML-enabled approach is data-driven, and it learns to optimize the system design through data training. ML is applicable to end-to-end optimization problems in complex scenarios with inaccurate and/or intractable models [6, 7]. The neural network (NN) is a fundamental model of ML. As a universal function approximator, the NN has been widely studied in a range of optimization problems in wireless communications, such as signal detection [8–11], direction of arrival (DOA) estimation [12, 13], beam prediction [14], and resource allocation [15, 16]. Besides, the rapid development of massively parallel processing hardware with distributed memory guarantees that NN-based algorithms can be deployed and executed quickly and efficiently.
Deep unfolding [17] is proposed as a combination of data-driven NN-based learning approaches and model-based iterative algorithms, and it has been widely investigated in the wireless communication community [6, 18, 19]. In deep unfolding, the iterations are unfolded into a layer-wise structure analogous to a NN, and the model parameters across layers are untied to obtain learnable NN-like architectures. The resulting formulation makes an iterative algorithm capable of data learning, and thus better performance can be achieved [20]. In some cases where the signal dimension is high and no mathematical model is available, the iterative algorithms are developed in a data-driven manner; the scale of trainable parameters in an unfolded NN is then extremely large [21, 22] and the training is challenging. Parameter sharing is a practical technique that forces the parameters across different layers to be the same and thus reduces the number of trainable parameters [9, 22, 23]. Compared with the case without parameter sharing, the number of samples associated with each trainable parameter is increased by a factor equal to the number of layers. Therefore, the decrease in model complexity and the increase in sample complexity help to overcome the overfitting problem and also improve the learning efficiency. Intuitively, the reduction of parameters in the whole unfolded NN may lead to a decrease in model approximation quality. But strictly speaking, whether parameter sharing leads to performance degradation is rarely investigated.
Considering signal detection, Ye et al. presented initial results on deep learning for signal detection in orthogonal frequency-division multiplexing (OFDM) systems [8]. In that work, the authors exploit deep learning to handle wireless OFDM channels in an end-to-end manner. Different from existing OFDM receivers that first estimate channel state information (CSI) explicitly and then detect/recover the transmitted symbols using the estimated CSI, the proposed deep learning-based approach estimates CSI implicitly and recovers the transmitted symbols directly. Samuel et al. considered the use of deep neural networks in the context of multiple-input multiple-output (MIMO) detection; a model- and data-driven neural network architecture suitable for this detection task is proposed [24]. Furthermore, a MIMO detector specially designed by unfolding an iterative algorithm and adding some trainable parameters is proposed in [9]. Since the number of trainable parameters is much smaller than in the data-driven DL-based signal detector, the model-driven DL-based MIMO detector can be rapidly trained with a much smaller data set.
In [25], a point-to-point band-pass wireless communication system is considered. The transmitted symbols are mapped by m-ary phase position shift keying (MPPSK) modulation [26], and infinite impulse response (IIR) band-pass filters [27] are deployed at both the transmitter and the receiver, to shape the transmitted signal and suppress disturbances, respectively. However, the filters also introduce inter-symbol interference (ISI), waveform distortions, and nonwhite noise. To address these issues, a matched filter, an equalizer, and a demodulator are usually required at the receiver. The ISI becomes severe when the allocated bandwidth is narrowed, and equalization becomes the key issue in this receiver. However, equalizer design in such a system is challenging. First, the system is IIR, but typical equalizers are developed under the assumption that the system is finite impulse response (FIR) [28]. Second, the symbols are mathematically described as scalars in a typical equalizer, whereas the symbols modulated by MPPSK are formulated as vectors in the time domain. Third, the block-by-block system design is fussy. Due to these issues, the receiver design is intractable, and thus we proposed an end-to-end NN-based receiver which iteratively estimates the transmitted symbols, namely the decision feedback receiver (DFR). Simulation results showed that, after several iterations, the DFR had better detection performance than the receiver without feedback. However, we also found that the first solution of the DFR performed worse than the receiver without feedback, although theoretically they are the same. Besides, detection instability along the iterations occurred when the soft information of both posterior and previous adjacent symbols was utilized as feedback. These problems need to be solved.
In this work, we study the principle behind the performance degradation of the DFR in a general way. Based on the theoretical analysis, we propose an improved-DFR to address the existing problems in the DFR. Specifically, this work makes the following contributions:
(i) We consider a mismatch between the training set and the test set, and point out that the increased test error is caused by data missing and data divergence. We also give the upper and lower bounds of the test error. Then, we respectively introduce the solution that eliminates the error caused by data missing and the conditions under which the inherent data divergence no longer exists.
(ii) We consider training one single model on an integration of multiple training sets and point out that the increased test error is caused by data divergence. We also introduce the conditions under which the data divergence no longer exists. Based on these conclusions, under the sufficient model complexity assumption, we prove that data divergence does not exist in deep unfolding with parameter sharing. Furthermore, we use the Markov decision process (MDP) [29] to describe the iterative algorithm with insufficient model complexity, and the condition for the nonexistence of data divergence is given.
(iii) We point out that the existing problems in the DFR are introduced by the mismatch between the data sets. To address the mismatch problem, we further propose the improved-DFR via deep unfolding. The improvements include: the DFR is unfolded into several sub-NNs, and the intermediate solutions can be obtained during training; each sub-NN is trained with its own parameter, and the possibly existing data divergence is avoided; the soft information of both posterior and previous adjacent symbols is utilized as feedback, and the compromised method to guarantee stability becomes unnecessary.
(iv) The improved-DFR is an end-to-end data-driven receiver; it has low computational and model complexity and can be executed in parallel.
We also propose a clip technique to solve the numeric overflow problem in practical training. In the experiments, the improved-DFRs outperform the DFR in various scenarios, in terms of convergence speed and symbol error rate (SER). The simulation results also show that the DFR-IS is easier to train, and the slight performance loss can be reduced by increasing model complexity, in comparison to DFR-I.
The remainder of this study is organized as follows. The theoretical analysis, including the test error brought by the mismatch problem, the integration of multiple training sets, and parameter sharing, is given in Section 2. The DFR and the proposed improved-DFR are presented in Section 3. The simulation results of the improved-DFR with and without parameter sharing and of the DFR are demonstrated in Section 4. Conclusions are presented in Section 5.
2. Theoretical Analysis
2.1. Preliminary
We begin by establishing the models of the data sets. We assume that the potential mapping in a data set is a surjection, in which every element of the image space is the value of some member of the domain, and that the mapping satisfies Lipschitz continuity. The mapping in the training set is expressed as

$$y = f_{\text{train}}(x), \tag{1}$$

where the input vector $x$ is a random variable vector and the output vector is $y$. The probability density function (PDF) of $x$ is $p_{\text{train}}(x)$. We sample the input–output pairs of $f_{\text{train}}$ independently and identically, and obtain the training set $\mathcal{D}_{\text{train}} = \{(x_i, y_i)\}_{i=1}^{N_{\text{train}}}$, where the subscript “train” denotes the training set and $N_{\text{train}}$ is the number of training samples. Similarly, the mapping in the test set is defined as

$$y = f_{\text{test}}(x), \tag{2}$$

where $x$ is a random variable with PDF $p_{\text{test}}(x)$. We sample the input–output pairs of $f_{\text{test}}$ and obtain the test set $\mathcal{D}_{\text{test}} = \{(x_i, y_i)\}_{i=1}^{N_{\text{test}}}$, where the subscript “test” denotes the test set and $N_{\text{test}}$ is the number of test samples.
Second, we define a space $\Omega$, the intersection of the support sets of $p_{\text{train}}(x)$ and $p_{\text{test}}(x)$:

$$\Omega = \operatorname{supp}\big(p_{\text{train}}\big) \cap \operatorname{supp}\big(p_{\text{test}}\big). \tag{3}$$
We also define a space $\Omega_{\text{train}}$, the complement of the support set of $p_{\text{test}}(x)$ in the support set of $p_{\text{train}}(x)$, and similarly a space $\Omega_{\text{test}}$. The two spaces are written as follows:

$$\Omega_{\text{train}} = \operatorname{supp}\big(p_{\text{train}}\big) \setminus \operatorname{supp}\big(p_{\text{test}}\big), \qquad \Omega_{\text{test}} = \operatorname{supp}\big(p_{\text{test}}\big) \setminus \operatorname{supp}\big(p_{\text{train}}\big). \tag{4}$$
Figure 1 shows an illustration of the spaces $\Omega$, $\Omega_{\text{train}}$, and $\Omega_{\text{test}}$ with a one-dimensional variable $x$. Particularly, $\Omega$ is the intersection of the support sets of $p_{\text{train}}(x)$ and $p_{\text{test}}(x)$, $\Omega_{\text{train}}$ is the complement of the support set of $p_{\text{test}}(x)$ in the support set of $p_{\text{train}}(x)$, and $\Omega_{\text{test}}$ is the complement of the support set of $p_{\text{train}}(x)$ in the support set of $p_{\text{test}}(x)$.

2.2. Mismatch between Training Set and Test Set
In ML, the potential mappings in the training set and the test set are usually the same, i.e., $f_{\text{train}} = f_{\text{test}}$, and the inputs follow the same distribution, i.e., $p_{\text{train}}(x) = p_{\text{test}}(x)$. Using a sufficiently large training set, a parameterized model $\hat f$ of sufficient complexity can be sufficiently trained, and then we have $\hat f = f_{\text{test}}$. This indicates that the learned model has zero error on the test set. We first investigate the mismatch problem between the training set and the test set: when there exists a difference between the distributions of the data sets, i.e., $D_{\text{KL}}\big(p_{\text{test}} \,\|\, p_{\text{train}}\big) > 0$, where $D_{\text{KL}}$ is the Kullback–Leibler divergence [2], and possibly $f_{\text{train}} \neq f_{\text{test}}$, how will the learned model perform on the test set?
Without loss of generality, we use some distance function $d(\cdot, \cdot)$ as the error function, and the expected error on the test set can be expressed as

$$\varepsilon_{\text{test}} = \int p_{\text{test}}(x)\, d\big(\hat f(x), f_{\text{test}}(x)\big)\, \mathrm{d}x \le \varepsilon_{\Omega} + \varepsilon_{\text{miss}}, \tag{5}$$

where

$$\varepsilon_{\Omega} = \int_{\Omega} p_{\text{test}}(x)\, d\big(f_{\text{train}}(x), f_{\text{test}}(x)\big)\, \mathrm{d}x, \tag{6}$$

and

$$\varepsilon_{\text{miss}} = \int_{\Omega_{\text{test}}} p_{\text{test}}(x)\, d\big(\hat f(x), f_{\text{test}}(x)\big)\, \mathrm{d}x. \tag{7}$$
The upper bound of the test error is given in (5), and it indicates that the error is composed of two parts. First, in the space $\Omega$, the images of the same variable can be different, i.e., $f_{\text{train}}(x) \neq f_{\text{test}}(x)$; we call this inherent error data divergence, and it cannot be cancelled. The other part of the error comes from the input space $\Omega_{\text{test}}$, which only shows up in the test set and is absent from the training set. This error is caused by data missing, and the model cannot capture the proper mapping during learning. Particularly, data missing is different from the generalization error: by increasing the sample complexity, data divergence cannot be reduced, but the generalization error can be alleviated. However, samples which follow the distribution of the test set can be added to the training set, and the data missing can then be removed. We consider three cases. First, after proper training with the added samples, the upper bound of the test error can be reduced to $\varepsilon_{\text{test}} \le \varepsilon_{\Omega}$.
Second, when either of the following two conditions is satisfied:

$$f_{\text{train}}(x) = f_{\text{test}}(x), \quad \forall x \in \Omega, \tag{8}$$

$$\Omega = \varnothing, \tag{9}$$

and meanwhile the added samples are unavailable, then the upper bound of the test error is

$$\varepsilon_{\text{test}} \le \varepsilon_{\text{miss}}. \tag{10}$$
Third, when the learned model eliminates the data missing with the added samples, and the data set satisfies either condition (8) or (9), then $\varepsilon_{\text{test}} = 0$. According to the nonnegativity of the PDF and the distance function, this upper bound is also the lower bound.
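The distinction between data missing and generalization error can be illustrated with a small numerical sketch (a hypothetical setup, not from this work): a piecewise-constant regressor is fit on inputs covering only part of the test support, and the test error concentrates on the unseen region, while the covered region shows only a small generalization error.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin  # shared potential mapping: f_train = f_test, so no data divergence

# Training inputs cover only [0, 1]; test inputs cover [0, 2].
x_tr = rng.uniform(0.0, 1.0, 2000)
x_te = rng.uniform(0.0, 2.0, 4000)

# A simple piecewise-constant model: average of training outputs per input bin.
bins = np.linspace(0.0, 2.0, 41)
idx_tr = np.clip(np.digitize(x_tr, bins) - 1, 0, 39)
pred = np.zeros(40)
for b in range(40):
    m = idx_tr == b
    pred[b] = f(x_tr[m]).mean() if m.any() else 0.0  # unseen bins fall back to 0

idx_te = np.clip(np.digitize(x_te, bins) - 1, 0, 39)
err = np.abs(pred[idx_te] - f(x_te))

seen = x_te < 1.0                # region covered by the training set
err_seen = err[seen].mean()      # small: generalization error only
err_missing = err[~seen].mean()  # large: error caused by data missing
```

Adding samples drawn from the unseen region to the training set removes `err_missing`, in line with the reduced bound above.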
2.3. Integration of Multiple Training Sets
We have quantitatively analyzed the test error when the training set and the test set mismatch. In this subsection, we focus on the situation where multiple training sets, which follow different distributions, are integrated into a new training set. First, we consider two training sets, $\mathcal{D}_1$ and $\mathcal{D}_2$, whose potential mappings are $f_1$ and $f_2$, respectively. We assume that $\mathcal{D}_1$ is sampled with probability $\alpha$ in the integrated training set $\mathcal{D}$, and that the test set follows the same distribution as the training set; therefore, the expected errors on the training set and the test set are the same. The model is trained with $\mathcal{D}$, and the corresponding error can be written as

$$\varepsilon = \alpha \int p_1(x)\, d\big(\hat f(x), f_1(x)\big)\, \mathrm{d}x + (1-\alpha) \int p_2(x)\, d\big(\hat f(x), f_2(x)\big)\, \mathrm{d}x. \tag{11}$$
According to (11), for every $x$, the pointwise error is

$$\varepsilon(x) = \alpha\, p_1(x)\, d\big(\hat f(x), f_1(x)\big) + (1-\alpha)\, p_2(x)\, d\big(\hat f(x), f_2(x)\big). \tag{12}$$
Especially, according to the symmetry and the triangle inequality of $d$, for any candidate $\hat f(x)$ we have

$$d\big(\hat f(x), f_1(x)\big) + d\big(\hat f(x), f_2(x)\big) \ge d\big(f_1(x), f_2(x)\big). \tag{13}$$
Plugging (13) into (11), we can rewrite (11) as

$$\varepsilon \ge \int \min\big\{\alpha\, p_1(x),\, (1-\alpha)\, p_2(x)\big\}\, d\big(f_1(x), f_2(x)\big)\, \mathrm{d}x. \tag{14}$$
More generally, after sufficient training, the learned function can be achieved as

$$\hat f(x) = \begin{cases} f_1(x), & \alpha\, p_1(x) \ge (1-\alpha)\, p_2(x), \\ f_2(x), & \text{otherwise}. \end{cases} \tag{15}$$
Plugging (15) into (11), we have

$$\varepsilon \le \varepsilon_{\text{div}}, \tag{16}$$

where

$$\varepsilon_{\text{div}} = \int \min\big\{\alpha\, p_1(x),\, (1-\alpha)\, p_2(x)\big\}\, d\big(f_1(x), f_2(x)\big)\, \mathrm{d}x. \tag{17}$$
The upper bound of the error is given in (16). Data missing no longer exists in this situation, but data divergence arises from the integration. Similarly, when the data set satisfies condition (8) or (9), the upper bound and the lower bound of the error are both zero, i.e., $\varepsilon = 0$.
The above conclusions can be extended to $K > 2$ training sets. Given a set of mapping functions $\{f_1, \ldots, f_K\}$, the element mapping is expressed as

$$y = f_k(x), \quad k = 1, \ldots, K, \tag{18}$$

where we use $p_k(x)$ to denote the PDF of the input of $f_k$. When either of the following two conditions:

$$\operatorname{supp}(p_j) \cap \operatorname{supp}(p_k) = \varnothing, \quad \forall j \neq k, \tag{19}$$

$$f_j(x) = f_k(x), \quad \forall x \in \operatorname{supp}(p_j) \cap \operatorname{supp}(p_k), \ \forall j \neq k, \tag{20}$$

is satisfied, and we train a model of sufficient complexity on a sufficiently large training set, then the expected error on the data set reaches the lower bound, i.e., $\varepsilon = 0$.
In summary, data missing can be removed by adding the missing samples to the training set. Moreover, condition (19) or (20) guarantees that the mapping in the integrated training set is still a surjection, so the inherent data divergence no longer exists.
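The irreducible error caused by data divergence can be checked numerically. The sketch below is an illustrative construction (the constant mappings and the value of `alpha` are assumptions): two training sets whose mappings conflict on a shared support are integrated, and the best single predictor under an absolute-distance loss is left with exactly the floor $\min\{\alpha, 1-\alpha\}\, d(f_1, f_2)$.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.7            # sampling probability of training set 1 (assumed value)
n = 10000

# Two training sets share the same input support but map x to conflicting outputs.
x = rng.uniform(0.0, 1.0, n)
from_set1 = rng.random(n) < alpha
y = np.where(from_set1, 1.0, -1.0)   # f1(x) = +1, f2(x) = -1 on the overlap

# Under an absolute-distance loss, the best single prediction is the weighted
# majority output, leaving an irreducible "data divergence" error.
y_hat = 1.0 if from_set1.mean() >= 0.5 else -1.0
err = np.abs(y_hat - y).mean()

# Predicted floor: min(alpha, 1 - alpha) * d(f1, f2), with d = |.| and f1 - f2 = 2.
floor = min(alpha, 1 - alpha) * 2.0
```

No amount of extra training data lowers `err` below `floor`; only making the mappings agree on the overlap (or making the supports disjoint) removes it.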
2.4. Deep Unfolding
In deep unfolding, an iterative algorithm can be unfolded as follows:

$$s_{t+1} = g\big(y, s_t; \theta_t\big), \quad t = 0, 1, \ldots, T-1, \tag{21}$$

where $t$ is the iteration index, $y$ is the input, $s_t$ is the current iterative solution and also part of the input of the next iteration, and $\theta_t$ is the parameter of layer $t$. In the first iteration, $s_0$ is some initial value. The target solution of the iterative algorithm is $s^\ast$. Given an algorithm with $T$ iterations, the parameter set of the corresponding model is $\{\theta_0, \ldots, \theta_{T-1}\}$. In some cases, the parameters in different layers are shared, i.e., $\theta_0 = \theta_1 = \cdots = \theta_{T-1} = \theta$.
We discuss the training procedures with and without parameter sharing here. First, we are given a parameter set $\{\theta_0, \ldots, \theta_{T-1}\}$ and a set of training sets $\{\mathcal{D}_0, \ldots, \mathcal{D}_{T-1}\}$, where the element training set $\mathcal{D}_t$ is provided for parameter $\theta_t$. Without parameter sharing, there are $T$ layers, and each layer is trained on its own training set with respect to its own parameter. Second, we consider the training case with parameter sharing. According to the discussions in subsections 2.2 and 2.3, a single model is used to approximate multiple potential mappings, and thus the corresponding training set must include samples of all these mappings to eliminate data missing. Hence, an integrated training set is provided for the model with the shared parameter $\theta$. With parameter sharing, only one model is trained with respect to the shared parameter on the integrated training set.
Several benefits can be gained from parameter sharing. Assuming that the numbers of parameters in different layers are the same, the total number of model parameters is reduced by a factor of $T$, while the number of samples associated with each parameter is increased by a factor of $T$. The model complexity and parameter redundancy are reduced while the sample complexity is increased, and thus overfitting can be alleviated and the learning efficiency can be improved.
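The unfolded recursion and the parameter-count argument can be sketched as follows (a toy scalar contraction, not the receiver in this work): each "layer" applies the same update rule, so tying the per-layer step sizes leaves the output unchanged here while storing one parameter instead of $T$.

```python
# A minimal unfolded iteration: s_{t+1} = s_t + theta_t * (y - s_t),
# which contracts toward y when 0 < theta_t < 2.
def unfold(y, thetas, s0=0.0):
    s = s0
    for theta in thetas:           # one "layer" per iteration
        s = s + theta * (y - s)
    return s

T, y = 8, 3.0
untied = [0.5] * T                 # one parameter per layer
shared = [0.5] * T                 # the same value reused: parameter sharing

# Both reach the same solution here, but sharing stores 1 parameter instead of T,
# and the shared parameter is trained on T times as many (input, target) pairs.
out_untied = unfold(y, untied)
out_shared = unfold(y, shared)
n_params_untied, n_params_shared = T, 1
```

The sketch also hints at the open question discussed next: tying the parameters costs nothing when every layer's ideal update is the same mapping, which is exactly the surjection condition analyzed below.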
On the other hand, some negative issues can arise from parameter sharing. For example, the decrease in the number of parameters also leads to a decrease in model approximation quality. When we consider parameter sharing, the model complexity can be appropriately increased, because the union of the input supports across layers is larger than the support of any single layer. Moreover, there is always a trade-off between model approximation quality and computational efficiency. Normally, a proper NN scale can be determined through simulations. Most importantly, we consider the following problem: does the test error caused by data divergence exist in deep unfolding with parameter sharing?
2.4.1. Sufficient Model Complexity
According to the introduced formulation and training of deep unfolding, for any input $y$ and intermediate solution $s_t$ in any layer $t$, the target solution is the same, i.e., $s^\ast$. This indicates that, when the model complexity and the sample complexity are both sufficient, after sufficient training we have

$$g\big(y, s_t; \theta\big) = s^\ast, \quad \forall t \in \{0, \ldots, T-1\}. \tag{22}$$
Equation (22) shows that deep unfolding satisfies condition (20), and the potential mapping in the integrated training set is still a surjection. Therefore, the data divergence does not exist in deep unfolding with parameter sharing, and the corresponding upper bound of the test error is zero.
In fact, the assumption of sufficient model complexity is impractical. Due to the extreme complexity of the potential function, a practical parameterized model usually cannot approximate it with arbitrary precision. We therefore resort to iterative methods that approach the optimal solution step by step instead of reaching the optimum in a single step. Hence, data divergence may still exist due to insufficient model complexity.
2.4.2. Insufficient Model Complexity
We resort to the MDP to describe the iterative optimization in deep unfolding with insufficient model complexity. The MDP is a typical model in reinforcement learning [29], which concerns an agent interacting with an environment in the single-agent case. In each interaction, the agent takes an action by a policy using the observed state, then receives a feedback reward and an updated state from the environment. The agent aims to find an optimal policy that maximizes the cumulative reward over the continuous interactions.
The inference process of iterative algorithms can be described as an MDP, as shown in Figure 2. The agent policy is the model function $g(\cdot; \theta)$. At time $t$, the state is $(y, s_t)$, and the action produced by the policy (namely the model function) is $s_{t+1}$. The reward is a negative distance $-d(s_{t+1}, s^\ast)$, where $s^\ast$ denotes the optimal solution with respect to $y$. The environment then transfers to the updated state $(y, s_{t+1})$. Under this formulation, the iterative inference is consistent with the MDP.

We can prove the following: when the performance of the intermediate solutions is monotonically increasing, the model is a surjection, and thus the data divergence does not exist.
Proof. Given any input $y$, let $s_j$ and $s_k$ be intermediate solutions with $j < k$. Since the performance increases monotonically, we have $d(s_k, s^\ast) < d(s_j, s^\ast)$, and thus, according to the monotonicity of the distance function, $s_j \neq s_k$.
Given the solution set $\{s_1, \ldots, s_T\}$ whose elements satisfy $d(s_1, s^\ast) > \cdots > d(s_T, s^\ast)$, we thus have $s_j \neq s_k$ for all $j \neq k$. The mapping of any input $(y, s_t)$ is a unique $s_{t+1}$. Therefore, the mapping is a bijection on the solution trajectory, hence also a surjection, and the data divergence does not exist.
The above conclusion is derived without the assumption of sufficient model complexity. When the monotonic-improvement condition is not satisfied, oscillation may occur. Besides, when the condition is satisfied, the data divergence does not exist even in the situation where the model parameter is shared and the model complexity is limited.
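The role of the monotonic-improvement condition can be seen in a scalar sketch of the MDP view (the step sizes are assumed values): a moderate step shrinks the distance to the optimum at every iteration, while an overly large step overshoots, and the iterates oscillate around the optimum with growing distance.

```python
# Sketch of the iterative inference as an MDP (illustrative, not the DFR):
# state s_t, policy s_{t+1} = s_t + step * (s_star - s_t),
# reward -|s_{t+1} - s_star| (recorded here as the distance itself).
def distances(step, s0=0.0, s_star=1.0, T=10):
    s, d = s0, []
    for _ in range(T):
        s = s + step * (s_star - s)
        d.append(abs(s - s_star))
    return d

mono = distances(step=0.5)   # 0 < step < 1: distance shrinks every iteration
osc = distances(step=2.2)    # step > 2: overshoot, oscillation with growing distance

monotone = all(a > b for a, b in zip(mono, mono[1:]))
```

In the monotone case every state along the trajectory is distinct, matching the surjection argument above; in the oscillating case the iterates keep crossing the optimum and the per-state mapping is no longer well-behaved.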
3. Data-Driven Iterative Detection
3.1. System Model and Problem Formulation
As shown in Figure 3, a point-to-point band-pass wireless communication system is considered. The symbol sequence vector is produced with frame size $N_s$, and the symbols are modulated by m-ary phase position shift keying (MPPSK) [30]. The MPPSK signal of symbol value $k$ is given as

$$g_k(t) = \begin{cases} -\sin(2\pi t / T_c), & k N_1 T_c \le t < (k+1) N_1 T_c, \\ \sin(2\pi t / T_c), & \text{otherwise}, \end{cases} \quad 0 \le t < N_2 T_c, \tag{23}$$

where $T_c$ represents the carrier period, and $N_1$ and $N_2$ denote the numbers of carrier periods in each time slot and in each symbol, respectively. Apparently, we have $N_2 \ge M N_1$ and $T_s = N_2 T_c$, where $T_s$ denotes the symbol duration. Under the MPPSK modulation, the transmitted symbols $b_n \in \{0, \ldots, M-1\}$ are mapped into a base-band signal which can be written as

$$x(t) = \sum_{n} g_{b_n}(t - n T_s). \tag{24}$$

Then, $x(t)$ is up-converted by the carrier frequency and shaped by a band-pass IIR filter with impulse response $h(t)$, to form the transmitted radio-frequency signal $s(t)$. The additive white Gaussian noise (AWGN) channel is considered, and the additive noise is denoted by $w(t)$ with variance $\sigma^2$. At the receiver, the same band-pass IIR filter is used to denoise and suppress disturbances. The filtered received signal is obtained as

$$r(t) = h(t) * \big(s(t) + w(t)\big), \tag{25}$$

where $*$ represents the convolution operator. It can be seen that $r(t)$ is composed of two parts: the colored Gaussian noise $h(t) * w(t)$, which is filtered once, and the band-pass signal, which is filtered twice. The continuous signal $r(t)$ is then sampled at a sampling frequency that is usually a positive integer multiple of the carrier frequency, and finally we have the sampled signal. In this work, we investigate the detection problem: estimate the transmitted symbol sequence from the received sampled signal.
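A hedged sketch of the MPPSK mapping described above: the symbol value selects the time slot in which the carrier phase is inverted. All numeric parameters (`fc`, `N1`, `N2`, `M`, and the oversampling factor `Q`) are assumed values for illustration, not the system settings of this work.

```python
import numpy as np

fc = 1.0e4            # carrier frequency (assumed)
Tc = 1.0 / fc         # carrier period
N1, N2, M = 2, 10, 4  # carrier periods per slot / per symbol, modulation order (assumed)
Q = 8                 # samples per carrier period, i.e., sampling rate Q * fc (assumed)

def mppsk_symbol(k):
    """Waveform of symbol value k over one symbol duration N2 * Tc."""
    t = np.arange(N2 * Q) / (Q * fc)
    g = np.sin(2 * np.pi * fc * t)
    slot = (t >= k * N1 * Tc) & (t < (k + 1) * N1 * Tc)
    g[slot] = -g[slot]            # phase inversion marks the symbol position
    return g

def mppsk_modulate(symbols):
    """Concatenate per-symbol waveforms into the frame signal."""
    return np.concatenate([mppsk_symbol(k) for k in symbols])

frame = mppsk_modulate([0, 2, 3, 1])
```

Note that all symbol waveforms carry the same energy (the inversion changes only the phase), so the information sits in the slot position rather than the amplitude.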

The band-pass IIR filters are deployed at both the transmitter and the receiver to shape the transmitted signal and suppress disturbances, respectively. However, the filters also introduce ISI, waveform distortions, and nonwhite noise. To overcome these issues, a matched filter, an equalizer, and a demodulator are usually required. The receiver design is intractable, and thus we proposed an end-to-end data-driven approach.
3.2. Decision Feedback Receiver
In [25], we proposed an end-to-end NN-based receiver which iteratively estimates the transmitted symbols, and we called it the decision feedback receiver (DFR). There are $H$ hidden layers in the fully connected NN, and the number of neurons in the $h$th hidden layer is $n_h$. In the $i$th iteration, to detect symbol $n$, the input of the DFR includes a windowed received sampled signal $r_n$ and a prior information vector $q_n^{(i-1)}$, and the output is a conditional probability vector. The mapping of the DFR is expressed as

$$\hat p_n^{(i)} = F\big(r_n, q_n^{(i-1)}; \Theta\big), \tag{26}$$

where $\Theta$ denotes the model parameter. As shown in Figure 4, $r_n$ is a vector filtered by a rectangular window of length $L_w$, whose center is on symbol $n$. The prior information vector is composed of the conditional probability vectors of the $N_f$ previous and $N_f$ posterior adjacent symbols of symbol $n$. Therefore, the number of feedback symbols is $2N_f$, and the length of the prior information vector is $2N_f M$. The length of the DFR input is then given as

$$L_{\text{in}} = L_w + 2 N_f M. \tag{27}$$

Besides, the length of the DFR output is equal to $M$. The estimated symbol can be achieved by

$$\hat b_n^{(i)} = \arg\max_{m \in \{0, \ldots, M-1\}} \big[\hat p_n^{(i)}\big]_m. \tag{28}$$
In summary, the DFR utilizes the prior soft information of adjacent symbols, derived from the last estimation, to iteratively estimate the transmitted symbols.
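The DFR input assembly and the final hard decision can be sketched as follows (the array shapes, the edge-clamping rule, and all sizes are assumptions for illustration): for each symbol, a window of received samples is concatenated with the soft probability vectors of its neighboring symbols.

```python
import numpy as np

def dfr_input(r, probs, n, win, sym_len, Nf, M):
    """Build the DFR input for symbol n: windowed samples + neighbor soft info."""
    center = n * sym_len + sym_len // 2
    lo = center - win // 2
    idx = np.clip(np.arange(lo, lo + win), 0, len(r) - 1)   # clamp at frame edges
    window = r[idx]
    # Nf previous and Nf posterior neighbors; out-of-frame neighbors become zeros.
    neighbors = [probs[j] if 0 <= j < probs.shape[0] else np.zeros(M)
                 for j in list(range(n - Nf, n)) + list(range(n + 1, n + Nf + 1))]
    return np.concatenate([window] + neighbors)

M, Nf, sym_len, win, Ns = 4, 1, 16, 48, 8
r = np.random.default_rng(2).standard_normal(Ns * sym_len)
probs = np.full((Ns, M), 1.0 / M)        # first iteration: uninformative priors
x = dfr_input(r, probs, n=3, win=win, sym_len=sym_len, Nf=Nf, M=M)

# Once the NN produces a probability vector p_hat, detection is an argmax.
p_hat = np.array([0.1, 0.6, 0.2, 0.1])
b_hat = int(np.argmax(p_hat))
```

The input length is `win + 2 * Nf * M`, matching the formula for the DFR input length above.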
3.3. Existing Problems in DFR
Simulation results in [25] showed that, after several iterations, the DFR had better detection performance than the NN-based receiver without feedback. However, we also found that the first solution of the DFR performed worse than the receiver without feedback, although theoretically they are the same. This phenomenon is caused by the mismatch between the training set and the test set. In (26), the DFR obtains the estimated soft information derived from the last estimation. However, this soft information is unavailable in the training set and is replaced by hard information generated from the correct output. From the analysis in subsection 2.2, we know that the mismatch in the input will lead to performance degradation.
Owing to the mismatch in the input data between the training set and the test set, three sub-problems arise, which we summarize as follows:
(i) First, according to the analysis in subsection 2.2, data missing exists in the DFR due to the mismatch problem. When the band is narrow and the signal-to-noise ratio (SNR) is low, the mismatch becomes severe and the DFR detection performance on the test set is ruined.
(ii) Second, data divergence can exist due to the insufficient complexity of the DFR. There is only one single model in (26), and its parameter is shared in all iterations.
(iii) Third, when the soft information of both posterior and previous adjacent symbols is utilized, the detection becomes unstable and oscillation occurs. Therefore, we adopted an alternative method where only the soft information of previous adjacent symbols is used as feedback.
3.4. Improved-DFR
3.4.1. Framework of Improved-DFR
To overcome these existing problems, we unfold the DFR and adjust the training algorithm via deep unfolding, and the improved-DFR is proposed. As shown in Figure 5, the iterative algorithm is unfolded into $T$ sub-NNs, and the parameter of the $t$th sub-NN is $\theta_t$. All the sub-NNs have the same structure and scale. The activation functions in the hidden layers and the output layer are sigmoid and softmax, respectively. In the DFR, the detection is frame-by-frame, and thus we use tensors to describe the data. The first dimension of the tensor-like data refers to the sample number, which is the frame size in the DFR. The tensor is a matrix when each sample is a vector. The mapping of the $t$th sub-NN can be described as

$$\hat P_t = F\big(R, Q_{t-1}; \theta_t\big), \tag{29}$$

where $R$ is the matrix of windowed received samples and $Q_{t-1}$ is the prior information matrix.

We use $Q_{t-1}$ to denote the prior information input of the $t$th sub-NN. In the first iteration, $Q_0$ is initialized as a zero matrix.
In the proposed improved-DFR, Memory I transforms the received sampled signal into the input matrix $R$ and stores it. Besides, the output matrix $\hat P_t$ is transmitted to Memory II and then transformed into the prior information matrix $Q_t$. The matrix $Q_t$ is stored in Memory II and then transmitted to the next sub-NN as input. There are one Memory I and $T$ Memory IIs in the improved-DFR. In Algorithm 1, the detection algorithm of the improved-DFR is given in the symbol-wise form. In addition, the detection algorithm can be executed in parallel. The detailed hyper-parameters of the improved-DFR are given in Table 1.
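The detection loop of Algorithm 1 can be sketched in a frame-parallel form. The sub-NN below is a placeholder single dense-plus-softmax layer with random weights, not a trained model, and all sizes are assumptions; it only illustrates how each sub-NN consumes the received matrix and the prior-information matrix built from the previous soft outputs.

```python
import numpy as np

def softmax(Z):
    E = np.exp(Z - Z.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def sub_nn(Y, Q, theta):
    """Placeholder sub-NN: one dense layer + softmax over the M symbol values."""
    Z = np.concatenate([Y, Q], axis=1) @ theta
    return softmax(Z)

Ns, Ly, M, Nf, T = 8, 48, 4, 1, 3        # frame size, window length, order, feedbacks, sub-NNs
rng = np.random.default_rng(3)
Y = rng.standard_normal((Ns, Ly))        # Memory I: transformed received samples
thetas = [rng.standard_normal((Ly + 2 * Nf * M, M)) * 0.1 for _ in range(T)]

Q = np.zeros((Ns, 2 * Nf * M))           # first iteration: zero prior information
for theta in thetas:                     # each sub-NN has its own parameter (DFR-I)
    P = sub_nn(Y, Q, theta)              # soft outputs for the whole frame, in parallel
    prev = np.vstack([np.zeros((1, M)), P[:-1]])   # Memory II: shift soft info of
    post = np.vstack([P[1:], np.zeros((1, M))])    # previous / posterior neighbors
    Q = np.concatenate([prev, post], axis=1)

b_hat = P.argmax(axis=1)                 # final hard decisions
```

Every symbol row is processed simultaneously, which is the parallel execution property claimed for the detection algorithm.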
3.4.2. Training of Improved-DFR
First, the received sampled signal and the corresponding transmitted symbol sequence are transformed into the input matrix and the target output matrix, respectively. Given the initial $Q_0$ and the input, the improved-DFR serially obtains the solutions of the sub-NNs, yielding the solution set $\{\hat P_1, \ldots, \hat P_T\}$. According to the cross-entropy loss function, the error of one frame is expressed as follows:

$$\mathcal{L} = -\sum_{t=1}^{T} \sum_{n=1}^{N_s} \sum_{m=1}^{M} [P]_{n,m} \log \big[\hat P_t\big]_{n,m}, \tag{30}$$

where $P$ is the one-hot target matrix.
To minimize (30), the parameters are updated along the negative gradient direction during learning:

$$\theta_t \leftarrow \theta_t - \eta \nabla_{\theta_t} \mathcal{L}, \tag{31}$$

where $\eta$ denotes the learning rate. Especially, we use the Adam optimizer [31–33] to dynamically adjust the learning rate. In Algorithm 2, the training algorithm is given in the parallel processing form. We summarize the improvements of the improved-DFR as follows:
(i) During learning, the training set is dynamically generated, where the prior information is transformed from the last solution of the previous sub-NN. Therefore, the training data and test data follow the same distribution, and the error caused by data missing is eliminated.
(ii) Each sub-NN is trained with its own parameter, and the possibly existing data divergence caused by insufficient model complexity is avoided.
(iii) The soft information of both posterior and previous adjacent symbols is utilized as feedback, and the detection instability no longer exists.
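The training idea can be sketched with a toy numpy implementation: each sub-NN is reduced to a single dense-plus-softmax layer, the data are synthetic, and plain gradient descent stands in for Adam (all of these are assumptions for illustration). The key point it shows is that each sub-NN is trained on inputs generated dynamically from the previous sub-NN's soft outputs, so the training inputs follow the same distribution as at inference time.

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

Ns, Ly, M, T, lr, steps = 64, 12, 4, 3, 0.1, 500
labels = rng.integers(0, M, Ns)
onehot = np.eye(M)[labels]
# Synthetic "received" features: a class mean per symbol value plus noise.
Y = onehot @ rng.standard_normal((M, Ly)) + 0.3 * rng.standard_normal((Ns, Ly))

def neighbors(P):
    """Soft info of previous/posterior neighbors, zeros at the frame edges."""
    prev = np.vstack([np.zeros((1, M)), P[:-1]])
    post = np.vstack([P[1:], np.zeros((1, M))])
    return np.concatenate([prev, post], axis=1)

P = np.full((Ns, M), 1.0 / M)            # zero-information initial feedback
losses = []
for t in range(T):
    X = np.concatenate([Y, neighbors(P)], axis=1)   # dynamically generated inputs
    theta = 0.01 * rng.standard_normal((X.shape[1], M))
    for _ in range(steps):               # gradient descent on the cross-entropy
        Phat = softmax(X @ theta)
        theta -= lr * X.T @ (Phat - onehot) / Ns
    P = softmax(X @ theta)               # feed soft outputs to the next sub-NN
    losses.append(-np.mean(np.log(P[np.arange(Ns), labels] + 1e-12)))
```

Because the feedback fed to sub-NN $t$ is the actual soft output of the trained sub-NN $t-1$ rather than ideal hard labels, no train/test mismatch is introduced.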
3.4.3. Practical Tricks in Training
In practical training, the numeric overflow problem often occurs, especially when logarithmic or exponential calculations are involved. Take the DFR as an example: the target output of symbol $n$ is a one-hot vector whose $m$th element is $1$ and whose other elements are $0$. The $m$th output element of the model is

$$p_m = \frac{\exp(z_m)}{\sum_{j=1}^{M} \exp(z_j)}, \quad z_m = w_m^{\mathsf T} h + c_m, \tag{32}$$

where $w_m$ and $c_m$ are the parameters of the previous layer and $h$ is its activation. When the model approximates the $m$th output to be $1$, from (32) we know that $p_m \to 1$ only when $z_m - z_j \to \infty$ for all $j \neq m$. The parameters $w_m$ and $c_m$ will then sharply increase, which causes numeric overflow. To fix this issue, one simple and efficient trick is to clip the elements of the output by the following function:

$$\operatorname{clip}(p) = \min\big\{\max\{p, \zeta\},\ 1 - \zeta\big\}, \tag{33}$$

where $\zeta$ is a small value used in our training.
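A minimal sketch of the clip trick (the constant `ZETA = 1e-7` is an assumed value; the exact constant used in training is not specified here):

```python
import numpy as np

ZETA = 1e-7  # assumed small clipping constant

def clip_output(p, zeta=ZETA):
    """Element-wise clip of a probability vector into [zeta, 1 - zeta]."""
    return np.clip(p, zeta, 1.0 - zeta)

p = np.array([0.0, 1.0, 0.3])
q = clip_output(p)
loss_terms = -np.log(q)          # finite for every element after clipping
```

Clipping bounds the softmax outputs away from exactly 0 and 1, so the logarithm in the cross-entropy loss stays finite without noticeably changing the probabilities that matter for detection.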
On the other hand, overfitting often exists in practical training due to the imbalance between model complexity and sample complexity. We use a validation set to observe the overfitting and evaluate the DFR performance after each training episode, and the model with the smallest error on the validation set is saved and then executed on the test set.
3.4.4. Improved-DFR with Parameter Sharing
As discussed in subsection 2.4, parameter sharing can bring several benefits and is still promising. The intensity of the possibly existing data divergence usually depends on the specific problem, and it can be alleviated by increasing the complexity of the single model. The execution and training algorithms of the improved-DFR with parameter sharing can be regarded as a specific case of the improved-DFR without shared parameters: we change the parameter set $\{\theta_1, \ldots, \theta_T\}$ to the shared parameter $\theta$, and the improved-DFR with parameter sharing is derived. In short, we call the improved-DFR with parameter sharing DFR-IS, and the improved-DFR without parameter sharing DFR-I.
3.4.5. Complexity Analysis
First, we consider the computational complexity of receivers with various model scales. The computational complexity of the DFR and the improved-DFR is the same. The numbers of additions, multiplications, and exponentiations in each inference, per iteration and per symbol, are listed in Table 2. Both the computational complexity and the time cost are proportional to the number of iterations $T$. Meanwhile, all the symbols in one frame can be processed in parallel. Therefore, when parallel computation is fully utilized, the inference time cost is unchanged for various frame sizes.
Second, we consider the model complexity and storage complexity of the improved-DFR. The number of parameters and the amount of stored data are listed in Table 3. The stored data include the received sampled signal and its transformation in Memory I, and the DFR outputs and their transformations in all Memory IIs. It is noteworthy that the Memory IIs are virtual; in fact, there is only one physical Memory II in the DFRs.
4. Simulation Results
The parameters of the band-pass wireless communication system are listed in Table 4. The values and settings in Table 4 determine the number of carrier periods in each symbol and the order of the designed band-pass filter. Besides, the detailed hyper-parameters of the improved-DFR, including the number of feedback symbols, the length of the prior information vector, the length of the windowed received sampled signal, the resulting DFR input length, and the number of training samples, are listed in Table 1. The DFR-I and DFR-IS are tested in this section, and the DFR is regarded as a comparison. These values and settings are fixed in the following simulations and will be particularly mentioned if modified. The simulation results are averaged over repeated runs.
4.1. Model Complexity
First, we study the influence of the NN scale on the improved-DFR performance. We consider NNs with a single hidden layer of neurons and with two hidden layers of . The single hidden layer of neurons is the smallest possible NN, and we assume that an NN of scale has sufficient model complexity. We use the error function (33) to measure the training quality.
4.1.1. Error
As shown in Figure 6, the error on the training set is recorded over the training episodes. In general, the final error decreases as the model complexity increases, and the DFR-I has a smaller final training error than the DFR-IS. Besides, more complex NNs converge faster. These simulation results basically agree with the theoretical expectations. However, the final error of the NN with is larger than that with , and the DFR-I has a larger final training error than the DFR-IS when the NNs are , , and . We speculate that these anomalies are caused by overfitting when the model complexity is excessively high.

Meanwhile, the error on the validation set is recorded in Figure 7. As training proceeds, the validation errors of the NNs with , , and increase to varying degrees, which supports our speculation that overfitting occurs. The DFR-IS with obtains the lowest final validation error. The validation error also indicates that parameter sharing helps alleviate overfitting when the model is complex.
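The overfitting behavior observed in Figure 7 is the standard motivation for validation-based early stopping. The sketch below is a generic illustration on a toy stand-in problem, not the paper's training procedure: training halts once the validation error has failed to improve for a fixed number of episodes.

```python
def train_with_early_stopping(train_step, val_error, max_episodes, patience):
    """Generic early stopping: halt when validation error stalls for `patience` episodes."""
    best, best_episode, history = float("inf"), 0, []
    for episode in range(max_episodes):
        train_step()
        err = val_error()
        history.append(err)
        if err < best:
            best, best_episode = err, episode
        elif episode - best_episode >= patience:
            break  # no improvement for `patience` episodes: stop training
    return best, history

def make_toy_problem():
    """Toy stand-in: each step moves a parameter t; the validation error is
    convex in t, so it first falls and then rises, mimicking overfitting."""
    state = {"t": 10.0}
    def train_step():
        state["t"] -= 1.0
    def val_error():
        return (state["t"] - 4.0) ** 2
    return train_step, val_error

train_step, val_error = make_toy_problem()
best, history = train_with_early_stopping(train_step, val_error, 100, patience=2)
print(best, len(history))  # stops shortly after the validation minimum
```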

4.1.2. Detection
The learned models are then tested on the test set. After iterations, the SER curves are plotted in Figure 8. Generally, the SNR–SER performance of the DFR-I and the DFR-IS is similar, but the receivers using parameter sharing are slightly worse. All the SERs of the improved-DFRs are lower than when dB, except for the NN with . However, the smallest NN can still achieve when the SNR is increased to dB.

4.1.3. Complexity
High-speed processing is essential for wireless communication systems, so low complexity is preferred. The specific computation counts and parameter numbers of the different NNs are listed in Table 5. Generally, the computation counts are proportional to the total number of hidden neurons. Although the improved-DFR with the NN suffers about dB SNR loss to achieve , the corresponding computation cost and parameter number are extremely small. Taking into account the training difficulty, detection performance, and computation and model complexity, an improved-DFR with is adopted in the following simulations.
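The proportionality between computation and hidden-layer size is easy to verify symbolically. The helper below uses illustrative input/output widths and hidden-layer configurations (not those of Table 5) to count multiplications and parameters of a fully connected NN, so that candidate scales can be compared before any training.

```python
def mlp_complexity(n_in, hidden, n_out):
    """Multiplications per forward pass and parameter count of a fully
    connected NN with the given hidden-layer widths."""
    sizes = [n_in] + list(hidden) + [n_out]
    mults = sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))
    params = sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))
    return mults, params

# hypothetical configurations, from a very small single hidden layer upward
for hidden in [(4,), (16,), (64,), (32, 32)]:
    print(hidden, mlp_complexity(16, hidden, 2))
```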
4.2. Iteration Times
In this subsection, we study the relationship between the number of iterations and the detection performance, with dB. As shown in Figure 9, the SNR–SER curves of the DFR and the DFR-I after iterations are plotted. Generally, the SER of the DFR-I declines sharply in the first iterations, after which it barely declines further, indicating that the solution has converged. Meanwhile, the SER of the DFR still declines slowly after iterations, i.e., the corresponding solution converges slowly. In the high SNR region, the SER gap between the DFR-I and the DFR is larger than in the low SNR region. The SER performance of the DFR after iterations is comparable to that of the DFR-I after the first iteration. This indicates that the mismatch between the training set and the test set severely degrades the performance of the DFR. To achieve after iterations, the DFR requires an SNR of dB, while the DFR-I needs only about dB.

In summary, the improved-DFR addresses the mismatch problem between the training set and the test set in the vanilla DFR, and thus outperforms the DFR. In comparison to the DFR, the detection SER of the initial solution of the improved-DFR is greatly reduced. Besides, the improved-DFR converges faster and reaches a lower final SER.
On the other hand, we study the influence of parameter sharing on the detection performance of the improved-DFR after different numbers of iterations. As shown in Figure 10, the SNR–SER performance of the DFR-I is generally slightly better than that of the DFR-IS, which uses parameter sharing. When dB, no symbol errors are observed on the test set with the DFR-I after iterations, while the SER of the DFR-IS is around . To improve the detection performance of the improved-DFR with parameter sharing, the model complexity can be properly increased.

4.3. Band-Pass Bandwidth
In this subsection, the DFR, DFR-I, and DFR-IS are each trained and tested in scenarios with different band-pass bandwidths . The bandwidth set is MHz, whose elements correspond to scenarios with severe, normal, and slight ISI, respectively. Under these three bandwidths, the SNR ranges of the corresponding test sets are dB, dB, and dB, respectively. Meanwhile, the SNR ranges of the training sets are all dB.
The SNR–SER curves of the DFR-I after the first iteration, and of the DFR, DFR-I, and DFR-IS after iterations, are illustrated in Figure 11. As a baseline, the DFR-I after the first iteration is regarded as a normal NN-based receiver without feedback. In general, under different bandwidths, the DFR-I outperforms the other receivers, and the detection performance of the improved-DFRs is better than that of the DFR. When the bandwidth is reduced to MHz, the ISI is severe, and the SERs of all receivers drop slowly. When dB, the SER of the normal receiver without feedback is . The performance improvement of the DFR is poor, and its SER is only . In contrast, the SERs of the DFR-I and the DFR-IS after iterations are and , respectively. Compared with the DFR, the SER of the improved-DFR is reduced by about one order of magnitude.

Then, we turn to the scenario with MHz. As the bandwidth increases, the ISI is alleviated. The detection performance of the three receivers becomes closer, and the performance gain achieved by iterations is insignificant. However, the DFR-I still outperforms the other receivers. When MHz and dB, the SER of the DFR-I is , while the SERs of the DFR-IS and the DFR are around .
5. Conclusions
In general terms, we have quantitatively analyzed the increased test error caused by the mismatch problem between the training set and the test set, and the corresponding upper and lower bounds have also been given. We have pointed out that the increased test error is composed of data divergence and data missing, and we have provided a solution to eliminate the data missing as well as the conditions under which the data divergence no longer exists. The established analysis is further developed for the case where a single model is trained on an integral training set. We have proved that deep unfolding using parameter sharing incurs no data divergence given sufficient model complexity. Meanwhile, with the aid of the MDP model, we have given the condition under which the data divergence does not exist when the model complexity is insufficient. Based on the aforementioned analysis, we have studied the mismatch problem and the resulting sub-problems in the DFR. The improvements to solve these problems have been proposed, yielding the improved-DFR. The improved-DFR has low computation complexity and model complexity and can be executed by parallel processing. The simulation results show that the improved-DFR has a faster convergence speed and better final SER performance than the DFR. Moreover, the performance of the DFR-I and the DFR-IS is similar. In comparison to the DFR-I, the DFR-IS is easier to train, and its slight performance loss can be reduced by increasing the model complexity. In future work, we will focus on model- and data-driven methods for the DFR, to further reduce the training complexity of the data-driven DFR.
Data Availability
No data were used to support this study.
Conflicts of Interest
The authors declare that there are no conflicts of interest.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (grant no. 61976113).