Abstract

Image superresolution (ISR) is a popular research topic. With the success of deep learning, convolutional neural network-based ISR has made great progress in recent years. However, most state-of-the-art networks contain millions of parameters and hundreds of layers, which makes them difficult to apply in practice. To solve this problem, we propose a Wavelet Sparse Coding-based Lightweight Network for Image Superresolution (WLSR). Our contributions include four aspects. Firstly, to improve ISR performance, the WLSR exploits the advantages of wavelet sparse coding for ISR. Secondly, we take advantage of dilated convolution to expand the receptive fields, so the filters in WLSR can acquire more information from the input images than common convolutional filters. Thirdly, to deal with the sparse code efficiently, we employ deformable convolution networks to obtain convolutional kernels that concentrate on nonzero elements. Fourthly, to make the WLSR robust to uint8 quantization, we adopt the Clipped ReLU activation at the end of WLSR and balance SR performance against running time. Experimental results indicate that, compared with state-of-the-art lightweight models, the WLSR achieves exceptional performance with few parameters. Moreover, the WLSR contains 30 times fewer parameters than VDSR but performs better than VDSR on the validation set.

1. Introduction

Image superresolution (ISR) is a classic task in computer vision that recovers high-resolution (HR) images from low-resolution (LR) images. It is an ill-posed problem. Conventional approaches include interpolation-based models [1–6] and reconstruction-based models [7–9]. With the development of deep learning, many scholars have proposed various convolutional neural networks (CNNs) [10–28] for ISR.

Dong et al. first applied CNNs to the ISR problem in 2015 [11], employing only 3 convolutional layers to achieve the most outstanding results at the time. They also found that resource consumption can be reduced by upscaling the image in the final stage of the CNN. Hence, Dong et al. created FSRCNN in 2016 [13], which employed PReLU instead of ReLU as the activation and deconvolution as the upscaling operator. Shi et al. proposed the depth-to-space (sub-pixel) upscaling layer for real-time ISR, which is widely used nowadays. Later, VDSR [18] introduced more layers and parameters, showing that more parameters can improve CNN performance. Consequently, EDSR [19], WDSR [20], and RCAN [21] kept adding parameters and grew to enormous sizes. Although these methods present exceptional results on standard datasets, their tremendous parameter counts limit their application: due to limited memory and computational speed, these models are difficult to deploy on mobile devices.

For the lightweight problem in ISR networks, many scholars have provided important solutions [24–28]. IMDN [24] utilizes channel splitting and information distillation to build a lightweight CNN, and it shows that information distillation can help an ordinary CNN match or exceed a complex model. Going beyond previous achievements, NASNet [26] exceeds many remarkable methods by searching for an optimal architecture. In addition to the network structure, NASNet is also devoted to the quantization problem; in general, quantization depends less on the network structure than on the target hardware. Another approach [28] makes use of a pretrained complex model, channel sparsity, and pruning, so that complex models can be simplified by removing unnecessary channels. These solutions are successful in ISR, but the resulting models still cannot fulfill real-time requirements. The remaining problems of lightweight methods involve four aspects: (1) insufficiently optimized element-wise operations, (2) unoptimized reshape and transpose operations, (3) unsupported per-channel quantization, and (4) data exchange caused by numerous cross connections.

To sum up, there are two aspects worth studying. Firstly, conventional CNNs have too many parameters to be deployed on mobile devices. Secondly, data exchange slows down inference and limits quantization capability.

To reduce the number of parameters, we propose a Wavelet Sparse Coding-based Lightweight Network for Image Superresolution (WLSR). The contributions are as follows: (1) We take advantage of wavelet sparse coding to improve network performance. A Wavelet Sparse Coding Block (WSB) is designed in WLSR to handle image texture specifically, and the WLSR processes images efficiently via the WSB. (2) We utilize dilated convolution to expand the receptive fields; in general, the larger the receptive field, the better the performance. (3) We introduce the deformable convolution network (DCN [29]) to make the filters focus on the nonzero elements, so as to restore detailed information. (4) The WLSR adopts a lightweight design: by making use of group convolution, it uses few parameters yet behaves like a deep CNN. Besides, we employ the Clipped ReLU to facilitate quantization. Extensive experiments indicate that the WLSR achieves remarkable results with few parameters, and its uint8 quantization performance is also exceptional.

2. Literature Review

Shao et al. [30] introduced an effective sparsity-based algorithm, which is frequently used in the area of high-quality restoration. Later, Deeba et al. [31] proposed an updated sparse representation strategy for image superresolution. The authors claim that a properly selected dictionary can adequately represent an image block. Following this idea, a sparse representation is sought for every low-resolution input patch, and the coefficients of this representation are then used to generate the high-resolution output. Considering the large capacity of neural network models, researchers have focused intensively on deep learning-based solutions to the SISR problem.

2.1. Wavelet-Based Enhanced Medical Image SR Holistic Learning

According to Deeba et al. [32], such neural networks aid in learning conventional design features as well as improving several deep learning algorithms. While suitable quality is maintained, the computational expense of the upgraded Deep Neural Network (DNN) approaches is drastically decreased.

3. Wavelet Sparse Coding-Based Lightweight Network for Superresolution (WLSR)

In this section, we introduce the proposed WLSR. The intention of the design is to use wavelet sparse coding and state-of-the-art structures to build a lightweight network. Table 1 shows the parameter counts of some state-of-the-art CNNs. Clearly, with the development of ISR, model parameters have become more and more abundant, and the deployment of these CNNs is barely satisfactory. In contrast, our WLSR contains only 22K parameters while achieving outstanding performance. Thus, the WLSR is well suited to deployment on mobile devices.

3.1. Wavelet Sparse Coding Block

In orthogonal Multiresolution Analysis (MRA), the current approximation space $V_j$ can be decomposed into a coarser approximation space $V_{j+1}$ and a detail space $W_{j+1}$, as shown in the following equation:

$$V_j = V_{j+1} \oplus W_{j+1}.$$

In this case, the WSB is obtained as shown in Figure 1, where LL, LH, HL, and HH denote the horizontal low-pass/vertical low-pass, horizontal low-pass/vertical high-pass, horizontal high-pass/vertical low-pass, and horizontal high-pass/vertical high-pass components, respectively. In the WSB, we choose the Haar wavelet coefficients.

Conversely, the reconstruction is the inverse of the encoding process and is called the Wavelet Reconstruct Block (WRB). The WRB and WSB share the same wavelet coefficients.
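As a concrete illustration, the sketch below implements the WSB and WRB with fixed orthonormal Haar filters realized as strided convolutions in PyTorch; the single-channel input and the exact filter normalization are assumptions made for clarity, not the paper's verbatim implementation.

import torch
import torch.nn.functional as F

# Orthonormal 2D Haar analysis filters, stacked as LL, LH, HL, HH (shape: 4 x 1 x 2 x 2).
h = 0.5
haar = torch.tensor([
    [[ h,  h], [ h,  h]],   # LL: horizontal low-pass, vertical low-pass
    [[ h,  h], [-h, -h]],   # LH: horizontal low-pass, vertical high-pass
    [[ h, -h], [ h, -h]],   # HL: horizontal high-pass, vertical low-pass
    [[ h, -h], [-h,  h]],   # HH: horizontal high-pass, vertical high-pass
]).unsqueeze(1)

def wsb(x):
    # Wavelet Sparse Coding Block: one image channel -> four subband channels.
    return F.conv2d(x, haar, stride=2)

def wrb(y):
    # Wavelet Reconstruct Block: the inverse transform with the same coefficients.
    return F.conv_transpose2d(y, haar, stride=2)

x = torch.rand(1, 1, 8, 8)
assert torch.allclose(wrb(wsb(x)), x, atol=1e-6)  # perfect reconstruction

Because the filters are orthonormal, the WRB reuses exactly the same coefficients as the WSB, matching the statement above.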

3.2. Deformable Convolution Block

In this subsection, we introduce the Deformable Convolution Block (DCB) to deal with the wavelet sparse code. The DCB consists of deformable convolutional networks (DCNs), of which there are two versions, namely, DCNv1 and DCNv2 [29, 33].

DCNv1 is a deformable kernel that uses one or more convolutions to learn offsets. Compared with standard convolution, its sampling points are distributed more flexibly over the receptive field. Standard convolutional kernels have a rectangular shape, so the extracted features are limited. In contrast, the DCN samples over an arbitrary shape, as shown in the following equation:

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n),$$

where $p_0$ denotes a point in the feature map, $p_n$ enumerates the positions of the regular grid $\mathcal{R}$, $\Delta p_n$ is the learned offset, and $w$, $x$, and $y$, respectively, denote the kernel weights, the input feature values, and the output feature map.

DCNv1 simply adds offsets, whereas DCNv2 additionally introduces modulation, i.e., a learned weight in $[0, 1]$ at every sampling position. We use DCNv2 in our WLSR.

The utilization of DCNv2 in WLSR is shown in Figure 2. If we used standard convolution, there would be many zero elements in the receptive fields, so standard convolution wastes plenty of parameters. Deploying DCNv2 makes the filter weights focus on the nonzero elements and ignore the zero elements.
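The following sketch shows how a DCNv2-style layer can be applied to the sparse wavelet features: ordinary convolutions predict the offsets and modulation masks, and torchvision's deform_conv2d then samples the input at the learned positions. The channel sizes, kernel size, and initialization are illustrative assumptions rather than the exact WLSR configuration.

import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformBlock(nn.Module):
    def __init__(self, channels=16, k=3):
        super().__init__()
        self.k = k
        self.offset_conv = nn.Conv2d(channels, 2 * k * k, k, padding=k // 2)
        self.mask_conv = nn.Conv2d(channels, k * k, k, padding=k // 2)
        self.weight = nn.Parameter(torch.randn(channels, channels, k, k) * 0.1)
        nn.init.zeros_(self.offset_conv.weight)  # start from the regular sampling grid
        nn.init.zeros_(self.offset_conv.bias)

    def forward(self, x):
        offset = self.offset_conv(x)              # (N, 2*k*k, H, W) learned offsets
        mask = torch.sigmoid(self.mask_conv(x))   # (N, k*k, H, W) modulation in [0, 1]
        return deform_conv2d(x, offset, self.weight,
                             padding=(self.k // 2, self.k // 2), mask=mask)

y = DeformBlock()(torch.rand(1, 16, 32, 32))      # -> (1, 16, 32, 32)

Zero-initializing the offset branch makes the layer start as a standard convolution and gradually shift its sampling points toward the nonzero coefficients during training.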

Figure 3 gives the Grad-CAM [33] heat map of the HH component; the white points in Figure 3(c) denote high gradients. It can be seen that the DCN's weights concentrate on the texture. The effective values of the HH component also lie in the texture, which illustrates that DCNv2 works well on the wavelet sparse code.

3.3. WLSR Structure

In AlexNet, group convolution was proposed to cope with GPU memory limitations. Existing research has shown that properly used group convolutions improve accuracy while reducing computational cost, so group convolutions are widely used in mobile-oriented networks. Combining cascaded convolution and group convolution, the structure of WLSR is presented in Figure 4. To extract more information from the inputs, we utilize a dilated group convolution in the first layer. Then, the DBlocks process the extracted feature maps. After the upscaling operation, the HR images are obtained by the WRB.
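A minimal sketch of such a first layer is given below, assuming the four Haar subbands as input; the channel counts, group count, and dilation rate are illustrative assumptions. With dilation 2, a 3 x 3 kernel covers a 5 x 5 receptive field, and grouping reduces the weights from 576 to 144 in this configuration.

import torch
import torch.nn as nn

# Dilated group convolution: larger receptive field, fewer parameters.
first_layer = nn.Conv2d(in_channels=4, out_channels=16, kernel_size=3,
                        padding=2, dilation=2, groups=4, bias=False)

x = torch.rand(1, 4, 64, 64)                               # e.g., the four subbands from the WSB
print(first_layer(x).shape)                                # torch.Size([1, 16, 64, 64])
print(sum(p.numel() for p in first_layer.parameters()))    # 144 (vs. 576 for an ungrouped 3x3 conv)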

For ISR under device limitations, the shuffle operator is unavailable: because the reshape and transpose operators cannot be optimized on the deployment device, we do not use the shuffle block. Skip connections promote training convergence and allow deeper CNNs, but too many skip connections slow down inference, and the resulting element-wise operations cannot be optimized properly. Moreover, according to Sheng et al. [34], depthwise convolution increases quantization errors. As a result, we avoid all shuffle blocks and normalizations, so as to obtain a distinguished ISR effect.

When uint8 quantization is naively applied to a float32 or float16 SR model, we cannot obtain a distinguished uint8 model. To address this problem, the proposed model should be quantization-friendly; on the other hand, the quantization itself should be carried out efficiently.

Linear activations are widely used in SR models. For the wavelet sparse code, we restrict the values to the range from -510 to 510. After quantization, accuracy may otherwise drop by 5-7 dB compared with the float16/32 models. In the early training steps, the output values are not guaranteed to lie between -510 and 510, and the intermediate activations are also unbounded. Later in training, the training data indirectly enforce boundedness on the output. However, the intermediate activation layers are not affected by the Clipped ReLU; in these layers, the data may converge to values around 510. Such intermediate activations create outliers that nevertheless carry important information; when the quantization range is stretched to cover them, the relatively low values are rounded to zero. As a result, when these values are propagated to the Clipped ReLU, the activations result in dull colors and a drop in PSNR.

During training, we force the activation function not to visit these outliers, so as to remain quantization-friendly. Compared with the floating-point model, this strategy entails only a 0.2-0.5 dB drop. Table 2 shows the quantization performance; for comparison, the model is also trained without the Clipped ReLU, quantized, and validated on DIV2K. The Clipped ReLU is defined as

$$f(x) = \min(\max(0, x), c),$$

where $c$ denotes the clipping ceiling.

The Clipped ReLU is convenient to compute and incurs minimal computational burden. Although one Clipped ReLU is sufficient to affect the intermediate activation layers, its regularization effect may vanish as the model becomes deeper. The solution is to replace some of the ReLUs with Clipped ReLUs.
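A minimal sketch of the Clipped ReLU is shown below; the clipping ceiling of 510 mirrors the wavelet-code range discussed above, and the exact ceiling used in WLSR is an assumption.

import torch
import torch.nn as nn

class ClippedReLU(nn.Module):
    def __init__(self, ceiling=510.0):
        super().__init__()
        self.ceiling = ceiling

    def forward(self, x):
        # f(x) = min(max(0, x), c): zero below 0, linear in between, flat above the ceiling.
        return torch.clamp(x, min=0.0, max=self.ceiling)

act = ClippedReLU()
print(act(torch.tensor([-3.0, 12.0, 900.0])))    # tensor([  0.,  12., 510.])

Because the output is bounded to a fixed range, the uint8 quantizer can map it with a constant scale, which is what makes the activation quantization-friendly.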

The Clipped ReLU makes the WLSR harder to train; the reason can be traced to the flat regions at the clipping boundaries. In order to obtain an exceptional model, we need a dedicated training strategy, which is described in Section 4.2.

4. Experiment Results

In this section, we first introduce the test datasets, i.e., Set5 [36], Set14 [37], BSD100 [38], and Urban100 [39]. Then, we illustrate the training strategy. Finally, we present the comparison results.

4.1. Datasets

The training and testing are carried out on 5 benchmark datasets, i.e., Set5, Set14, BSD100, Urban100, and DIV2K. For training, we utilize DIV2K for 1000 epochs, which takes about 3 days. DIV2K contains 800 high-resolution training images, 100 validation images, and 100 test images. For a fair comparison, we test several state-of-the-art CNNs (FSRCNN [13], VDSR [18], ESPCN [40], and XLSR [41]) on the benchmark datasets Set5, Set14, BSD100, and Urban100. Set5, Set14, and BSD100 consist of natural scenes, while Urban100 contains images of challenging urban scenes with details in different frequency bands.

4.2. Training Strategy

For training, we randomly crop 32 × 32 LR patches and apply geometric transformations (original, rotated, and flipped) with equal probability to augment the data. Furthermore, for robustness to illumination, we randomly scale the brightness of each image by a factor of 1, 0.7, or 0.5. We employ the Charbonnier loss

$$\mathcal{L}(\hat{I}, I) = \sqrt{(\hat{I} - I)^2 + \varepsilon^2},$$

where $\varepsilon$ is a small constant; it is a smooth approximation of the L1 loss. Experimental results indicate that it works well with the Clipped ReLU.
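A minimal sketch of this loss is given below; the value of the constant eps is an assumption, since the exact value is not reproduced here.

import torch

def charbonnier_loss(pred, target, eps=1e-6):
    # Smooth approximation of the L1 loss: sqrt((pred - target)^2 + eps^2).
    return torch.mean(torch.sqrt((pred - target) ** 2 + eps ** 2))

loss = charbonnier_loss(torch.rand(1, 1, 8, 8), torch.rand(1, 1, 8, 8))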

In order to ensure convergence, the following tricks are used:
(1) We utilize a triangular cyclic learning rate schedule (a sketch is given after this list). The learning rate starts from its initial value and increases to its peak within the first 50 epochs; it then decreases slowly until epoch 5000, where it reaches its minimum.
(2) The WLSR is trained for 5000 epochs; every epoch includes 100 minibatches, and the batch size is 16. At the end of each batch, we validate the PSNR on Set5 and save the best model for quantization.
(3) The exponential decay factors of Adam are set to fixed values.
(4) The convolutional layers are initialized with random values of 0.1 variance, so that the initial outputs stay close to 0 and avoid the outliers.
(5) We use the standard post-training quantization strategy and train the WLSR on an NVIDIA RTX1070 GPU. The training process is shown in Figures 5 and 6.
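The sketch below illustrates the learning-rate schedule of item (1); the base, peak, and final learning rates are hypothetical placeholders, since the exact values are not reproduced here.

import torch

def triangular_lr(epoch, warmup=50, total=5000,
                  base_lr=1e-4, peak_lr=1e-3, final_lr=1e-6):
    if epoch < warmup:                            # linear warm-up to the peak
        return base_lr + (peak_lr - base_lr) * epoch / warmup
    t = (epoch - warmup) / (total - warmup)       # slow decay after the peak
    return peak_lr + (final_lr - peak_lr) * t

model = torch.nn.Conv2d(4, 16, 3, padding=1)      # stand-in for the WLSR
optimizer = torch.optim.Adam(model.parameters())
for epoch in range(5000):
    for group in optimizer.param_groups:
        group["lr"] = triangular_lr(epoch)
    # ... run 100 minibatches of size 16, then validate PSNR on Set5 ...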

4.3. Bicubic Downsampling

Bicubic downsampling is a widely used synthetic degradation. We downsample each HR image by bicubic interpolation, which computes every LR pixel as a weighted average over a neighborhood of HR pixels. The quantitative comparisons are given in Table 3, and a selection of SR images is shown in Figures 7 and 8.
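For reference, a sketch of this downsampling step with Pillow is shown below; the scale factor and file name are illustrative assumptions.

from PIL import Image

scale = 3
hr = Image.open("0805.png")                       # an HR image, e.g., from DIV2K
lr = hr.resize((hr.width // scale, hr.height // scale), Image.BICUBIC)
lr.save("0805_lr.png")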

In Table 3, our WLSR achieves the best results among the compared models while using the fewest parameters. In particular, the WLSR uses thirty times fewer parameters than VDSR but achieves higher SR results. From Table 3, it can be seen that XLSR and VDSR perform best after our WLSR, and VDSR is the second highest performing method overall. On Set5, WLSR is 0.01 dB higher than the second place, 0.2 dB higher on Set14, and 0.02 dB higher on BSD100. Among the compared methods, XLSR performs better on Set14 and Urban100; WLSR is 0.12 dB higher than XLSR on Set14 and 0.04 dB higher on Urban100. Meanwhile, on Set5 and BSD100, WLSR is 0.11 dB and 0.01 dB higher than the second-place VDSR, respectively.

From the visual results in Figures 7 and 8, WLSR restores more detailed texture and sharper outlines. In image “0805,” the eyes restored by the compared methods are fuzzy, whereas WLSR reconstructs some of the texture of the wolf’s eyes. In image “0823,” the compared CNNs cannot recover clear building contours, while WLSR reconstructs part of the house contours and the spots on the stone.

5. Conclusion

Image superresolution is a popular research topic. For this task, we propose a Wavelet Sparse Coding-based Lightweight Network for Superresolution (WLSR). Firstly, we utilize wavelet sparse coding to design the WSB. In addition, we take advantage of dilated convolution to expand the receptive fields, so as to extract more information from the input LR images. Furthermore, the WLSR employs deformable convolutions, whose nonrectangular kernels concentrate on the nonzero sparse codes; ablation studies indicate that the deformable convolutions handle the sparse code properly. Finally, the WLSR adopts an extremely lightweight structure to accomplish SR tasks, so our model achieves distinguished results with few parameters. Extensive experiments show that the WLSR contains only 1/30 of the parameters of VDSR yet achieves better performance. Although the 4 state-of-the-art CNNs use more parameters, their reconstruction results are not better than those of WLSR; the WLSR restores more texture and details. Besides, the WLSR quantizes well to uint8, with only a 0.28 dB PSNR decrease.

There are other wavelet transforms as well; however, we are only dealing with wavelet sparse coding in this paper. The proposed research can be expanded in the future by using other wavelet transforms, such as the dual-tree complex wavelet transform and the multiresolution discrete wavelet transform.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

We declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 61971328.