Abstract
Rain removal methods based on convolutional neural networks (CNNs) have developed rapidly. However, the convolution operation has two drawbacks: a limited receptive field and weights that do not adapt to the input content. Recently, another neural network architecture, the Transformer, has shown excellent performance in natural language processing and high-level vision tasks by modeling global relationships, but the Transformer has limitations in capturing local dependencies. To address these limitations, we propose a combination network of CNN and Transformer, which exploits the complementary advantages of the two structures to complete the image restoration task. We use a CNN to provide a preliminary output and adopt a Transformer architecture to further refine it. In addition, through several key designs in the connections between modules, our model strengthens feature propagation and encourages feature reuse, allowing better information and gradient flow. Experimental results show that, compared with existing methods, our method removes the rain lines more thoroughly and achieves state-of-the-art results. The experiments also demonstrate that the CNN structure can be effectively combined with the Transformer to fully exploit the strengths of the different structures.
1. Introduction
In recent years, high-level computer vision tasks such as image classification [1], object detection [2], and object tracking [3] have made great progress and are widely used in real life, for example in intelligent monitoring and autonomous driving. However, the performance of these models degrades seriously under bad weather such as rain, snow, and fog, so it is important to find ways to acquire high-quality images in such conditions. In this paper, we address the problem of removing rain from a single image. The imaging model of a rainy scene can be simply formulated as a linear combination of the rain-free background image B and the rain streak layer R:

O = B + R,

where O represents the raw data captured by the camera. The rain removal task is to separate the rain-free image B from O, as shown in Figure 1. This is an ill-posed problem, because the same O can be generated by different (B, R) pairs. Therefore, how to obtain a high-quality rain-removed image is an important problem in computer vision.

Image rain removal is a very active topic. Some methods [4–6] focus on removing rain streaks from video. Such methods make full use of the temporal relationship between consecutive frames. For single-image rain removal, however, there is no temporal sequence to exploit and only spatial context information is available, so the task is more challenging. Single-image rain removal has evolved from model-driven to data-driven approaches. Model-driven methods can be subdivided into filter-based methods and prior-based methods. In filter-based methods, represented by [7, 8], researchers achieved preliminary rain removal by analyzing the frequency characteristics of rain streaks and background and designing filters with specific structures or weights. Prior-based methods exploit mathematical modeling and analysis techniques, such as morphological component analysis [9], sparse coding [10], dictionary learning [11], and GMM priors [12], to distinguish rain streaks from the background. However, these methods share common drawbacks, including high computational complexity, long running time, and incomplete rain removal.
With the proposal and rapid development of convolutional neural networks, data-driven methods have shown impressive results in various computer vision fields, and single-image rain removal with deep learning has received widespread attention. These methods focus on designing various deep neural networks. Inspired by ResNet, a deep detail network [13] was proposed to remove the high-frequency rain content, together with a large-scale synthetic dataset of rainy/rain-free image pairs. Multilevel [14, 15] and multistream [16] network structures have been proposed to learn multiscale information about the rain layer. Owing to the powerful learning ability of generative adversarial networks (GANs), several GAN-based methods [17, 18] have also been proposed for rain removal. Recently, a series of new methods [19–21] greatly improved model performance. A recurrent strategy [19] completes the rain removal task progressively, introducing a recursive layer to exploit the dependence between deep features of different stages. Focusing on the optimization process of the model, a model-driven deep neural network [20] offers a fully interpretable network structure. MPRNet [21] proposed a multistage architecture that progressively learns the restoration function, decomposing the overall recovery process into more manageable steps. These methods take a CNN as the backbone and convolution as the basic operation. With its local connectivity and translation invariance, convolution is well suited as a feature extractor for image data. However, convolution still has some problems. First, the convolution operator has a limited receptive field: a pixel can only aggregate information from its surrounding pixels and cannot model long-range dependencies. Second, convolution uses static weights, so the interaction between the image and the convolution kernel is independent of the image content; applying the same kernel to restore different image regions may not be the best choice. Because of these limitations, pure CNN architectures cannot achieve the ideal rain removal effect.
To break through the limitations of convolution, an ideal approach is the self-attention (SA) mechanism, the core component of the Transformer [22]. The Transformer performs well in natural language processing, and since ViT [23] introduced it into vision tasks, its potential has been actively explored. SA models the relationship between pixels by computing a correlation matrix with all other positions, so it obtains a global receptive field. In addition, the attention map is computed dynamically, because the correlation matrix depends on the input. For these reasons, the Transformer has advantages that CNNs lack. However, local context information is also very important for image restoration, because the neighborhood of a degraded pixel can be used to restore its clean version, and previous work shows that the Transformer has limitations in capturing such local dependencies.
For the rain removal task, the model requires both global information to know where the rain streaks are and detailed local information to restore them. However, a single CNN or Transformer structure does not possess both properties. If both structures are included in one model, it can capture local dependencies to improve the inference of content and global information to improve the inference of location. In this paper, inspired by the progressive strategy of [21], we propose a combination network of CNN and Transformer (CNCT). Specifically, our rain removal network consists of two subnetworks: Net-C and Net-T. Net-C, the first stage of the network, takes a CNN as its backbone and adopts a single-scale pipeline to provide spatially accurate output. Net-T, the second stage, takes the Transformer as its backbone; it receives the deep features output by the first stage and uses the attention mechanism to capture global contextual interactions and further refine semantic details. We show that combining these two design choices in a multistage architecture is effective for image restoration.
In addition, we show that simply passing the final output of the first stage to the second stage does not yield the best results. Therefore, each basic unit of Net-T contains not only an SA module but also a cross-attention (CA) module, a cross-stage attention mechanism that propagates semantic features from the early stage to the later one. This design also simplifies the information flow between stages and effectively stabilizes the optimization of the multistage network.
The main contributions of this paper are as follows:
(1) CNN is good at capturing local dependencies but has a limited receptive field, while the Transformer is the opposite. For the rain removal task, the model requires both global information to know where the rain streaks are and detailed local information to restore them. We therefore propose a new multistage method combining CNN and Transformer, which generates contextually rich and spatially accurate outputs.
(2) To strengthen feature propagation, encourage feature reuse, and avoid information loss, we propose a cross-stage attention mechanism that aggregates the features of different stages.
(3) We demonstrate the effectiveness of our CNCT on multiple synthetic and real-world datasets and provide detailed ablation and qualitative results.
2. Related Work
In this section, we briefly review the network structures used in the proposed method. Specifically, we introduce the applications of CNN and Transformer in recent years.
2.1. CNN Structure
In the past decade, neural networks, especially CNNs, have made great progress and exerted great influence [24]. Although back-propagation-trained networks were proposed in the 1980s, neural networks did not become the focus of attention until AlexNet [25] won the ImageNet competition in 2012. Since then, CNNs have achieved great success in image processing and computer vision, and representative networks such as VGGNet [26], Inception [27], ResNe(X)t [28, 29], DenseNet [30], MobileNet [31], and EfficientNet [32] have been proposed. They focus on different aspects of accuracy, efficiency, and scalability and have promoted many useful design principles. It is not accidental that CNNs are suitable for image processing: shared convolution kernel parameters and sparse interlayer connections enable CNNs to learn grid-structured features with less computation and stable performance. Specifically, convolution has a strong ability to extract features from images and is translation invariant, so it can recognize similar features at different spatial positions; used in a sliding-window manner, its computation is shared, which also makes CNNs efficient. Because of these characteristics, CNNs are widely used in computer vision applications such as image classification [1], object detection [2, 33, 34], object tracking [35], semantic segmentation [36], image inpainting [37], image restoration [21, 38], and image generation [39].
2.2. Vision Transformers
The Transformer [22] has shown remarkable performance in natural language processing. Unlike the local perception of CNNs, Transformer-based networks capture long-range dependencies in the input data by computing a global attention matrix, which has also inspired computer vision researchers. ViT [23] uses a pure Transformer structure and, with large-scale pretraining, achieves better image classification results than state-of-the-art CNNs. Subsequently, the Transformer was applied to high-level vision tasks such as object detection [40] and image segmentation [41]. The remarkable characteristic of these models is their strong ability to learn long-range dependencies between image patch sequences and their adaptivity to the given input content. Despite these explorations, the introduction of the Transformer into low-level vision is still underexplored, because its complexity grows quadratically with the spatial resolution. One potential approach is the Swin Transformer [42], which restricts the attention computation to local windows; however, such methods cannot obtain a global receptive field for image restoration, which is contrary to the original intention of using the Transformer. Restormer [43] proposes a Transformer model that learns long-range dependencies while maintaining computational efficiency. The Transformer model used in this paper follows the Restormer paradigm, which has proved effective for image restoration.
3. Methods
In this section, we describe the proposed progressive rain removal network shown in Figure 2. The overall processing procedure is summarized in Algorithm 1. The network consists of two subnetworks: Net-C and Net-T. Net-C takes a CNN as the backbone, and Net-T takes the Transformer as the backbone. Each unit of Net-T receives the output of the corresponding Net-C unit as well as that of the previous Net-T unit as input. Next, we introduce the components of the proposed method in detail.

Algorithm 1: The overall processing procedure of the proposed CNCT network.
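Since the original algorithm box is not reproduced above, the following PyTorch-style sketch illustrates the two-stage data flow it describes. NetCUnit, NetTUnit, and Restore refer to the module sketches given with Sections 3.1 and 3.2 below, and the exact wiring (feeding the final Net-C feature into Net-T, and adding the restored residual to the rainy input) is our reading of Figure 2 rather than the official implementation.

```python
import torch
import torch.nn as nn

class CNCT(nn.Module):
    """Minimal sketch of the two-stage CNN + Transformer pipeline (not the official code)."""
    def __init__(self, channels=48, patch=2, num_units=5):
        super().__init__()
        # Patch embedding: a p x p convolution with stride p (Section 3.1)
        self.embed = nn.Conv2d(3, channels, kernel_size=patch, stride=patch)
        # Stage 1 (Net-C): CNN units; Stage 2 (Net-T): Transformer units
        self.net_c = nn.ModuleList([NetCUnit(channels) for _ in range(num_units)])
        self.net_t = nn.ModuleList([NetTUnit(channels) for _ in range(num_units)])
        # Restoration heads map deep features back to full-resolution images
        self.restore_c = Restore(channels, patch)
        self.restore_t = Restore(channels, patch)

    def forward(self, rainy):
        f = self.embed(rainy)
        c_feats = []                          # per-unit Net-C features, reused by cross-attention
        for unit in self.net_c:
            f = unit(f)
            c_feats.append(f)
        out1 = self.restore_c(f, rainy)       # stage-1 (preliminary) derained image
        g = f                                 # Net-T starts from the deep features of stage 1
        for unit, y in zip(self.net_t, c_feats):
            g = unit(g, y)                    # each Net-T unit attends to its Net-C counterpart
        out2 = self.restore_t(g, rainy)       # stage-2 (refined) derained image
        return out1, out2
```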
3.1. Net-C
The architecture of Net-C is shown in the upper half of Figure 2. We will introduce the process of Net-C in detail.
First, the input image I is processed by a convolution whose kernel size and stride are both p and whose number of output channels is C. In this process, each non-overlapping p × p patch of pixels is mapped from image space to the initial feature maps F_0 of size C × (H/p) × (W/p):

F_0 = Conv_{p×p}(I).
Dividing the image into patches does not change the image content itself but splits the original large image into small patches; the spatial resolution of the feature maps becomes 1/p of the original in each dimension. This operation greatly improves processing efficiency. In our implementation, we use p = 2 and C = 48.
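As a minimal illustration of this patch-embedding step (the variable names and the test tensor are ours, not the paper's):

```python
import torch
import torch.nn as nn

# Patch embedding: a p x p convolution with stride p maps each non-overlapping
# p x p pixel patch to a C-dimensional feature vector (p = 2, C = 48 in the paper).
patch_embed = nn.Conv2d(in_channels=3, out_channels=48, kernel_size=2, stride=2)

rainy = torch.randn(1, 3, 112, 112)   # a 112 x 112 training patch (see Section 4)
f0 = patch_embed(rainy)
print(f0.shape)                       # torch.Size([1, 48, 56, 56])
```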
Then, F_0 is fed into the sequence of basic units of Net-C, which adopts a densely connected pattern. Dense connection is an efficient architecture because it enhances the transmission of feature streams. A Net-C unit is composed of a series of dense blocks (DB), as shown in Figure 3. The l-th DB concatenates the aggregated feature maps of the previous l − 1 DBs and compresses them to C channels:

X_l = Conv_{1×1}([A_1, A_2, ..., A_{l−1}]),

where [A_1, A_2, ..., A_{l−1}] denotes the concatenation of the aggregated feature maps produced by the previous DBs. We directly use a 1 × 1 convolution to compress the channels, which greatly reduces the parameters of the DB. The compressed feature maps are then concatenated with the output of the previous DB and compressed again to obtain the aggregated feature maps of the current DB:

A_l = Conv_{1×1}([X_l, O_{l−1}]),

where O_{l−1} is the output of the (l − 1)-th DB.
This step also uses a 1 × 1 convolution to reduce the number of parameters. The aggregated features A_l are then processed by a residual subnetwork, yielding the output O_l of the current DB.
The residual subnetwork consists of two 3 × 3 convolution layers with a GELU activation in between, where the first convolution expands the number of channels by a factor of 4 and the second restores it. Several DBs are stacked to form a Net-C unit, and K Net-C units are stacked to form the backbone of Net-C; F_0 is transformed into the final deep features as it passes through the Net-C units one after another. In our implementation, we use K = 5 units, and the numbers of DBs in the five units are (3, 3, 3, 3, 4).
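The dense block and Net-C unit described above can be sketched in PyTorch as follows. The 1 × 1 compression layers and the 4× residual expansion follow the text; seeding the first DB with the unit input (which has no preceding DBs) is an assumption on our part.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """One DB: compress the concatenation of earlier aggregated features (X_l),
    fuse with the previous DB output (A_l), then apply a residual conv branch."""
    def __init__(self, channels, num_prev):
        super().__init__()
        self.compress_in = nn.Conv2d(channels * num_prev, channels, 1)   # 1x1 compression -> X_l
        self.compress_agg = nn.Conv2d(channels * 2, channels, 1)         # 1x1 fusion -> A_l
        self.res = nn.Sequential(
            nn.Conv2d(channels, channels * 4, 3, padding=1),             # expand channels by 4
            nn.GELU(),
            nn.Conv2d(channels * 4, channels, 3, padding=1),             # restore channel count
        )

    def forward(self, prev_aggs, prev_out):
        x = self.compress_in(torch.cat(prev_aggs, dim=1))
        a = self.compress_agg(torch.cat([x, prev_out], dim=1))
        return a, a + self.res(a)      # (aggregated feature A_l, DB output O_l)

class NetCUnit(nn.Module):
    def __init__(self, channels=48, num_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [DenseBlock(channels, num_prev=i + 1) for i in range(num_blocks)]
        )

    def forward(self, x):
        aggs, out = [x], x             # the unit input seeds the first DB (assumption)
        for block in self.blocks:
            a, out = block(aggs, out)
            aggs.append(a)
        return out
```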
Finally, in the restoration part, we first use a set of convolution layers Conv_r to convert the number of channels of the deep features to 3p^2 and then use a PixelShuffle operation together with a residual connection to the input to obtain the rain-removed image B̂:

B̂ = I + PixelShuffle(Conv_r(F)),

where F denotes the final deep features.
In our implementation, Conv_r consists of two convolution layers, of which the first keeps the number of channels unchanged and the second performs the channel conversion.
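A minimal sketch of such a restoration head is given below. The 3 × 3 kernels and the residual connection to the rainy input are assumptions consistent with, but not confirmed by, the description above; the 3p² intermediate channel count is what PixelShuffle requires to produce a 3-channel full-resolution image.

```python
import torch.nn as nn

class Restore(nn.Module):
    """Map C-channel deep features at resolution (H/p, W/p) back to a 3-channel image at (H, W)."""
    def __init__(self, channels=48, patch=2):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),           # first layer keeps the channel count
            nn.Conv2d(channels, 3 * patch * patch, 3, padding=1),  # second layer converts to 3*p^2 channels
        )
        self.shuffle = nn.PixelShuffle(patch)                      # rearrange channels into spatial detail

    def forward(self, feats, rainy):
        return rainy + self.shuffle(self.proj(feats))              # residual w.r.t. the rainy input
```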
3.2. Net-T
The architecture of Net-T is shown in the lower half of Figure 2. It is composed of a series of Net-T units, as shown in Figure 4. We use two attention patterns: self-attention and cross-attention. After the attention computation, a feed-forward network (FFN) performs further feature transformation; this module uses two 3 × 3 convolution layers with a GELU activation between them. We add a LayerNorm (LN) layer after the SA, CA, and FFN modules, and every module is wrapped in a residual connection. The whole unit therefore consists of three steps:

X_1 = X + LN(SA(X)),
X_2 = X_1 + LN(CA(X_1, Y)),
X_3 = X_2 + LN(FFN(X_2)),

where X is the input of the unit and Y is the feature map output by the corresponding unit of the first stage.
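The three-step composition can be sketched as shown below, assuming the SelfAttention and CrossAttention modules given with Sections 3.2.1 and 3.2.2. GroupNorm with a single group is used here as a convenient LayerNorm substitute for 4-D tensors, which is an implementation assumption.

```python
import torch.nn as nn

class NetTUnit(nn.Module):
    """One Net-T unit: SA -> CA -> FFN, each followed by a norm layer and a residual add."""
    def __init__(self, channels=48, heads=2):
        super().__init__()
        self.sa = SelfAttention(channels, heads)
        self.ca = CrossAttention(channels, heads)
        self.ffn = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # GroupNorm(1, C) stands in for LayerNorm on (B, C, H, W) tensors (assumption)
        self.norm1 = nn.GroupNorm(1, channels)
        self.norm2 = nn.GroupNorm(1, channels)
        self.norm3 = nn.GroupNorm(1, channels)

    def forward(self, x, y):
        x = x + self.norm1(self.sa(x))       # X1 = X + LN(SA(X))
        x = x + self.norm2(self.ca(x, y))    # X2 = X1 + LN(CA(X1, Y))
        x = x + self.norm3(self.ffn(x))      # X3 = X2 + LN(FFN(X2))
        return x
```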

Next, we will introduce the attention component of the Net-T unit in detail.
3.2.1. Self-Attention Module
It is very difficult to apply the Transformer directly to image restoration, because the standard Transformer computes the correlation matrix between all spatial locations. For an input feature map X of size C × H × W, the standard Transformer produces query, key, and value matrices Q, K, V of size HW × C. The number of multiplications required to compute the attention map QK^T, of size HW × HW, grows quadratically with the image resolution:

Ω(SA) = (HW)^2 × C.
It is therefore not appropriate to use a standard Transformer on high-resolution feature maps. The Swin Transformer computes the attention map within local windows and gradually enlarges the receptive field by shifting the windows, but this is not in line with our intention of adopting a global receptive field. Following Restormer [43], we introduce transposed attention to replace vanilla SA.
The SA module based on transposed attention is shown in Figure 5(a), and its PyTorch-style pseudocode is shown in Algorithm 2. In our implementation, for a given input X, SA first generates Q, K, and V through three groups of 1 × 1 pointwise convolution and 3 × 3 depthwise convolution, followed by a reshape operation:

Q = R(W_d^Q W_p^Q X), K = R(W_d^K W_p^K X), V = R(W_d^V W_p^V X),

where W_p and W_d denote the 1 × 1 and 3 × 3 depthwise convolutions and R represents the reshape operation that flattens the spatial dimensions, so that Q, K, and V have size HW × C. Unlike vanilla SA, we compute the attention map across the feature channels rather than in the spatial dimension: instead of calculating QK^T, we calculate K^T Q to obtain a C × C attention map A, rather than the standard HW × HW attention map. This method has the following advantages. First, the number of multiplications required to compute K^T Q grows only linearly with the image resolution,

Ω(transposed SA) = HW × C^2;

for example, on a 56 × 56 feature map with C = 48 channels (the size our patch embedding produces for a 112 × 112 training patch), this reduces the cost of the attention map from about (56 × 56)^2 × 48 ≈ 4.7 × 10^8 multiplications to 56 × 56 × 48^2 ≈ 7.2 × 10^6.

Figure 5: (a) The self-attention module based on transposed attention. (b) The cross-attention module.

Algorithm 2: PyTorch-style pseudocode of the SA module.
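Because the original Algorithm 2 box is not reproduced here, the following is a minimal PyTorch sketch of transposed (channel-wise) self-attention in the spirit of Restormer [43]; the layer names, the L2 normalization of Q and K, and the output projection are assumptions rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Transposed self-attention: the attention map is computed across channels,
    so its cost grows linearly with the number of pixels (a sketch, not the official code)."""
    def __init__(self, channels=48, heads=2):
        super().__init__()
        self.heads = heads
        self.temperature = nn.Parameter(torch.ones(heads, 1, 1))   # learnable scaling alpha
        self.to_qkv = nn.Conv2d(channels, channels * 3, 1)         # 1x1 pointwise projection
        self.dwconv = nn.Conv2d(channels * 3, channels * 3, 3, padding=1, groups=channels * 3)
        self.project_out = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.dwconv(self.to_qkv(x)).chunk(3, dim=1)
        # reshape to (batch, heads, channels_per_head, pixels)
        q = q.reshape(b, self.heads, c // self.heads, h * w)
        k = k.reshape(b, self.heads, c // self.heads, h * w)
        v = v.reshape(b, self.heads, c // self.heads, h * w)
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature   # channel-by-channel attention map
        attn = attn.softmax(dim=-1)
        out = (attn @ v).reshape(b, c, h, w)
        return self.project_out(out)
```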
Second, it still implicitly models the global relationship between pixels.
Thus, the process of SA is defined as

X_out = V · Softmax(α · K^T Q),

where X_out is the output of the SA module and α is a learnable scaling parameter. Following Restormer, we use a multihead attention mechanism and set the number of heads to 2.
3.2.2. Cross-Attention Module
The CA module is the other attention component of the Net-T unit. Unlike SA, CA takes two inputs: one is the output feature of the previous step, and the other is the output feature of the corresponding unit of the first stage, as shown in Figure 2. To keep this correspondence with the first stage, Net-T and Net-C have the same number of units.
The function of the CA module is to let the semantic features of Net-T and Net-C interact. Its processing flow is similar to that of the SA module: except for the way Q, K, and V are obtained, the remaining steps are exactly the same. In the CA module, Q comes from the Net-T feature X, while K and V come from the first-stage feature Y:

Q = R(W_d^Q W_p^Q X), K = R(W_d^K W_p^K Y), V = R(W_d^V W_p^V Y).
Figure 5(b) shows the idea of our cross-attention, in which the fusion involves both X and Y. Its PyTorch-style pseudocode is shown in Algorithm 3. Because X has already learned its own abstract information in the SA step, interacting with Y helps it obtain information from a different stage. Thanks to the attention mechanism, the CA module can selectively receive the results of the first stage, providing complementary information for the current output while avoiding redundancy. CA has several advantages. First, it helps to propagate contextual features from Net-C to Net-T. Second, the features of one stage enrich the features of the next stage. Third, the network optimization becomes more stable because the information flow between stages is simplified.
Algorithm 3: PyTorch-style pseudocode of the CA module.
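Likewise, the Algorithm 3 box is not reproduced here. The sketch below shows one plausible cross-attention implementation in which the query comes from the Net-T feature x and the key/value from the first-stage feature y; this assignment, like the layer names, is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """Transposed cross-attention between the Net-T feature x and the Net-C feature y."""
    def __init__(self, channels=48, heads=2):
        super().__init__()
        self.heads = heads
        self.temperature = nn.Parameter(torch.ones(heads, 1, 1))
        self.to_q = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                  nn.Conv2d(channels, channels, 3, padding=1, groups=channels))
        self.to_kv = nn.Sequential(nn.Conv2d(channels, channels * 2, 1),
                                   nn.Conv2d(channels * 2, channels * 2, 3, padding=1, groups=channels * 2))
        self.project_out = nn.Conv2d(channels, channels, 1)

    def forward(self, x, y):
        b, c, h, w = x.shape
        q = self.to_q(x)                       # query from the current (Net-T) feature
        k, v = self.to_kv(y).chunk(2, dim=1)   # key/value from the first-stage (Net-C) feature
        q = F.normalize(q.reshape(b, self.heads, c // self.heads, h * w), dim=-1)
        k = F.normalize(k.reshape(b, self.heads, c // self.heads, h * w), dim=-1)
        v = v.reshape(b, self.heads, c // self.heads, h * w)
        attn = ((q @ k.transpose(-2, -1)) * self.temperature).softmax(dim=-1)
        out = (attn @ v).reshape(b, c, h, w)
        return self.project_out(out)
```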
3.3. Loss Function
For an input rainy image O, our network finally outputs the corresponding rain-removed image B̂. We use the negative SSIM loss to optimize this process. SSIM measures the similarity of two images in terms of luminance, contrast, and structure; the larger the SSIM value, the better the restoration quality. To train the network by minimization, the negative of SSIM is used as the loss:

L_ssim(B̂) = −SSIM(B̂, B_gt),

where B_gt is the ground truth. Specifically, both stages output a rain-removed image, and we apply this loss to the output of each stage. In addition, to further improve the rain removal effect, during training the output feature maps of every unit are converted into rain-removed images B̂^k (k = 1, ..., K, where K is the number of units in each stage) through the restoration module, and an additional SSIM loss is imposed on them. The loss of the whole network can therefore be written as

L = λ_1 Σ_k L_ssim(B̂_C^k) + λ_2 Σ_k L_ssim(B̂_T^k) + λ_3 L_ssim(B̂_C^K) + λ_4 L_ssim(B̂_T^K),

where B̂_C^k and B̂_T^k denote the images restored from the k-th unit of Net-C and Net-T, respectively.
The last two terms ensure the quality of the final rain-removed image by adding extra loss to the output of the last unit of each stage. The whole loss function is thus balanced by the four hyperparameters λ_1, λ_2, λ_3, and λ_4, whose values are fixed in our implementation.
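As an illustration, a simplified negative SSIM loss can be written in PyTorch as follows. A uniform window is used here instead of the standard Gaussian window, and images are assumed to lie in [0, 1]; this is a sketch, not the training code used in the paper.

```python
import torch
import torch.nn.functional as F

def neg_ssim_loss(pred, target, window=11, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified negative SSIM: local means/variances via average pooling (uniform window)."""
    pad = window // 2
    mu_x = F.avg_pool2d(pred, window, stride=1, padding=pad)
    mu_y = F.avg_pool2d(target, window, stride=1, padding=pad)
    sigma_x = F.avg_pool2d(pred * pred, window, stride=1, padding=pad) - mu_x ** 2
    sigma_y = F.avg_pool2d(target * target, window, stride=1, padding=pad) - mu_y ** 2
    sigma_xy = F.avg_pool2d(pred * target, window, stride=1, padding=pad) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
               ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    return -ssim_map.mean()

# usage: loss = neg_ssim_loss(derained, ground_truth)
```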
4. Experiments
In this section, we conduct ablation experiments on the structure of the proposed CNCT and compare it with state-of-the-art methods to verify its effectiveness. Our ablation experiments cover the CA and SA modules, the impact of the loss function, and the necessity of combining the two networks. We then compare our network with several state-of-the-art methods.
Our network is implemented in PyTorch. Training and testing were carried out on an NVIDIA Tesla V100 32 GB GPU. Our settings follow previous work [19, 44]. Specifically, we use a sliding window with a size of 112 and a stride of 96 to segment the images into patches. During training, the batch size is 16 and the initial learning rate is 1e−3. The whole network is trained for 100 epochs, and the learning rate is decreased by a factor of 5 at epochs 30, 50, and 80. All tests use the model from the final epoch.
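The schedule above can be expressed in PyTorch roughly as follows; the choice of Adam as the optimizer is an assumption (the paper does not state it), and the model, data loader, and loss function are taken from the sketches given earlier.

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

model = CNCT()                                  # the network sketched in Section 3
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = MultiStepLR(optimizer, milestones=[30, 50, 80], gamma=0.2)  # "decrease by 5 times"

for epoch in range(100):
    for rainy, clean in train_loader:           # 112 x 112 patches, batch size 16 (assumed loader)
        out1, out2 = model(rainy)               # stage-1 and stage-2 derained outputs
        loss = neg_ssim_loss(out1, clean) + neg_ssim_loss(out2, clean)  # per-unit terms omitted
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```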
4.1. Ablation Experiment
All ablation experiments were performed on Rain100H [14]. The training set includes 1800 images, and the test set includes 100 images. We use the average PSNR and SSIM of 100 test images as the evaluation results.
Attention module: the Transformer structure we use includes two parts, SA and CA. To verify their importance, we performed ablation experiments on each part. Table 1 shows the average PSNR and SSIM of the rain-removed images obtained on Rain100H with different Transformer structures. Removing CA while retaining SA decreases PSNR by 0.27 dB; removing SA while retaining CA decreases PSNR by 0.30 dB. This shows that both SA and CA are necessary for our Transformer structure: they work together to give the model stronger image restoration ability.
Loss function: in deep learning tasks, the design of the loss function has a great impact on the final result. Table 2 compares the average PSNR and SSIM values obtained on Rain100H after 100 epochs of training with different settings of the loss hyperparameters. Rows 1 and 5 show that it is necessary to apply the loss to the first stage; otherwise, the Transformer cannot optimize the output. In the second row, we set the weights of the Net-T losses to 0; that is, we trained only Net-C and used its output as the final result. The PSNR obtained by training the CNN and Transformer together is 0.64 dB higher than that obtained by training the CNN alone. In the third row, we set the weights of the per-unit losses to 0, and PSNR decreased by 0.91 dB, indicating that adding an additional loss after each unit produces a better rain removal effect.
Transformer vs. convolution: to verify the necessity of combining CNN with Transformer, we replaced all the attention modules in Net-T with 3 × 3 convolutions while keeping everything else unchanged. The experimental results are shown in Table 3. Using the Transformer is 0.80 dB higher in PSNR than using the CNN replacement. This indicates that the performance improvement does not come simply from increasing the network depth; compared with the plain convolution block, the proposed combination method is effective.
4.2. Evaluation on Synthetic Datasets
It is impractical to obtain rainy images together with the corresponding rain-free images of the same real scene. Therefore, we train and test CNCT on synthetic image pairs. We train models on RainTrainH [14] and RainTrainL [14], which contain heavy-rain and light-rain images, respectively. The model trained on RainTrainH is tested on Rain100H, Rain200H, and Rain12 [45], and the model trained on RainTrainL is tested on Rain100L.
To demonstrate the superiority of CNCT, we compare our method with the traditional method GMM [12] and the state-of-the-art deep learning methods RESCAN [46], PreNet [19], RMUN [47], TS-CGAN [48], LSPN [49], SSDRNet [50], MPRNet [21], and Restormer [43]. MPRNet is a multistage CNN architecture, and Restormer is a pure Transformer architecture. For GMM, we directly run the open-source code and obtain the results on the above test sets.
Since the code of methods [47–49] is not available, we quote the comparison results reported in their papers. For the other methods, if no pretrained model is available, we retrain them using the implementations provided by the authors. For Restormer, we use an unofficial reproduction (https://github.com/leftthomas/Restormer). Following [21], we calculate SSIM and PSNR in the YCbCr color space.
Table 4 shows the average PSNR and SSIM values obtained by our method and the other methods on the different datasets. In Table 4, bold and underlined values indicate the best and second-best results, respectively. Our CNCT achieves the highest average PSNR and SSIM on Rain100H and Rain100L, and comparable results on Rain200H and Rain12. It is worth pointing out that CNCT has only 4.0 M parameters, whereas Restormer has 26.1 M, a much larger model. This shows that our structure learns feature representations for image restoration very efficiently.
Figure 6 shows the rain removal results on two groups of images from the Rain100H test set. We only show the methods that are reproducible or open source. It can be seen that the result of CNCT is clearly superior to the other methods in visual quality and detail preservation. The traditional algorithm GMM struggles to remove heavy rain. The neural network-based methods RESCAN, PreNet, and SSDRNet improve the performance, but only to a limited extent. As state-of-the-art methods, MPRNet and Restormer still have some defects in preserving details: in the enlarged regions, their restorations show some blur, and the texture details of the letters and fences are not very satisfactory. Our method not only removes the rain lines but also retains the edge information, and the result is basically consistent with the ground truth.

4.3. Evaluation on Real-World Datasets
In the previous section, we showed that our model achieves the best performance on synthetic datasets. However, rain lines in natural scenes are more complex. Following [14], we test the effectiveness of removing rain lines in natural scenes using the model trained on RainTrainH.
Figure 7 shows the rain removal results of our model on real-world images. Since there is no ground truth for real scenes, we compare our model with the others only in terms of subjective visual quality. It can be seen that the rain removal results of the other methods are not ideal: many rain traces remain, and the results appear blurry and of low quality. Our method removes the rain lines as thoroughly as possible while retaining more details.

5. Conclusions
This paper proposes an end-to-end rain removal network that combines the CNN and Transformer structures. The network consists of two subnetworks, Net-C and Net-T, for single-image rain removal. We fully combine the advantages of CNN and Transformer to achieve a better rain removal effect. Net-C adopts a CNN architecture and provides spatially accurate but semantically less reliable output; Net-T adopts a Transformer architecture to further refine the output of the previous subnetwork. We use cross-attention combined with skip connections to achieve better information flow, so that the network can make full use of shallow information to complete the rain removal task. Extensive experimental results show that our method performs better than state-of-the-art methods. The experiments also demonstrate that the CNN structure can be effectively combined with the Transformer to fully exploit the strengths of the different structures, which offers a new way of building networks for researchers limited by the drawbacks of CNN or Transformer alone. In future research, it will also be important to explore the application of this network to other image restoration tasks.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.