Abstract

For single image dehazing, an end-to-end multistage network with multiattention is proposed in this paper. The network contains two different stages: the first stage uses an encoder-decoder subnet to obtain contextual features, and the second stage adopts a single-scale pipeline to provide spatial image details. At each stage, ground-truth supervision is provided, and an attention mechanism is used between the two stages, so the features learned in the previous stage are refined before passing to the next stage. A basic multiattention unit that combines channel attention, spatial attention, and pixel attention is designed to assign larger weights to important features, and positional normalization, which normalizes exclusively across channels, is used in the multiattention unit to improve training. Experimental results on several benchmarks indicate that the proposed network outperforms state-of-the-art methods both quantitatively and qualitatively.

1. Introduction

Image dehazing is a challenging task in the field of image restoration. Since there are infinite feasible solutions, it is a highly ill-posed problem. The atmosphere scattering model [1, 2] provides a simple and effective formula for the problem:

$$I(x) = J(x)\,t(x) + A\,(1 - t(x)), \tag{1}$$

where $I(x)$ is the hazy image, $J(x)$ stands for the clear image, the global atmospheric light $A$ represents the intensity of the scattered light of the scene, and the transmission map $t(x)$ describes the attenuation in intensity.

Let the clear image $J(x)$ be the output; formula (1) can be rewritten as follows:

$$J(x) = \frac{I(x) - A}{t(x)} + A. \tag{2}$$

It can be observed from formula (2) that the goal of single image dehazing is to restore the clear image $J(x)$ from the hazy image $I(x)$ by estimating $t(x)$ and $A$. Since only $I(x)$ is known, it is difficult to restore the clear image $J(x)$.
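As a concrete illustration of formula (2), the following is a minimal NumPy sketch (not part of the proposed method) that recovers $J$ under the assumption that $t$ and $A$ have already been estimated by some external method; the lower bound `t0` is a common heuristic to avoid dividing by near-zero transmission:

```python
import numpy as np

def recover_clear_image(I, t, A, t0=0.1):
    """Invert formula (1) via formula (2): J = (I - A) / t + A.

    I  -- hazy image, float array in [0, 1], shape (H, W, 3)
    t  -- estimated transmission map, shape (H, W, 1)
    A  -- estimated global atmospheric light, scalar or shape (3,)
    t0 -- lower bound on t; avoids amplifying noise where t -> 0
    """
    t = np.clip(t, t0, 1.0)
    J = (I - A) / t + A
    return np.clip(J, 0.0, 1.0)
```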

Recent decades have witnessed significant progress in image dehazing, and many techniques have been proposed. Early works were mostly based on priors such as the atmosphere scattering model, and these methods often try to design hand-crafted features to estimate $t(x)$ in formula (2). However, such methods are sensitive to image variations such as changes in viewpoint, illumination, and scene [3]. With the success of deep convolutional neural networks (CNNs) in the community of image processing and computer vision, many CNN-based image dehazing methods have been proposed [4–6], which can directly regress the intermediate transmission map or the final haze-free image. Compared to early methods based on hand-designed features, CNN-based methods achieve superior performance and robustness.

Network design is a primary reason for the superior performance achieved by CNN-based methods. Many network modules have been introduced for image dehazing, including residual dense connections [7, 8], attention mechanisms [8, 9], encoder-decoders [10, 11], and generative models [12, 13]. Nevertheless, most of them are single-stage models. On the other hand, multistage models have been shown to be more effective than single-stage models in different vision tasks such as segmentation and pose estimation. Recently, a few efforts have adopted multistage networks to solve image deblurring and image deraining [14–16]. We analyze those methods and find several bottlenecks that limit their performance. First, the existing multistage networks use the same architecture in different stages, either an encoder-decoder architecture or a single-scale pipeline. The encoder-decoder architecture [11] provides broad contextual information but lacks image spatial details, while the single-scale pipeline is effective in preserving spatially accurate details but unreliable in extracting semantic information. We combine the two architectures in a multistage network for image dehazing; as far as we know, this is the first attempt to do so for this problem. Second, we do not naively pass the output of the previous stage to the next stage [15]: ground-truth supervision is provided in the first stage to refine the feature map before moving to the next stage. Third, most attention modules are single and limited; for example, channel attention can extract the interdependencies among channels but lacks spatial information. We combine different attention mechanisms to address these limits: the proposed multiattention combines channel attention, spatial attention, and pixel attention to extract more important information.

In summary, the main contributions of this work are as follows:
(1) We employ a multistage network that combines two different architectures. The proposed multistage network is capable of extracting broad contextual and spatially detailed information.
(2) At each stage, ground-truth supervision is provided, and an attention mechanism is used between adjacent stages. Under the supervision of the ground-truth image, the features learned in the previous stage are refined before moving to the next stage.
(3) A multiattention unit (MAU) is proposed that combines channel attention (channel-wise), spatial attention (spatial-wise), and pixel attention (pixel-wise) to assign larger weights to important features.
(4) Positional normalization [17] (PONO), which is position-dependent and reveals the structural information at a particular layer of the deep net, is adopted to improve training performance.

2. Related Work

Most dehazing approaches follow a similar three-step methodology based on the atmospheric scattering model: (1) estimating the transmission map $t(x)$ from the hazy image samples; (2) estimating the global atmospheric light $A$ using empirical methods; and (3) computing the clear image according to formula (2). Most of the work focuses on the first step. There are two ways to estimate $t(x)$: physically grounded priors and fully data-driven approaches.

Early methods based on physically grounded priors often require multiple images of the same scene under different conditions [2, 18–20]. However, these methods do not work when there is only one image of a scene. The dark channel prior (DCP) [21] is the most successful prior-based method and has been followed by many successors. Gibson et al. [22] adopted a standard median filter to improve the DCP computing speed. An effective contextual regularization based on boundary constraints was proposed in [23] to restore the hazy image. Based on depth estimation, a color attenuation prior [24] was proposed for haze removal. Berman et al. [25] assumed that an image contains only several hundred distinct colors and proposed a nonlocal method. However, such priors are computationally expensive and unreliable.

With the success of deep learning in diverse computer vision tasks, data-driven dehazing approaches have become popular. To avoid inaccurate parameter estimation and hand-crafted feature design, these algorithms use convolutional neural networks (CNNs) to learn directly from data.

Single-Stage Networks: currently, most single image dehazing methods are based on single-stage networks. AOD-Net [26] is the first end-to-end network to generate clean images directly. It is a lightweight CNN but still performs much better than prior-based methods. EPDN [27] adopts a generative adversarial network to solve image dehazing without relying on the physical scattering model. Zhao et al. proposed a weakly supervised refinement framework called RefineDNet [28], which outperforms other weakly supervised methods but is weaker than supervised networks. DehazeFlow [29] proposes a framework for single image dehazing based on conditional normalizing flows.

MultiStage Networks: the existing multistage networks usually use an identical architecture in different stages, such as GridDehazeNet [8] and the gated fusion network [30]. The information generated by the previous stage always flows naively to the next stage to refine the restored image [3]. However, using the same subnetwork for each stage may yield suboptimal results, and the naive connection between adjacent stages is also a bottleneck, as shown in our experiments.

Attention: attention mechanisms are widely used in both high-level computer vision tasks, including image classification [31] and object detection [32], and low-level computer vision tasks such as image dehazing [8, 9], deraining [16], and deblurring [14, 15]. The main idea is to capture long-range interdependencies in a channel-wise, spatial-wise, or pixel-wise manner.

3. Proposed Method

In this section, we discuss the details of the proposed network, MSNet. MSNet is a multistage network with multiattention; it is a trainable end-to-end network that does not rely on the atmosphere scattering model. MSNet consists of two stages, as shown in Figure 1: the first stage is based on an encoder-decoder network that learns contextual information, and the second stage is a single-scale pipeline that provides the spatial image details. Inspired by [33], a supervised attention block (SAB) is used between the two stages. Under the supervision of clear images, the feature maps of the first stage are refined by the SAB before flowing to the second stage.

3.1. Multiattention Unit

In our framework, a multiattention unit (MAU) is proposed as the basic unit. The architecture of the basic MAU is depicted in Figure 2; it consists of two convolution layers, a local residual connection, and a multiattention block. The convolution layers are activated by ReLU, and the second convolution layer adopts positional normalization (PONO) with moment shortcuts (MS) [17] to normalize the activations. A global residual connection links the input feature and the output feature. With the local and global residual connections, the low-frequency information from the input features can be passed through the skip connections.

The multiattention block combines channel attention, pixel attention, and spatial attention, so it can deal with both nonlocal and local information and expands the representational ability of CNNs. The architecture of the multiattention block is depicted in Figure 3.

3.1.1. Channel Attention

Usually, a network uses a number of convolutional layers to capture the neighboring spatial dependencies within local receptive fields. However, global spatial patterns also need to be considered under complicated nonuniform haze conditions. When a neighborhood of the image contains a strong haze component, contextual information from clear regions may be required. Recently, a channel attention module [31] has been proposed to capture richer nonlocal features by modeling the interdependencies among channels. Thus, we adopt a channel attention module to extract nonlocal context features, and the different weights of the different channel feature maps are learned by the channel attention module.

Firstly, global average pooling is used to capture the channel-wise global spatial features:

$$g_c = H_p(F_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_c(i, j), \tag{3}$$

where $H_p$ denotes the global average pooling function and $F_c(i, j)$ is the value of the $c$th channel of the input $F$ at position $(i, j)$. The dimension of the feature map changes from $C \times H \times W$ to $C \times 1 \times 1$, where $C$ denotes the number of channels and $H \times W$ is the size of the feature map.

Then, two convolution layers are applied to get the weights of the different channels, and the first convolution layer uses PONO to normalize the activations:

$$CA_c = \sigma(\mathrm{Conv}(\delta(\mathrm{Conv}(g_c)))), \tag{4}$$

where $\delta$ is the ReLU function used to activate the first convolution layer and $\sigma$ is the sigmoid function used to activate the second convolution layer.

Finally, the channel-attended feature is computed by element-wise multiplying the input $F_c$ and the channel weight $CA_c$:

$$F_c^{*} = CA_c \otimes F_c. \tag{5}$$
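As an illustration, a minimal PyTorch sketch of this channel attention design follows; the 1×1 kernel sizes and the channel-reduction ratio of 8 are assumptions for the sketch and are not specified in the paper, and PONO is omitted for brevity:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze channels with global average pooling, then predict
    per-channel weights with two convolutions (formulas (3)-(5))."""
    def __init__(self, channels: int, reduction: int = 8):  # reduction assumed
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # C x H x W -> C x 1 x 1
        self.conv1 = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.conv2 = nn.Conv2d(channels // reduction, channels, kernel_size=1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        g = self.pool(f)                                           # formula (3)
        w = torch.sigmoid(self.conv2(torch.relu(self.conv1(g))))  # formula (4)
        return f * w                                               # formula (5)
```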

3.1.2. Pixel Attention

Haze pixels of varying density may be distributed over the whole image, so we adopt a pixel attention module to extract the pixel-wise varying features from the image. The module learns weights adaptively from pixels, so the network can learn more informative features from thickly hazed pixels and high-frequency image regions.

The architecture of the pixel attention module is depicted in Figure 3; it consists of two convolution layers and a sigmoid activation function, and the first convolution layer uses PONO to normalize the activations.

Then, we element-wise multiply the channel attention output $F^{*}$ and the pixel attention map $PA$ to obtain the channel-pixel attention map:

$$\tilde{F} = F^{*} \otimes PA. \tag{6}$$
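Continuing the sketch, a possible pixel attention module under the same assumptions (the layer widths are illustrative, not taken from the paper; PONO again omitted):

```python
import torch
import torch.nn as nn

class PixelAttention(nn.Module):
    """Predict a per-pixel weight map from the channel-attended
    features and apply it element-wise (formula (6))."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.conv2 = nn.Conv2d(channels // 8, 1, kernel_size=1)

    def forward(self, f_star: torch.Tensor) -> torch.Tensor:
        pa = torch.sigmoid(self.conv2(torch.relu(self.conv1(f_star))))
        return f_star * pa   # broadcast the 1 x H x W map over all channels
```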

3.1.3. Spatial Attention

Spatial attention is designed to exploit the spatial attention map from the input convolutional features $F$. The spatial attention module first applies global average pooling on $F$ along the channel dimension and outputs a feature map of size $1 \times H \times W$. The feature is then passed through a convolution layer and a sigmoid activation to get the spatial attention feature.

Finally, the spatial attention map and channel-pixel attention map are concatenated, and then the concatenated feature map is passed through a convolution layer to obtain the multiattention map.
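A sketch of the spatial branch and the final fusion, reusing the ChannelAttention and PixelAttention classes above; the 7×7 kernel and the choice to pass spatially weighted features (rather than the raw weight map) into the concatenation are assumptions, since the text does not pin these down:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Pool across channels, predict a 1 x H x W weight map, and
    weight the input features with it."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=7, padding=3)  # kernel assumed

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        pooled = f.mean(dim=1, keepdim=True)            # GAP along channels
        return f * torch.sigmoid(self.conv(pooled))     # spatial attention feature

class MultiAttention(nn.Module):
    """Fuse the channel-pixel branch and the spatial branch by
    concatenation followed by a convolution, as described above."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)            # sketched earlier
        self.pa = PixelAttention(channels)              # sketched earlier
        self.sa = SpatialAttention()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        cp = self.pa(self.ca(f))                        # channel-pixel attention map
        sp = self.sa(f)                                 # spatial branch
        return self.fuse(torch.cat([cp, sp], dim=1))    # multiattention map
```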

3.2. Encoder-Decoder Subnetwork

The encoder-decoder subnetwork is based on the standard U-Net [34], as shown in Figure 4. Each scale of the subnet uses a UBlock, which contains several MAUs to extract feature maps, and two down-sampling layers are adopted to reduce the size of the input map and thus the computation. The skip connections are also processed by a UBlock and then concatenated with the corresponding decoder layer; they enhance the detailed information of the image. The down-sampling and up-sampling are implemented by convolution layers.

3.3. Single-Scale Subnet

The single-scale subnet in the second stage consists of several multiattention groups (MAGs), each of which contains several MAUs and a shortcut; the module is depicted in Figure 5. With the densely stacked attention modules, the net can generate high-resolution, detail-enriched features from the input.

3.4. Supervised Attention Block

Inspired by [30], a supervised attention block (SAB) is used between the two stages, and the architecture of the SAB is shown in Figure 6. The SAB uses a ground-truth image to supervise the feature maps at the encoder-decoder stage. With the supervision of the ground truth, the encoder-decoder stage provides more informative features to the next stage.

The SAB takes the output $F_{in} \in \mathbb{R}^{C \times H \times W}$ of the encoder-decoder stage as input, where $H \times W$ is the spatial dimension of the features and $C$ denotes the number of channels. After being processed by a convolution, $F_{in}$ is added to the input hazy image to obtain an intermediate dehazed image $\hat{J}_1$, and a ground-truth image is provided here to supervise this prediction. Then $\hat{J}_1$ is processed by a convolution layer with a sigmoid activation to generate the attention maps $M$. Next, we element-wise multiply $M$ and the transformed $F_{in}$ produced by another convolution layer. Finally, a shortcut from $F_{in}$ is used to generate the output, which is passed to the next stage.
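A hedged PyTorch sketch of this dataflow (kernel sizes are assumptions; the supervision loss on the intermediate dehazed image is computed in the training loop and is not shown here):

```python
import torch
import torch.nn as nn

class SupervisedAttentionBlock(nn.Module):
    """Produce an intermediate dehazed image for ground-truth
    supervision and use it to gate the stage-1 features."""
    def __init__(self, channels: int):
        super().__init__()
        self.to_img = nn.Conv2d(channels, 3, kernel_size=3, padding=1)
        self.to_att = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        self.transform = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, f_in: torch.Tensor, hazy: torch.Tensor):
        dehazed = self.to_img(f_in) + hazy          # intermediate restored image
        att = torch.sigmoid(self.to_att(dehazed))   # attention maps M
        out = f_in + att * self.transform(f_in)     # gated features + shortcut
        return out, dehazed                          # dehazed is compared to GT
```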

3.5. Positional Normalization and Moment Shortcut

Normalizing activations is considered one of the key tricks for training deep networks, and several normalization methods, such as batch normalization, have been proposed to improve performance. Different from prior normalization schemes, positional normalization (PONO) is position-dependent and reveals the structural information at a particular layer of the deep net. It normalizes exclusively over the channels at each spatial position, so it is translation, scaling, and rotation invariant. PONO computes the mean and standard deviation across channels:

$$\mu_{b,h,w} = \frac{1}{C}\sum_{c=1}^{C} X_{b,c,h,w}, \qquad \sigma_{b,h,w} = \sqrt{\frac{1}{C}\sum_{c=1}^{C}\left(X_{b,c,h,w} - \mu_{b,h,w}\right)^{2} + \epsilon}, \tag{7}$$

where $\epsilon$ is a small stability constant.

Moment shortcut (MS) fast-forwards the PONO statistics $\mu$ and $\sigma$ to a later layer, as shown in Figure 7.

The two moments of the activations are extracted from the early layer and sent to the corresponding later layer as

$$F'(x) = \gamma \, F(x) + \beta, \tag{8}$$

where $F$ denotes the intermediate layers, and $\gamma$ and $\beta$ are predicted from $\mu$ and $\sigma$ via a shallow convolution layer.
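A minimal functional sketch of PONO and MS in PyTorch; in the network, $\gamma$ and $\beta$ would be predicted from $\mu$ and $\sigma$ by a shallow convolution layer, which is omitted here for brevity:

```python
import torch

def pono(x: torch.Tensor, eps: float = 1e-5):
    """Positional normalization (formula (7)): normalize across the
    channel dimension independently at every spatial position."""
    mu = x.mean(dim=1, keepdim=True)                              # B x 1 x H x W
    sigma = (x.var(dim=1, keepdim=True, unbiased=False) + eps).sqrt()
    return (x - mu) / sigma, mu, sigma

def moment_shortcut(f_x: torch.Tensor, gamma: torch.Tensor,
                    beta: torch.Tensor) -> torch.Tensor:
    """Moment shortcut (formula (8)): reinject the moments extracted
    by PONO into a later layer's output."""
    return gamma * f_x + beta
```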

3.6. Loss Function

The perceptual loss, mean squared error (MSE) loss, GAN loss, and $L_1$ loss are widely used in many dehazing networks. The research in [35] points out that the smooth $L_1$ loss provides better PSNR and SSIM metrics in many image restoration tasks, so we use the smooth $L_1$ loss to train the network:

$$L_s = \frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{3} F_s\!\left(\hat{J}^{c}(i) - J^{c}(i)\right), \tag{9}$$

where

$$F_s(e) = \begin{cases} 0.5\,e^{2}, & \text{if } |e| < 1,\\ |e| - 0.5, & \text{otherwise,} \end{cases} \tag{10}$$

$\hat{J}^{c}(i)$ stands for the intensity of the $c$th color channel of pixel $i$ in the dehazed image, $J^{c}(i)$ is the corresponding ground-truth intensity, and $N$ is the total count of pixels in the image.

At each stage, there is a ground truth to predict, so we add the losses from the two stages to optimize the net:

$$L = \sum_{k=1}^{2} L_s^{(k)}, \tag{11}$$

where $L_s^{(k)}$ denotes the smooth $L_1$ loss of stage $k$.
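A compact sketch of the total loss, assuming `stage_outputs` holds the intermediate (SAB) and final dehazed predictions; PyTorch's built-in smooth L1 loss matches $F_s$ in formula (10) with its default settings:

```python
import torch
import torch.nn.functional as F

def msnet_loss(stage_outputs, gt):
    """Total training loss (formula (11)): sum of per-stage smooth L1
    losses (formulas (9)-(10)). PyTorch's mean reduction averages over
    pixels and channels, differing from formula (9) only by a constant
    factor."""
    return sum(F.smooth_l1_loss(pred, gt) for pred in stage_outputs)
```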

4. Experiment Results

4.1. Dataset

We evaluate the proposed network on three benchmarks: RESIDE [36], Dense-Haze [37], and a real-world dataset [38]. RESIDE contains both indoor and outdoor synthetic hazy images, which are collected from depth datasets [39] and stereo datasets [40]. After data cleaning, the Indoor Training Set (ITS) contains 1399 clear images and 13,990 hazy images; the hazy ones are generated from the clear images with varying global atmospheric light $A$ and scattering coefficient $\beta$. The Outdoor Training Set (OTS) contains 2061 clear images and 72,135 hazy images generated from the clear images in the same manner. The Synthetic Objective Testing Set (SOTS) of RESIDE is used for testing, and it contains 500 indoor images and 500 outdoor images. The images of Dense-Haze and the real-world dataset [38] are collected from the real world.

4.2. Training Settings

We resize the training images to 240 × 240 with 3 channels and, for data augmentation, randomly rotate the images by 90°, 180°, or 270° and flip them horizontally. We choose the Adam optimizer for accelerated training, where $\beta_1$ and $\beta_2$ take the default values of 0.9 and 0.999, respectively. In the encoder-decoder subnet, each UBlock contains 3 MAUs. The single-scale subnet of the second stage contains 3 MAGs, each of which consists of 8 MAUs. The channel number of the preprocessing convolution layer is set to 64, and the numbers of input and output channels in each MAU are both 64. We adopt the cosine annealing strategy [41] to adjust the learning rate from the initial value $\eta_0$ to 0 by following the cosine function:

$$\eta_t = \frac{1}{2}\left(1 + \cos\frac{t\pi}{T}\right)\eta_0, \tag{12}$$

where $T$ is the total number of training batches and $t$ is the current training batch. The batch size is set to 4, we evaluate the model every 5000 steps, and the total number of steps is set to 1,000,000.
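For reference, formula (12) can be written as a one-line schedule function:

```python
import math

def cosine_lr(step: int, total_steps: int, lr0: float) -> float:
    """Cosine annealing schedule of formula (12): decays the learning
    rate from lr0 to 0 over total_steps training batches."""
    return 0.5 * (1.0 + math.cos(math.pi * step / total_steps)) * lr0
```

PyTorch's `torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)` implements the same schedule with its default `eta_min=0`.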

4.3. Results and Analysis

In this section, we compare MSNet quantitatively and qualitatively with recent state-of-the-art image dehazing algorithms: DCP, AOD-Net, DehazeNet, GCANet, RefineDNet, GridDehazeNet, and FFA-Net. Following those methods, we use peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) for quantitative assessment of the dehazed outputs, where higher values indicate better results. The quantitative comparison results on SOTS and Dense-Haze are shown in Table 1. Among those methods, DCP is prior-based and often regarded as the baseline, while the others are deep learning based.

From Table 1, we can observe that the results of the data-driven methods are better than the result of DCP, which is prior-based. Among the data-driven methods, AOD-Net is the simplest network, so its scores are the lowest, but still much higher than DCP's. RefineDNet is a weakly supervised method, so its results are not as good as those of the fully supervised methods. Compared to FFA-Net, our results increase by about 1.8% on SOTS and about 4.6% on Dense-Haze because of the multistage and multiattention mechanisms used.

The qualitative comparisons of visual effects on SOTS are shown in Figure 8. We select three images each from the indoor dataset and the outdoor dataset; the upper three rows are indoor results, and the lower three rows are outdoor results. The first column is the hazy input, the last column is the ground truth, and the middle columns are the dehazed results from DCP, AOD-Net, GCANet, and MSNet (ours), respectively. From the results, we can see that the DCP method suffers from severe color distortion, especially in the blue sky and the halo of the sun in the last image, and it loses some details. AOD-Net cannot remove all the hazy regions from the hazy image because of its simple network architecture: in the second row of Figure 8, the fog near the tree is still there; in the fifth row, there is still fog near the bridge; and its brightness is lower. GCANet also performs poorly on the blue sky and sun halo, especially in the last two rows of Figure 8. The images recovered by our network are almost entirely in line with the real-scene information; in particular, the restoration of the blue sky and halo is much better.

We further give the qualitative comparisons on the real-world dataset [38] in Figure 9, and the results are similar to those on the SOTS dataset. DCP and GCANet still suffer from severe color distortions, such as in the blue sky of rows 1 and 2, and AOD-Net cannot remove the haze completely, so its output images are of low brightness, as in row 3. DCP also cannot remove the haze completely, such as in the sky of row 3. Although none of the methods can completely remove the hazy regions in cases like the last row of Figure 9, the other methods suffer from color distortions compared to ours, and the restoration of our method is more natural. Overall, our method performs better in image detail and color fidelity than the other methods.

4.4. Ablation Studies

In this section, we present ablation experiments to discuss the different modules of our network. The following factors are mainly concerned: (1) the number of stages; (2) the choice of architecture for each stage; and (3) the SAB and PONO. Evaluation is performed on the SOTS indoor dataset, and the training images are cropped to 48 × 48. The results are shown in Table 2, where "✘" means the module is not used and "✔" means the module is used. First, we compare the results of one stage and two stages without PONO and SAB: the results of the two-stage network (ID 2) increase by an average of 2.3% over the results of the one-stage network (ID 1), which indicates the effectiveness of the two-stage design. Second, the results of ID 2 verify the need for two different architectures in the two-stage network: when two different architectures are used, the PSNR increases. Finally, from the comparison with ID 3, we can observe that when PONO and SAB are used, the results are better.

5. Conclusion

In this work, we propose a multistage network with multiattention for image dehazing. The model consists of two different stages: one uses an encoder-decoder subnet to obtain contextual features, and the other adopts a single-scale pipeline to provide spatial image details. At each stage, ground-truth supervision is provided, an attention mechanism is used between the two stages, and a multiattention unit with positional normalization is proposed as the building block of the network. The results on several benchmarks show that the proposed network outperforms state-of-the-art methods and has a great advantage over them in terms of image detail and color fidelity.

Data Availability

The data used in this paper are all from public datasets, including RESIDE, Dense-Haze, and the real-world dataset, which can be found in the corresponding references in the paper.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Jiangsu Industry Research Project BY2020552 and the Nantong Science and Technology Program Project JC2020065.