Abstract
The traditional methods for multi-focus image fusion, such as the typical multi-scale geometric analysis theory-based methods, are usually restricted by the sparse representation ability and by how efficiently the captured features are transferred through the fusion rules. Aiming to integrate the partially focused images into a fully focused image of high quality, a complex shearlet features-motivated generative adversarial network is constructed for multi-focus image fusion in this paper. Different from the popularly used wavelet, contourlet, and shearlet, the complex shearlet provides more flexible multiple scales, anisotropy, and directional sub-bands with approximate shift invariance. Therefore, the features in the complex shearlet domain are more effective. With the help of the generative adversarial network, the whole procedure of multi-focus fusion is modeled as a process of adversarial learning. Finally, several experiments are implemented, and the results prove that the proposed method outperforms the popularly used fusion algorithms in terms of four typical objective metrics and visual comparison.
1. Introduction
The sharpness of the captured targets varies with their distance from the focal plane during the imaging procedure, that is, the closer an object is to the focal plane, the clearer it appears in the image. On the other hand, it is difficult to obtain a fully focused image in a single shot with only one imaging device [1]. A common way to deal with this problem is to fuse multiple images of the same scene, captured with different focal settings, into a single image, which is called multi-focus image fusion and has been widely used in military monitoring, image analysis, and transportation [2]. For example, in modern warfare, multi-focus images can be used to monitor important targets and facilities of the enemy, and in the transportation domain, they can be used to track logistics and vehicle information and even to penalize traffic violations.
Nowadays, there are mainly four kinds of strategies for the fusion of multi-focus images: the spatial domain methods, the early transform domain methods, the multi-scale geometric analysis theory-based methods, and the deep learning theory-based methods. The spatial domain methods usually operate directly on the image pixels, for example, the averaging, maximum selection, and weighted methods. The early transform domain methods include the Laplacian pyramid-based method, the wavelet-based method, etc.
In these methods, the multi-focus images are decomposed into different scales, and each scale has a limited number of sub-bands. Then, the features at different levels can be obtained for fusion. For example, in reference [3], the authors proposed a fusion method using the extremum of the wavelet coefficients in different sub-bands. Dou et al. [4] proposed a fusion method using the region energy of different high-pass sub-band coefficients by considering their distributions. Due to the limited number of high-pass sub-bands, block artifacts may appear along edges in these methods. To deal with these problems, multi-scale geometric analysis theory-based methods have been popularly reported in recent years. The curvelet transform, contourlet transform, non-subsampled contourlet transform (NSCT), and the shearlet are the typical decomposition tools of this period. For example, Li and Yang [5] proposed a fusion method combining the wavelet and the curvelet to overcome the disadvantages of the wavelet. He et al. [6] proposed a multi-focus image fusion method based on the improved contourlet packet. Qu et al. [7] proposed a spatial frequency-motivated PCNN model in the NSCT domain, where the spatial frequency is used to drive the firing mapping. Liao et al. [8] proposed a shearlet-based fusion method employing the statistical information of the shearlet coefficients. Considering the fusion procedure of the aforementioned methods, it is obvious that the fusion results are largely determined by the representation ability of the decomposition.
From the viewpoint of fusion rules, the fusion procedure of the multi-scale geometric analysis theory-based methods can be modeled as a classification problem on the multi-scale transformation coefficients. There are also three typical categories of fusion rules: the activity-level metric-based rule, the kernel learning-based rule, and the neural network-based rule. For the first, algebraic operations, such as averaging and maximum selection, are popularly used. The second includes ICA, SVM, and PCA. In literature [9], principal component analysis (PCA) is applied during dictionary training to reduce the dimension of the transform coefficients. In literature [10], the cartoon components and the texture components are combined by ICA. Artifacts are easily produced when the classification is decided by simple computations on single coefficients. The neural network-based rule has been popularly reported in recent years, and some good results have been obtained. For example, in literature [11], the PCNN model is used as the fusion rule in combination with the NSCT. Though good results have been obtained, these models are not abstract enough, which means the features are all low-level. So, more advanced neural network models should be developed.
Compared with the traditional neural networks, the deep neural network is the breakthrough in this domain and has been widely applied in image denoising, image recognition and classification, and image fusion [12]. For example, a novel fusion method for multi-focus images was proposed based on a support value-motivated deep convolutional neural network model in literature [13]; a general multi-focus image fusion framework, called IF-CNN, was developed based on a deep convolutional neural network in literature [14]. MFF-GAN and Pan-GAN, built on the mechanism of the unsupervised generative adversarial network, are proposed in literature [15] and [16], respectively, and a detail-preserving adversarial learning model is proposed in literature [17, 18]. In particular, according to some recent references, good results are consistently obtained by the GAN-based methods. The main reason lies in their unique characteristics: firstly, the GAN model has a more complex and deeper network structure than the commonly used neural networks; secondly, modeling the fusion process as adversarial learning is more in line with the general principle of how humans understand the world. However, the common characteristic of these well-known methods is that they are developed directly at the pixel level, so the important image features are not carefully exploited.
In order to overcome the shortcomings of the above methods, a multi-focus image fusion method based on the GAN in the complex shearlet domain is developed. Different from the traditional transformation methods, such as the curvelet and the contourlet, the complex shearlet divides the source images into low-pass and high-pass directional sub-bands with approximate shift invariance, providing more useful features. Besides, the computational efficiency of the complex shearlet is higher than that of the NSCT for achieving the same shift invariance. With the help of the GAN, the whole fusion process can be modeled as adversarial learning over the features in the complex shearlet domain. Therefore, better fusion results can be obtained at the feature level.
The rest of this paper is organized as follows. The details of the whole method are given in Section 2. Experimental results and some important discussions are given in Section 3. Finally, the paper is concluded in Section 4.
2. Methodology
Figure 1 shows the structure of the proposed method. Firstly, the images to be fused are input into the GAN, and at the same time, the complex shearlet transform is applied to obtain their high-pass sub-bands. Then, the features in the complex shearlet domain are computed to produce the new form of the loss function. The loss function is updated to drive the training of the GAN, and the final fusion result is obtained after the training is finished.
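A high-level sketch of this training procedure is given below. It is only illustrative: build_target_image() stands for the complex-shearlet-feature-guided target construction detailed in Section 2.3.2, the least-squares adversarial losses shown here are a simplified stand-in for the full losses of Section 2.3.2, and all names and hyper-parameter values are assumptions rather than the exact settings used in the experiments.

```python
# Illustrative training loop; build_target_image() is a hypothetical helper
# (Section 2.3.2), and the loss terms are simplified placeholders.
import torch
import torch.nn.functional as F

def train(G, D, loader, epochs=10, lr=1e-4, content_weight=100.0):
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    for _ in range(epochs):
        for img_a, img_b in loader:                       # pairs of partially focused images
            target = build_target_image(img_a, img_b)     # feature-guided target (Section 2.3.2)
            fused = G(torch.cat([img_a, img_b], dim=1))   # generator sees both sources

            # Discriminator update: target image labeled 1, generated image labeled 0.
            d_loss = ((D(target) - 1) ** 2).mean() + (D(fused.detach()) ** 2).mean()
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()

            # Generator update: fool the discriminator and stay close to the target content.
            g_loss = ((D(fused) - 1) ** 2).mean() + content_weight * F.l1_loss(fused, target)
            opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```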

2.1. The Complex Shearlet Transform
As one of the most famous multi-scale geometric transformation tools, the complex shearlet transform can extract directional information at different scales and deliver highly sparse approximations of 2D signals. Generally speaking, it divides the source image into low-pass and high-pass sub-band images at different levels, i.e., an approximately sparse representation of the source image and its salient feature information. Different from the discrete wavelet, contourlet, and shearlet, the complex shearlet is realized based on multi-scale pyramid filters and the Hilbert transform [19, 20]. The former gives the multiple partitions of the image, and the latter provides the directional sub-bands in the complex space. Figure 2 gives an example of the complex shearlet transform on a "Clock" image focused on the left.
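The role of the Hilbert transform can be illustrated with a rough numerical sketch. This is not the complex shearlet implementation of [19, 20]; it only assumes that a real directional high-pass sub-band is available from some real shearlet or pyramid decomposition and shows how pairing it with its Hilbert transform yields a complex-valued sub-band whose magnitude is far less sensitive to small shifts than the raw oscillating coefficients.

```python
# Rough illustration only, not the transform of [19, 20]: form the analytic
# (complex) counterpart of a real directional sub-band via the Hilbert transform.
import numpy as np
from scipy.signal import hilbert

def to_complex_subband(real_subband: np.ndarray, axis: int = 1) -> np.ndarray:
    """Analytic (complex) counterpart of a real directional sub-band."""
    return hilbert(real_subband, axis=axis)      # real part + i * Hilbert transform

def subband_feature(real_subband: np.ndarray) -> np.ndarray:
    """Magnitude of the complex sub-band, used as a shift-robust feature map."""
    return np.abs(to_complex_subband(real_subband))
```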

2.2. The Feature in the High-Pass Sub-Bands
After the complex shearlet transform, the coefficients with large absolute values correspond to sharp brightness changes or salient features, meaning that they belong to the focused regions of the source images. Since the aim is to bring the focused regions into the fused image, these regions must first be extracted by using the complex shearlet coefficients.
On the other hand, the features in the multi-focus images can be uniformly described by activity-level measurements, such as the local energy, standard deviation, and spatial frequency [21, 22]. For the above reasons, the local energy and the spatial frequency are used to represent the important features in the high-pass coefficients. Furthermore, different from the common form used in other literature, they are computed over multiple scales and directions.
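A minimal sketch of these two activity-level measures is given below. It assumes each high-pass sub-band is a 2D array of (complex) coefficients; the 3 × 3 sliding window is an illustrative choice rather than the value used in the paper.

```python
# Activity-level measures computed per sub-band (window size is illustrative).
import numpy as np
from scipy.ndimage import uniform_filter

def local_energy(subband: np.ndarray, size: int = 3) -> np.ndarray:
    """Windowed sum of squared coefficient magnitudes around each position."""
    return uniform_filter(np.abs(subband) ** 2, size=size) * size * size

def local_spatial_frequency(subband: np.ndarray, size: int = 3) -> np.ndarray:
    """Windowed spatial frequency: RMS of row and column first differences."""
    m = np.abs(subband).astype(np.float64)
    rf2 = uniform_filter(np.diff(m, axis=1, append=0.0) ** 2, size=size)  # row frequency^2
    cf2 = uniform_filter(np.diff(m, axis=0, append=0.0) ** 2, size=size)  # column frequency^2
    return np.sqrt(rf2 + cf2)
```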
2.3. The GAN Model
2.3.1. The Structure of the GAN
Usually, the complete structure of the GAN network consists of two parts: the generator and the discriminator [23, 24]. The detailed structure of the GAN model used in this paper is shown in Figures 3 and 4.


For the generator, five convolutional layers are used to extract features. A 5 × 5 convolution kernel is used in the first convolutional layer, and a 3 × 3 convolution kernel is used in the other four layers. The input of each layer is the concatenation of the outputs of all the previous layers, with the aim of speeding up the convergence and improving the stability of the model [25]. All the activation functions are set to be "ReLU," i.e., the rectified linear unit. Furthermore, layer normalization is also employed to preserve the contrast information of the source images. It calculates the average value of all the dimensional inputs in each layer and then implements the normalization operation. The advantages are reduced sensitivity to the data initialization and effective avoidance of the gradient vanishing problem [26].
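A hedged PyTorch sketch of such a generator is shown below: five convolutional layers with dense connections, a 5 × 5 kernel in the first layer and 3 × 3 kernels afterwards, ReLU activations, and layer normalization. The channel widths and the two-channel input (the concatenated source pair) are assumed values, not the exact configuration of the paper.

```python
# Sketch of a densely connected five-layer generator (widths are assumptions).
import torch
import torch.nn as nn

class DenseGenerator(nn.Module):
    def __init__(self, in_ch: int = 2, growth: int = 32):
        super().__init__()
        self.blocks = nn.ModuleList()
        ch = in_ch
        for i in range(4):                              # four feature-extraction layers
            k = 5 if i == 0 else 3
            self.blocks.append(nn.Sequential(
                nn.Conv2d(ch, growth, kernel_size=k, padding=k // 2),
                nn.GroupNorm(1, growth),                # num_groups=1 acts like layer normalization
                nn.ReLU(inplace=True)))
            ch += growth                                # dense connection grows the input channels
        self.out = nn.Conv2d(ch, 1, kernel_size=3, padding=1)  # fifth layer outputs the fused image

    def forward(self, x):
        feats = [x]
        for block in self.blocks:
            feats.append(block(torch.cat(feats, dim=1)))  # each layer sees all previous outputs
        return self.out(torch.cat(feats, dim=1))
```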
Different from the generator, the main purpose of the discriminator is to make a decision by classification. As shown in Figure 4, the discriminator has the same structure as a convolutional neural network with two inputs, i.e., the joint Laplacian enhanced image obtained from the source images and the fused image from the generator. Four layers of 3 × 3 filters are designed to implement the convolution and capture the feature information. Meanwhile, in order to reduce the loss of important information caused by the downsampling scheme, the activation function is set to be "ReLU." Finally, the fully connected layer is used for classification, and the sigmoid function is employed to output the final result.
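A corresponding sketch of the discriminator is given below: four 3 × 3 convolutional layers with ReLU activations and stride-2 spatial reduction, followed by a fully connected layer and a sigmoid output. The channel widths and the assumed 64 × 64 input patch size are illustrative values.

```python
# Sketch of a four-layer convolutional discriminator (sizes are assumptions).
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, in_ch: int = 1):
        super().__init__()
        chs = [in_ch, 32, 64, 128, 256]
        blocks = []
        for cin, cout in zip(chs[:-1], chs[1:]):
            blocks += [nn.Conv2d(cin, cout, kernel_size=3, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
        self.features = nn.Sequential(*blocks)
        self.fc = nn.Linear(256 * 4 * 4, 1)      # assumes 64 x 64 input patches

    def forward(self, x):
        f = self.features(x)
        return torch.sigmoid(self.fc(f.flatten(1)))   # probability of being "real"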
2.3.2. The Loss Function
The loss function plays the role of minimizing the training loss to obtain the ideal model, and it usually consists of the generator loss function and the discriminator loss function, as shown in the following formula:

$$L = L_{G} + L_{D},\tag{1}$$

where $L_{G}$ and $L_{D}$ are the generator loss function and the discriminator loss function, respectively.
According to the original model, $L_{G}$ is defined by formula (2). It is computed by summing the confrontation loss and the content loss from the procedure of image generation:

$$L_{G} = L_{adv} + \lambda L_{con},\quad L_{adv} = \frac{1}{N}\sum_{n=1}^{N}\left(D\left(I_{f}^{n}\right)-c\right)^{2},\quad L_{con} = L_{int}\left(I_{f},I_{t}\right) + \xi\, L_{grad}\left(I_{f},I_{t}\right),\tag{2}$$

where $\lambda$ is a balanced weight between $L_{adv}$ and $L_{con}$, $I_{t}$ is the target image to be fused, $N$ is the number of fused images, $D(I_{f}^{n})$ is the result of classification, $c$ means that the false data is recognized as true by the discriminator, $L_{int}$ is the intensity loss, $L_{grad}$ is the gradient loss, and $\xi$ is the balanced weight.
$L_{D}$ can be expressed as

$$L_{D} = \frac{1}{N}\sum_{n=1}^{N}\left(D\left(I_{LG}^{n}\right)-a\right)^{2} + \frac{1}{N}\sum_{n=1}^{N}\left(D\left(\nabla I_{f}^{n}\right)-b\right)^{2},\tag{3}$$

where $I_{LG}$ is the joint Laplace enhanced gradient map, $\nabla I_{f}$ is the gradient map of the fused image, and $a$ and $b$ are their labels, respectively.
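A hedged PyTorch sketch of formulas (1)-(3) follows. The labels a, b, c and the weights lambda_ and xi are assumed hyper-parameter values, and gradient_map() is a simple finite-difference stand-in for the gradient maps used in formula (3).

```python
# Simplified loss sketch for formulas (1)-(3); constants are assumptions.
import torch
import torch.nn.functional as F

def gradient_map(img):
    """Finite-difference gradient magnitude (padded back to the input size)."""
    gx = (img[..., :, 1:] - img[..., :, :-1]).abs()
    gy = (img[..., 1:, :] - img[..., :-1, :]).abs()
    return F.pad(gx, (0, 1, 0, 0)) + F.pad(gy, (0, 0, 0, 1))

def generator_loss(D, fused, target, c=1.0, lambda_=1.0, xi=5.0):
    adv = ((D(fused) - c) ** 2).mean()                               # confrontation loss
    intensity = F.mse_loss(fused, target)                            # L_int
    grad = F.mse_loss(gradient_map(fused), gradient_map(target))     # L_grad
    return adv + lambda_ * (intensity + xi * grad)                   # formula (2)

def discriminator_loss(D, fused, enhanced_grad, a=1.0, b=0.0):
    real = ((D(enhanced_grad) - a) ** 2).mean()                      # joint Laplace enhanced gradient map
    fake = ((D(gradient_map(fused.detach())) - b) ** 2).mean()       # gradient map of the fused image
    return real + fake                                               # formula (3)
```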
From all the formulas above, it can be seen that the target image to be fused is very important in the confrontation learning. The common way to obtain it is to average the images to be fused or to initialize it with one of the images to be fused. The drawback is that such an image is far from the final result, so much time and many resources must be spent to reach the optimal decision during the confrontation.
Therefore, a new form of the target image is proposed. Let $C_{L}^{k}(x,y)$ be the low-pass sub-band coefficient at position $(x,y)$, where $k\in\{A,B\}$ indexes the source images. The low-pass coefficient of the fused image can be obtained by

$$C_{L}^{F}(x,y)=\begin{cases}C_{L}^{A}(x,y), & E^{A}(x,y)\geq E^{B}(x,y),\\[2pt] C_{L}^{B}(x,y), & \text{otherwise},\end{cases}\tag{4}$$

where $E^{k}(x,y)$ is the local energy computed in the neighborhood of $(x,y)$.
Let $C_{l,d}^{k}(x,y)$ be the high-pass coefficient at $(x,y)$ in the direction sub-band $d$ at level $l$ after implementing the complex shearlet transformation, $k\in\{A,B\}$; then, the high-pass coefficient of the fused image can be obtained by

$$C_{l,d}^{F}(x,y)=\begin{cases}C_{l,d}^{A}(x,y), & SF^{A}(x,y)\geq SF^{B}(x,y),\\[2pt] C_{l,d}^{B}(x,y), & \text{otherwise},\end{cases}\tag{5}$$

where $SF$ is the spatial frequency, which can be computed by the following formula:

$$SF=\sqrt{RF^{2}+CF^{2}},\quad RF=\sqrt{\frac{1}{MN}\sum_{x=1}^{M}\sum_{y=2}^{N}\left[C(x,y)-C(x,y-1)\right]^{2}},\quad CF=\sqrt{\frac{1}{MN}\sum_{x=2}^{M}\sum_{y=1}^{N}\left[C(x,y)-C(x-1,y)\right]^{2}},\tag{6}$$

where $RF$ and $CF$ are the row and column frequencies computed over an $M\times N$ window of the high-pass coefficients $C$.
Then, the target image can be obtained by applying the inverse complex shearlet transform to the fused low-pass and high-pass sub-bands.
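A minimal NumPy sketch of this target-image construction is given below. It reuses the local_energy() and local_spatial_frequency() helpers sketched in Section 2.2, and complex_shearlet() / inverse_complex_shearlet() are hypothetical wrappers assumed to return and accept a (low-pass, list-of-high-pass-sub-bands) pair.

```python
# Sketch of formulas (4) and (5); complex_shearlet()/inverse_complex_shearlet()
# are hypothetical wrappers, and the activity helpers come from Section 2.2.
import numpy as np

def build_target_image(img_a, img_b):
    low_a, highs_a = complex_shearlet(img_a)
    low_b, highs_b = complex_shearlet(img_b)
    # Formula (4): keep the low-pass coefficients with the larger local energy.
    low_f = np.where(local_energy(low_a) >= local_energy(low_b), low_a, low_b)
    # Formula (5): keep the high-pass coefficients with the larger spatial frequency.
    highs_f = [np.where(local_spatial_frequency(ha) >= local_spatial_frequency(hb), ha, hb)
               for ha, hb in zip(highs_a, highs_b)]
    return inverse_complex_shearlet(low_f, highs_f)
```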
3. Experimental Results and Analysis
The experiments are implemented to show the performance of the proposed method. The platform is an Inspur big data server NF5280M4 with an Intel Xeon CPU and 256 GB RAM. 100 pairs of multi-focus images are used for training. All the data can be downloaded from the web [27–30].
Seven typical methods, i.e., the PCNN-based method (PCNN for short) [31], the contourlet-based method (contourlet for short) [32], the GAN-based method (GAN for short) [17], the DCNN-based method (DCNN for short) [33], the discrete shearlet-based method (shearlet for short) [34], the convolutional sparse representation-based method (CSR for short) [35], and the sparse representation and sum-modified-Laplacian-based method (SR-SML for short) [36], are implemented for comparison. The decomposition level of the complex shearlet is set to four.
So far, how to evaluate the quality of fusion results is still an open question. Subjective visual comparison and objective quantitative comparison are the mainstream practice in this domain. Without loss of generality, mutual information, entropy, standard deviation (MI, En, and SD for short, respectively), and QAB/F are selected as the metrics. The greater their values, the better the fused images [37–39].
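Two of these metrics are simple enough to sketch directly; the snippet below assumes 8-bit grayscale fused images. MI and QAB/F additionally require the source images and are omitted here.

```python
# Entropy and standard deviation of a fused image (8-bit grayscale assumed).
import numpy as np

def entropy(img: np.ndarray) -> float:
    """Shannon entropy of the gray-level histogram, in bits."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def standard_deviation(img: np.ndarray) -> float:
    """Global standard deviation of the gray levels."""
    return float(np.std(img.astype(np.float64)))
```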
To save space, only "Pepsi-Cola," "Plane," "Clocks," "Flower," "Cup," and "Calendar" are shown in Figure 5. All the fusion results are shown in Figures 6–11. In Figure 7, the middle part of the "Plane" is partially enlarged to compare the local detail features.


From the above results, we can see that though the focused regions are expressed better than in the source images, the fusion results differ from each other. For the PCNN-based method, blurred edges obviously occur, so the details are not clear enough. For the contourlet, shearlet, CSR, and SR-SML methods, though the results are improved, the contours are over-sharpened and ghosting occurs. This can be explained by comparing the sparse representation ability of the transforms for the important image features.
As for the GAN-based and the DCNN-based methods, the results are much clearer, but the texture information is not as good as that obtained by the proposed method. This is because these two models are learned directly from the pixels of the images to be fused; the importance of the features in the learning procedure is not fully considered. On the other hand, the texture information in the proposed method is greatly improved and the ghosting phenomenon is suppressed to the greatest extent. Furthermore, this can also be seen in the enlarged images in Figures 7 and 11. In addition, from the objective comparison in Tables 1 and 2, the best values of the four metrics are almost always obtained by the proposed method. All the above facts fully demonstrate the effectiveness and accuracy of the proposed method.
4. Conclusion
To obtain better fusion results for multi-focus images, a features-motivated generative adversarial network is constructed with the help of the complex shearlet transform. Six typical experiments have been carefully implemented to give full evidence of its effectiveness and accuracy. In the future, more complex models will be built to further improve the fusion performance.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This study was supported by the National Natural Science Foundation of China (61502282 and 61902222), the Natural Science Foundation of Shandong Province (ZR2015FQ005), and the Taishan Scholars Program of Shandong Province (tsqn201909109).