Abstract

Existing image inpainting schemes often suffer from structural disorder and blurred texture details. This is mainly because, when reconstructing the damaged area of an image, the inpainting network struggles to make full use of the information in the nondamaged area to accurately infer the content of the damaged area. This paper therefore proposes an image inpainting network driven by a multilevel attention progression mechanism. The proposed network compresses the high-level features extracted from the full-resolution image into multiscale compact features and then drives these compact features to perform attention feature progression level by level, in order of scale, so that high-level features containing both structure and details propagate fully through the network. To further achieve fine-grained image reconstruction, the paper also proposes a composite granular discriminator that imposes global semantic constraints and nonspecific local dense constraints on the inpainting process. The experimental results show that the proposed method achieves higher-quality repair results than state-of-the-art ones.

1. Introduction

Image processing is one of the most important means for human beings to express and transmit information, and it plays an irreplaceable role in human communication. However, during image transmission and storage, losses inevitably occur, which degrade the quality of the presented information. In digital images, the most common form of information loss is pixel loss. Image inpainting is an important part of image processing technology, which has been applied in many fields. In daily life, pixel loss or network loss may occur during image transmission, which not only affects the visual effect of the image but also destroys the integrity of the image information and reduces the quality of subsequent image processing.

According to the feature levels used, existing methods can be divided into two categories: (1) methods that use low-level nonsemantic features and (2) methods that use high-level semantic features [1]. Methods that use low-level nonsemantic features are traditional image inpainting methods, usually based on diffusion or on an image-block matching mechanism that pastes low-level features from the nondamaged area into the damaged area [2]. This type of method achieves an excellent repair effect on specific defect types. For example, diffusion-based methods progress from the boundary of the damaged area inward and can effectively repair small damage such as scratches. Methods based on image-block matching are powerful in background inpainting and are widely used in commercial software. However, inpainting solutions based on low-level nonsemantic features cannot develop an in-depth understanding of the context of the damaged area; that is, they cannot obtain the high-level semantic features of the image, so they cannot achieve a good repair effect on highly patterned images.

High-level semantic features greatly improve inpainting performance. Among such approaches, methods based on Generative Adversarial Nets (GANs) [3] have become the mainstream in the field of image inpainting [4]. GAN-based methods transform the image inpainting problem into conditional adversarial generation [5, 6]. They usually take the damaged image and a mask marking the damaged area as the conditional input, use an autoencoder network as the generator to reconstruct the content of the damaged area, train it adversarially against a discriminator network, and finally obtain a complete image output [7]. To utilize the features of the image context effectively, GL [8] introduced cascaded dilated convolutions into the autoencoder network. Dilated convolution can incorporate long-distance features into its receptive field to a certain extent and thereby exploit them more comprehensively. However, dilated convolution samples image features over a regular, symmetric grid with large holes, which causes the features of distant key areas to be ignored. The MC [1] (Multicolumn Convolutional), CA [2] (Contextual Attention), and CI [9] (Contextual-based Inpainting) schemes adopt single-level contextual attention.

However, these methods usually cannot generate content with a reasonable structure and rich details for the defect areas of complex scene images. As shown in Figure 1(b), the generated images exhibit obvious overall or local structural disorder and are unclear; that is, the semantic content is reconstructed only coarsely and appears blurred.

Figure 2 shows the autoencoder generative network that is commonly used in current mainstream image inpainting schemes. The defective image is encoded by the encoder to obtain shallow features, the shallow features are sent to the bottleneck area for feature extraction, and the decoder then decodes them into a complete image [10]. We found that this type of autoencoder structure suffers from a serious feature transfer obstruction: the high-level features of the bottleneck area are too large (usually 64 × 64 pixels). Such large features prevent schemes such as dilated convolution and single-level attention feature matching [11, 12] from fully acquiring structure and detail features, and at the same time hinder the spread of these features through the network, resulting in structural disorder and blurred semantic objects in the inpainting results.

As shown in Figure 3, to address the problem of blocked feature transfer, we improve the bottleneck part of the autoencoder structure in two steps. The first step is multilevel feature compression [13]. The 64 × 64 high-level features of the bottleneck network are scaled at compression ratios of 0 (no compression), 2, 4, and 8 to construct multilevel compressed features, namely, Fc0, Fc2, Fc4, and Fc8; the higher the compression ratio, the smaller the scale. Arranged by feature scale, the result is Fc0 > Fc2 > Fc4 > Fc8. The multilevel compressed features are complementary in feature expression: the smaller the feature scale, the smaller the structural feature space, and the easier it is for the network to search for meaningful structural expressions. Conversely, larger-scale features have weaker structural expression capability but carry more detailed features, making it easier for the network to search for meaningful detail expressions. This complementarity between large- and small-scale features provides great potential for the second step, multilevel attention progression, which makes full use of the advantages of different compressed features in expressing different aspects. Specifically, we perform attention matching and replacement on each level of compressed features Fc8, Fc4, Fc2, and Fc0 to obtain attention features, and the attention features are propagated hierarchically from the small scale to the large scale. As shown in Figure 3, the attention feature A8 is combined with the compressed feature Fc4 to propagate the small-scale attention features to the next scale, and the subsequent attention features A4, A2, and A0 are obtained by the same process. Because the attention matching and replacement result of the preceding level always has a more accurate structural expression, and the compressed feature of the following level always carries more detail, the multilevel attention propagation scheme prompts the network to keep the image structure accurate at multiple scales while continuously enriching the details. Compared with current single-level attention-based image inpainting schemes [14, 15], our multilevel scheme obtains richer deep features.

At the same time, unlike the coarse-to-fine multistage schemes of current mainstream methods, we aim to achieve fine-grained image reconstruction in a single stage. To this end, we also propose a composite granular discriminator network that imposes global semantic constraints and nonspecific local dense constraints on the image inpainting process. The global semantic constraint is implemented by a global discriminator whose output is a single value evaluating the overall realism of the image. The nonspecific local dense constraint is implemented by a local dense discriminator, which performs dense discrimination over multiple overlapping local areas of the image; this dense local discrimination is therefore well suited to irregular damaged areas. Extensive experiments on multiple datasets, including human faces, building facades, and natural images, show that the proposed multilevel attention progression-driven generative inpainting method produces results with higher image quality than existing methods.

Aiming at the incorrect reconstructions still produced by existing image inpainting methods, this paper proposes an image inpainting algorithm based on multilevel attention progression. The encoder-decoder used in typical inpainting schemes adopts two downsampling layers, flat convolutions, and subsequent upsampling, so the bottleneck area of the encoding-decoding network still has a size of 64 × 64. Neither dilated convolution, the coarse-to-fine structure, nor single-stage attention matching and replacement can effectively use the context information of the 64 × 64 bottleneck feature. Accordingly, this paper proposes multilevel attention progression. First, the bottleneck feature is further scaled into multiple small-scale features of 8 × 8, 16 × 16, and 32 × 32. Then, attention-based feature matching and propagation starts from the smallest scale: the 8 × 8 attention is propagated to 16 × 16, and so on up to 64 × 64. In this way, high-level image features, including structure and detail features, can spread smoothly across multiple scales, and the problem that the bottleneck area is too large to make full use of context information is solved. A minimal sketch of this schedule is given below.
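The schedule can be made concrete with a short NumPy sketch. This is illustrative only: the real network applies attention matching and replacement at each scale, which is reduced here to a pass-through, and all names are ours, not from the paper's code.

```python
import numpy as np

def resize(feature, size):
    """Nearest-neighbour stand-in for the bilinear feature scaling."""
    h, w, _ = feature.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return feature[rows][:, cols]

bottleneck = np.random.rand(64, 64, 128)   # 64 x 64 bottleneck feature
scales = [8, 16, 32, 64]                   # Fc8, Fc4, Fc2, Fc0 (small -> large)

attention = None
for s in scales:
    fc = resize(bottleneck, s)             # compressed feature at scale s
    if attention is not None:              # propagate the previous, smaller scale
        fc = np.concatenate([fc, resize(attention, s)], axis=-1)
    attention = fc                         # stand-in for attention matching
print(attention.shape)                     # (64, 64, 512): channels grow per level
```

The loop order encodes the key design choice: each larger scale sees the upsampled attention result of the scale before it, so structure found at 8 × 8 constrains all subsequent levels.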

In summary, the contributions of this paper are as follows: (1) An end-to-end image inpainting model is proposed that encodes the full-resolution image context, compresses the extracted high-level features into multiscale compact features, and drives the compact features to perform multilevel attention feature progression in order of scale, realizing the full propagation of high-level features, including structure and details, through the network. (2) A composite granularity discriminator is proposed that imposes global semantic constraints and nonspecific local dense constraints on images, so that image inpainting can produce high-quality fine-grained reconstructions in a single forward pass.

The rest of this paper is organized as follows: (1) Section 2 reviews related work on the two types of image inpainting methods, traditional methods and deep learning methods. (2) Section 3 presents the proposed multilevel attention dissemination network. (3) Section 4 describes the experimental results on the Places2, CelebA-HQ, and Facade datasets. (4) Section 5 concludes our work and discusses future work.

2. Related Works

2.1. Traditional Image Inpainting Method

Traditional methods that use low-level, nonsemantic image information can be divided into two categories: diffusion-based methods and image block-based methods. Diffusion-based methods use mechanisms such as distance fields to spread image information from neighboring pixels into the target area; they are very effective for small or narrow defect areas such as scratches. When the defect area is too large or the texture varies greatly, they usually produce obvious visual artifacts. Image block-based methods were first used for texture synthesis and later extended to image inpainting. Compared with diffusion-based methods, they can repair more complex scene images [16]. Generally, image block-based methods iteratively sample similar information from the nondefective area of the same image, or from an external image library, to fill the defect area. Since the similarity score of every target-source pair must be calculated, this type of method requires substantial computation and memory. PatchMatch [17] is an effective image block-based method: it addresses this cost with a fast nearest-neighbor algorithm, which greatly accelerates traditional algorithms and achieves higher-quality results. Image block-based methods assume that the texture of the repaired area can be found elsewhere in the image. However, this assumption does not always hold, which limits the scope of application of these methods. In addition, lacking a high-level semantic understanding of the image, image block-based methods cannot reconstruct semantically reasonable results for highly patterned damaged images such as faces. In short, traditional repair methods, whether based on diffusion or on image blocks, cannot perceive the high-level semantics of the image.

2.2. Image Inpainting Method via Deep Learning

In recent years, learning-based image inpainting methods have learned high-level semantic representations from large-scale data, which greatly improves the inpainting effect [18, 19]. Context Encoder [20] uses an autoencoder structure; by minimizing the pixel-level reconstruction loss and the adversarial loss, it repairs a 64 × 64 rectangular defect in the center of a 128 × 128 image. The encoder maps the damaged image to a high-level feature space, which the decoder uses to reconstruct the complete output image. However, due to the information bottleneck of the channel-wise fully connected layer and the lack of constraints on local areas of the image, the reconstructed area often exhibits obvious visual artifacts. By reducing the number of downsampling layers and replacing the channel-wise fully connected layer with a series of dilated convolutional layers, the information bottleneck of the context encoder was alleviated to a certain extent [8]. Iizuka et al. [8] also introduced a local discriminator to improve image quality, but their method requires complex postprocessing steps, such as Poisson blending, to enforce color consistency near the hole boundary. Song et al. [9] and Yu et al. [2] introduced a coarse-to-fine convolutional network configuration into image inpainting: the first-stage network produces a rough estimate of the damaged area, and the second-stage network uses an attention mechanism or a feature-block exchange operation to search for the most similar feature blocks in the image context and replace the feature blocks in the missing area, yielding a detailed output. However, these two solutions are not very effective at repairing irregular damaged areas. Wang et al. [1] proposed a multicolumn generation network for image inpainting, designed a confidence-driven reconstruction loss, and adopted an Implicit Diversified Markov Random Field (ID-MRF) regularization scheme to enhance local details; it achieves good results on both rectangular and irregular masks. Liu et al. [21] introduced partial convolution into image inpainting: through masking and renormalization, only the valid pixels in the nondamaged area are used, which effectively suppresses artifacts such as color discrepancy and blur caused by convolution.

3. Multilevel Attention Dissemination Network

The structural disorder in the results of current image inpainting methods arises mainly because, when reconstructing the damaged area, the network cannot make full use of the information in the nondamaged area to infer the damaged content. This paper proposes an image inpainting network based on multilevel attention progression. It encodes the full-resolution image context, extracts multilevel, multiscale high-level semantic features in the downstream bottleneck part of the network, performs feature-block replacement according to the attention computed at each level, and uses each level's result to guide the feature matching and replacement of the next level, so as to make full use of high-level features, including structural features. At the same time, a composite granularity discriminator network is proposed to impose global semantic constraints and nonspecific local dense constraints on the inpainting process, achieving simultaneous coarse-grained and fine-grained reconstruction in a single stage.

As shown in Figure 4, the proposed multilevel attention dissemination network consists of two parts: (a) a multilevel attention dissemination generator $G$ and (b) a compound discriminator $D$. The generator $G$ is an autoencoder improved for the image inpainting task; it reconstructs the damaged area of the image through the encoding process, the multilevel attention progression process, and the decoding process. The composite discriminator network $D$ penalizes $G$ by discriminating generated images as false, thereby prompting $G$ to generate realistic images. We describe the learning from the damaged image to the complete image as a mapping function $G: \mathcal{I}_{m} \rightarrow \mathcal{I}$ that maps the damaged image manifold $\mathcal{I}_{m}$ to the complete image manifold $\mathcal{I}$. To simplify notation, we also use $G$ and $D$ to denote the functional mappings of the respective networks.

3.1. Generator

As shown in Figure 5, the multilevel attention progression generator is mainly composed of three subnetworks: a feature extraction network, a multilevel attention progression network, and an upsampling network. Let $I_{in}$ and $I_{out}$ denote the input and output of the multilevel attention progression generator in Figure 4. In the shallow feature extraction stage, the shallow feature $F_s$ is extracted as follows:

$$F_s = \phi_E(I_{in}),$$

where $\phi_E$ is the encoder network. The encoder first performs flat convolution and then uses downsampling and convolution operations to compress and encode the damaged image. Feature refinement is then performed on the extracted useful local features as follows:

$$F_r = \phi_B(F_s),$$

where $\phi_B$ is the bottleneck-area network, composed of a cascade of four dilated convolutional layers with kernel size 3 × 3 and dilation rates 2, 4, 8, and 16. Next, multilevel attention progression is performed. Its first step is to scale the refined high-level features into multilevel compressed features as follows:

$$F_{c_n} = \psi_n(F_r), \quad n \in \{0, 2, 4, 8\},$$

where $\psi_n$ is the feature scaling operation and $n$ is the scaling rate, meaning that the feature size is scaled to $1/n$ of the original ($n = 0$ denotes no scaling). Subsequently, the compressed features undergo attention-based multilevel feature matching and progression, with the small-scale results guiding the subsequent processing:

$$A_8 = \Lambda_8(F_{c_8}), \qquad A_n = \Lambda_n\big(F_{c_n} \oplus A_{\mathrm{prev}(n)}\!\uparrow\big), \quad n \in \{4, 2, 0\},$$

where $\oplus$ represents channel-dimension concatenation, $A_{\mathrm{prev}(n)}\!\uparrow$ is the upsampled attention feature of the preceding (smaller) scale, and $\Lambda_n$ is the matching, replacement, and progression operation performed on the feature with compression rate $n$; more details are given in Section 3.2.

Finally, after multilevel attention feature replacement and progression, an upsampling network is used to convert the high-level feature map into a complete output image:

$$I_{out} = \phi_D(A_0),$$

where $\phi_D$ is the decoder network, which upsamples the feature twice to obtain the complete reconstructed image. A minimal sketch of this pipeline follows.
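To make the pipeline concrete, the following sketch wires the three subnetworks together with the tf.keras API (the paper reports TensorFlow 1.8; we use the modern Keras interface for readability). The kernel size and dilation rates follow the text; the layer widths, activations, and the 4-channel input (image plus mask) are our assumptions, and the attention module is reduced to a pass-through.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv(x, filters, stride=1, rate=1):
    return layers.Conv2D(filters, 3, strides=stride, dilation_rate=rate,
                         padding="same", activation="relu")(x)

inputs = layers.Input((256, 256, 4))          # damaged RGB image + binary mask
x = conv(inputs, 32)                          # flat convolution
x = conv(x, 64, stride=2)                     # downsample 256 -> 128
x = conv(x, 128, stride=2)                    # downsample 128 -> 64 (phi_E)

for rate in (2, 4, 8, 16):                    # bottleneck phi_B: four cascaded
    x = conv(x, 128, rate=rate)               # 3x3 dilated convolutions

# The multilevel attention progression of Section 3.2 operates here on the
# 8/16/32/64 feature pyramid; a pass-through stands in for it in this sketch.

x = layers.UpSampling2D()(x)                  # decoder phi_D: 64 -> 128
x = conv(x, 64)
x = layers.UpSampling2D()(x)                  # 128 -> 256
x = conv(x, 32)
outputs = layers.Conv2D(3, 3, padding="same", activation="tanh")(x)

generator = tf.keras.Model(inputs, outputs)
```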

3.2. Feature Matching and Dissemination

We use a state-of-the-art attention feature matching scheme [11]. Attention is usually obtained by calculating the similarity between image blocks or feature blocks inside and outside the missing area. Related features are then transferred; that is, image blocks or feature blocks from the image context are weighted by this similarity and copied into the missing area.

As shown in Figure 6, $\Lambda_n$ first learns the region affinity from the compressed feature $F_{c_n}$; that is, it extracts feature blocks from $F_{c_n}$ and then calculates the cosine similarity between the feature blocks inside and outside the damaged area:

$$s_{i,j} = \left\langle \frac{b_i}{\|b_i\|}, \frac{h_j}{\|h_j\|} \right\rangle,$$

where $h_j$ is the $j$th feature block extracted from inside the damaged area and $b_i$ is the $i$th feature block extracted from outside it. The softmax function is then applied to the similarities to obtain the attention score of each block:

$$\alpha_{i,j} = \frac{\exp(s_{i,j})}{\sum_{i} \exp(s_{i,j})}.$$

After obtaining the attention scores from the high-level feature map, the context feature blocks, weighted by the attention scores, are used to fill the damaged area:

$$\tilde{h}_j = \sum_{i} \alpha_{i,j} \, b_i,$$

where $b_i$ is the $i$th feature block extracted from outside the damaged area and $\tilde{h}_j$ fills the $j$th feature block of the missing area. All these operations can be expressed as convolution operations for end-to-end training [2]. The attention features obtained at each level are upsampled to guide the attention propagation of the next level. This design ensures the consistency of the image structure at multiple scales and enriches the image details level by level. It is worth noting that the most compact compressed feature in our scheme is only 8 × 8 in size, so no additional dilated convolution is needed for long-distance feature borrowing during attention matching. The sketch below illustrates the matching step.
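A minimal NumPy rendering of this matching step follows, treating each spatial position as a 1 × 1 feature block for brevity; the paper implements the same computation as convolutions, and the multiscale propagation is omitted here. All names are illustrative.

```python
import numpy as np

def attention_fill(feature, mask):
    """feature: (H, W, C) compressed feature; mask: (H, W), 1 inside the hole."""
    h, w, c = feature.shape
    flat = feature.reshape(-1, c).copy()
    inside = mask.reshape(-1).astype(bool)

    context = flat[~inside]                       # blocks b_i outside the hole
    holes = flat[inside]                          # blocks h_j inside the hole
    nb = context / (np.linalg.norm(context, axis=1, keepdims=True) + 1e-8)
    nh = holes / (np.linalg.norm(holes, axis=1, keepdims=True) + 1e-8)

    sim = nh @ nb.T                               # cosine similarity s_{i,j}
    sim = np.exp(sim - sim.max(axis=1, keepdims=True))
    score = sim / sim.sum(axis=1, keepdims=True)  # softmax attention scores

    flat[inside] = score @ context                # weighted copy into the hole
    return flat.reshape(h, w, c)

mask = np.zeros((8, 8))
mask[2:5, 2:5] = 1                                # a 3x3 damaged region
filled = attention_fill(np.random.rand(8, 8, 32), mask)
```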

3.3. Discriminator Network

As a complement to the generator network, the composite discriminator network judges whether the generated image is sufficiently realistic. In image inpainting, a high-quality image depends not only on the overall characteristics of the image but also on the characteristics of its local objects. Global and local discriminators constrain the global image and the local damaged areas, respectively; accordingly, we design a composite discriminator that imposes global semantic constraints and nonspecific local dense constraints.

As shown in Figure 4(b), the global semantic constraint and the nonspecific local dense constraint are implemented by the global discriminator $D_g$ and the nonspecific local dense discriminator $D_l$, respectively. The global discriminator $D_g$ is composed of convolutional layers and a fully connected layer, and its output is a single value evaluating the overall realism of the image. The nonspecific local dense discriminator $D_l$ is similar to the PatchGAN [22] structure and is composed of five strided convolutions (kernel size 5, stride 2). Its input is the image, and its output is a three-dimensional feature map of shape $h \times w \times c$, where $h$, $w$, and $c$ are the height, width, and number of channels, respectively. The discriminator loss is then applied directly to every element of the last-layer feature map, forming a generative adversarial network over different local positions of the input image. The global discriminator and the nonspecific local dense discriminator are functionally complementary: $D_g$ imposes global constraints and promotes a natural transition between the damaged and nondamaged areas of the generated image at the global level, while $D_l$ performs dense, overlapping judgments on multiple local areas so that the image acquires detailed local textures. A sketch of both branches follows.
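The two branches can be sketched as follows. The five stride-2, kernel-5 convolutions of $D_l$ follow the text; the channel widths and the exact head of each branch are our assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def down(x, filters):
    x = layers.Conv2D(filters, 5, strides=2, padding="same")(x)
    return layers.LeakyReLU(0.2)(x)

def build_global_discriminator():
    """D_g: convolutions + fully connected layer -> one realism score."""
    inputs = layers.Input((256, 256, 3))
    x = inputs
    for f in (64, 128, 256, 256):                 # widths are assumptions
        x = down(x, f)
    score = layers.Dense(1)(layers.Flatten()(x))
    return tf.keras.Model(inputs, score)

def build_local_dense_discriminator():
    """D_l: five stride-2, kernel-5 convolutions -> (h, w, c) score map."""
    inputs = layers.Input((256, 256, 3))
    x = inputs
    for f in (64, 128, 256, 256, 512):            # five strided convolutions
        x = down(x, f)                            # 256 -> 8 after five halvings
    return tf.keras.Model(inputs, x)              # loss applied per map element

d_global = build_global_discriminator()
d_local = build_local_dense_discriminator()
```

Because every element of the $D_l$ output map has a receptive field covering one overlapping local region, penalizing each element separately realizes the dense, nonspecific local constraint.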

3.4. Loss Function

The loss function consists of three terms: (1) the adversarial loss $L_{adv}$; (2) the feature matching loss $L_{fm}$; (3) the reconstruction loss $L_{rec}$. The overall objective function can be expressed as follows:

$$L = L_{adv} + \lambda_{fm} L_{fm} + \lambda_{rec} L_{rec},$$

where $\lambda_{fm}$ and $\lambda_{rec}$ are the weighting hyperparameters of the corresponding loss terms.

3.4.1. Adversarial Loss Function

Our method uses an improved Wasserstein GAN [23], and the adversarial loss is applied to both the $D_g$ network and the $D_l$ network, ultimately influencing how the generator reconstructs the damaged image. The output of the composite discriminator network measures how closely the output image of the generator resembles the real image, and this penalty prompts the generator to produce more realistic images. Since the composite discriminator is composed of $D_g$ and $D_l$, the adversarial loss for each discriminator $D \in \{D_g, D_l\}$ takes the improved Wasserstein form

$$L_{adv} = \mathbb{E}\big[D(G(I_{in}))\big] - \mathbb{E}\big[D(I_{gt})\big] + \lambda_{gp}\,\mathbb{E}\Big[\big(\big\|\nabla_{\hat{I}} D(\hat{I})\big\|_2 - 1\big)^2\Big],$$

where $I_{gt}$ is the ground-truth image, $\hat{I}$ is sampled along straight lines between real and generated images, and $\lambda_{gp}$ weights the gradient penalty; the generator correspondingly minimizes $-\mathbb{E}[D(G(I_{in}))]$.

3.4.2. Feature Matching Loss Function

The feature matching loss compares the activation maps of the intermediate layers of the discriminator, forcing the generator to produce feature representations similar to those of the real image and thereby stabilizing training; it is similar to the perceptual loss [24]. Unlike the perceptual loss, which compares activation maps of the ground-truth and output images obtained from a pretrained VGG network, the feature matching loss compares the activation maps of the discriminator's intermediate layers. We define the feature matching loss as

$$L_{fm} = \mathbb{E}\left[\sum_{i=1}^{T} \frac{1}{N_i}\left\| D^{(i)}(I_{gt}) - D^{(i)}(I_{out}) \right\|_1\right],$$

where $T$ is the index of the final convolutional layer of the discriminator, $N_i$ is the number of elements in the $i$th activation layer, $D^{(i)}(I_{gt})$ is the activation map of the $i$th discriminator layer for the ground-truth image, and $D^{(i)}(I_{out})$ is the activation map of the $i$th discriminator layer for the output image.

3.4.3. Reconstruction Loss Function

Image inpainting must ensure not only that the restored image is semantically realistic but also that the image is accurately reconstructed at the pixel level. Therefore, for the pixel-level reconstruction process, we define the reconstruction loss as the pixel-wise $\ell_1$ distance between the output and the ground truth:

$$L_{rec} = \left\| I_{out} - I_{gt} \right\|_1.$$

A combined sketch of the three losses is given below.
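Under the stated losses, the training objectives for one batch might be sketched as follows. The gradient-penalty term of the improved WGAN is omitted for brevity, and the loss weights are placeholders, since the source does not give their values.

```python
import tensorflow as tf

def d_loss(d_real, d_fake):
    """WGAN-style critic loss (gradient penalty omitted in this sketch)."""
    return tf.reduce_mean(d_fake) - tf.reduce_mean(d_real)

def g_adv_loss(d_fake):
    return -tf.reduce_mean(d_fake)

def feature_matching_loss(feats_real, feats_fake):
    """l1 distance between intermediate discriminator activation maps."""
    return tf.add_n([tf.reduce_mean(tf.abs(r - f))
                     for r, f in zip(feats_real, feats_fake)])

def reconstruction_loss(i_gt, i_out):
    return tf.reduce_mean(tf.abs(i_gt - i_out))   # pixel-wise l1 distance

def generator_loss(d_fake, feats_real, feats_fake, i_gt, i_out,
                   lambda_fm=10.0, lambda_rec=10.0):   # placeholder weights
    return (g_adv_loss(d_fake)
            + lambda_fm * feature_matching_loss(feats_real, feats_fake)
            + lambda_rec * reconstruction_loss(i_gt, i_out))
```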

3.5. Algorithm Flow

The training process of the network is shown in Algorithm 1; a schematic Python rendering follows the listing.

Input: Damaged Image $I_{in}$, Random Mask $M$.
Step 1: Sample a batch of images from the training dataset, generate a random mask for each image in the batch, and obtain the damaged images;
Step 2: if stage == 1
Step 3: Training Generator = True; Training Discriminator = False
Step 4: The number of epochs is 30; every 5000 iterations, the generator network is updated under the feature matching loss and the reconstruction loss, and the repaired image is obtained
Step 5: elseif stage == 2
Step 6: Training Generator = False; Training Discriminator = True
Step 7: The number of epochs is 5; every 2000 iterations, the discriminator network is updated under the adversarial loss
Step 8: elseif stage == 3
Step 9: Training Generator = True; Training Discriminator = True
Step 10: The number of epochs is 5; every 2000 iterations, the generator and discriminator networks are updated under the adversarial loss, the feature matching loss, and the reconstruction loss.
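The three stages can be rendered schematically in Python as below. The data pipeline and update steps are hypothetical stubs, and the listing is ambiguous about whether "every N iterations" denotes an update or a checkpoint interval, so it is treated here as a checkpoint interval.

```python
STAGES = [
    # (epochs, interval, train_generator, train_discriminator)
    (30, 5000, True,  False),   # stage 1: G only, under L_fm + L_rec
    (5,  2000, False, True),    # stage 2: D only, under L_adv
    (5,  2000, True,  True),    # stage 3: joint training, all three losses
]

def load_batches():                               # hypothetical data stub
    return range(10000)

def train_step(batch, train_g, train_d, stage):   # hypothetical update stub
    pass

def save_checkpoint(stage, epoch, iteration):     # hypothetical stub
    pass

for stage, (epochs, interval, train_g, train_d) in enumerate(STAGES, start=1):
    for epoch in range(epochs):
        for it, batch in enumerate(load_batches()):   # random masks per batch
            train_step(batch, train_g, train_d, stage)
            if it % interval == 0:
                save_checkpoint(stage, epoch, it)
```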

4. The Experimental Results and Analysis

4.1. Experimental Settings
4.1.1. Datasets

We chose three datasets commonly used for image inpainting tasks to validate the proposed model. The dataset split is shown in Table 1.
(i) Places2 [25]: a dataset released by MIT, containing more than 8 million images from 365 scene categories.
(ii) CelebA-HQ [26]: a high-quality face dataset derived from the CelebA dataset.
(iii) Facade [27]: a collection of building facades from different cities around the world.

4.1.2. Parameters Setting

The program code of the proposed method was developed in Python on the Windows 10 operating system. The deep learning platform used for development and testing is TensorFlow v1.8 with cuDNN v7.0 and CUDA v9.0. The core hardware configuration is an Intel 8700 3.20 GHz CPU and a 12 GB NVIDIA TITAN XP GPU. We use the Adam optimizer to train the model with a batch size of 6, with $\beta_1$ and $\beta_2$ set to 0 and 0.9, respectively. The learning rate is set to a fixed value in the initial training stage, and a reduced learning rate is then used to fine-tune the model. During training, all images in the training set are scaled to 256 × 256. The trained model can run on the CPU or GPU regardless of the defect size. The average running time of the repair process is 1.5 seconds on the Intel(R) CPU and 0.2 seconds on the NVIDIA(R) TITAN XP GPU. All experimental results in this paper come directly from the trained model output, without any postprocessing. The optimizer configuration is sketched below.
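The optimizer settings can be expressed as follows, sketched with the tf.keras API for readability (the paper used TensorFlow 1.8); the learning-rate value is a hypothetical placeholder, since the source omits it.

```python
import tensorflow as tf

BATCH_SIZE = 6
LEARNING_RATE = 1e-4   # hypothetical placeholder; the source omits the value

# Adam with beta_1 = 0 and beta_2 = 0.9, as reported in Section 4.1.2.
g_optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE,
                                       beta_1=0.0, beta_2=0.9)
d_optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE,
                                       beta_1=0.0, beta_2=0.9)
```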

We chose several state-of-the-art methods for comparison, including PatchMatch [17], CA [2], and MC [1].
(i) PatchMatch (PM) [17]: a typical image block-based method that copies similar image blocks from the surrounding context.
(ii) CA [2]: a two-stage image inpainting model that uses high-level contextual attention features.
(iii) MC [1]: an image inpainting model with a confidence-driven reconstruction loss and implicit diversified Markov random field regularization to enhance local details.

4.2. Experimental Results

We carried out qualitative and quantitative comparisons between our method and the three classic mainstream schemes to demonstrate the superiority of the proposed method.

Qualitative comparison: Figures 7–9 show the results of the proposed method and the comparison methods on the Places2, Facade, and CelebA-HQ datasets, respectively. In most cases, the proposed method is more accurate and reasonable in structural reconstruction, and compared with the other methods, it reconstructs finer detail textures.

Quantitative comparison: We use PSNR, SSIM, and the average loss to objectively measure the quality of the repair results. PSNR and SSIM roughly reflect the model's ability to reconstruct the original image content and provide a good approximation of human visual perception. The average loss directly measures the distance between the reconstructed image and the ground-truth image and is a very practical image quality metric. As shown in Table 2, our method achieves the best results on the Places2, CelebA-HQ, and Facade datasets: the SSIM and PSNR are the highest, and the average loss is the lowest. The sketch below shows how these metrics can be computed.
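The paper does not specify its evaluation code; one common way to compute the three metrics is sketched below with scikit-image (the `channel_axis` argument requires a recent scikit-image version, and the average loss is computed here as the mean absolute pixel error, which is an assumption).

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(ground_truth, restored):
    """Both inputs: uint8 arrays of shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(ground_truth, restored)
    ssim = structural_similarity(ground_truth, restored, channel_axis=-1)
    mean_l1 = np.mean(np.abs(ground_truth.astype(np.float64)
                             - restored.astype(np.float64)))
    return psnr, ssim, mean_l1
```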

4.3. Component Analysis

To verify the effectiveness of the multilevel attention mechanism and the composite granularity discriminator network, we used the average loss as the performance reference (the smaller the average loss, the better the performance) and conducted a comparative quantitative study. The results are shown in Table 3, where $A_8$ to $A_0$ denote the attention components, Single-D is a single global discriminator, and Cg-D is the compound granularity discriminator proposed in this paper.

From Table 3, we can see that multilevel attention progression greatly improves network performance. Moreover, the dense constraints that the composite granular discriminator imposes on global semantics and nonspecific local areas improve performance further.

4.4. Generalization Application Research

To further verify the generalization ability of our method, we also study the practical application of object removal with the proposed model. As shown in Figure 10(a), we attempt to remove the glasses in a face image; the method successfully removes the glasses and reconstructs clear, natural human eyes in the glasses area. In Figure 10(b), our model removes a large area of the face and reconstructs a reasonable result. It is worth noting that the faces in Figures 10(a) and 10(b) are not frontal, and nonfrontal images make up only a small portion of the training set, which further demonstrates the generalization ability of the model. More examples of successful removal of specific objects and reconstruction of high-quality results are shown in Figure 10(c).

5. Conclusions

This paper proposes an image inpainting network based on hierarchical attention progression. To address structural disorder and semantic object blur in inpainting results, we perform multiscale compression and multilevel attention feature progression on the high-level semantic features produced by the encoder, so that high-level features, including structure and details, are fully utilized. Moreover, to reconstruct coarse-grained and fine-grained content simultaneously in a single stage, we propose a composite granular discriminator network that imposes global semantic constraints and nonspecific local dense constraints on the inpainting process. Extensive experiments show that our method produces higher-quality repair results than classic mainstream methods. In future work, we plan to reduce the complexity and running time of the proposed algorithm.

Data Availability

The data used to support this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors would like to acknowledge Places2, CelebA-HQ, and Facade datasets, which allowed them to train and evaluate the proposed model. This work was supported by grants of the Natural Science Foundation of Hunan Province of China (No. 2020JJ4623), and the Changsha Major Science and Technology Projects (Nos. KQ2102007, KQ1703018, and KQ1706064).