Abstract

Image inpainting is a research hotspot in computer vision and image processing. Image inpainting methods based on deep learning have achieved notable results, but it remains difficult to obtain ideal results on images in which global and local attributes are strongly related. In particular, when repairing large defect areas, the semantic rationality, structural coherence, and detail accuracy of the results still need improvement. To address these shortcomings, this study proposes an improved image inpainting model based on a fully convolutional neural network and a generative adversarial network. A novel inpainting network serves as the generator to repair the defective image, and structural similarity is introduced as the reconstruction loss to supervise and guide model learning from the perspective of the human visual system. Improved global and local context discriminator networks are used as context discriminators to judge the authenticity of the repair results. Combined with the adversarial loss, a joint loss is proposed to supervise model training, which makes the content of the repaired area real and natural and consistent in attributes with the whole image. To verify the effectiveness of the proposed model, its inpainting results are compared with current mainstream inpainting algorithms on the CelebA-HQ dataset using subjective and objective indicators. The experimental results show that the proposed method improves semantic rationality, structural coherence, and detail accuracy; the model better understands the high-level semantics of the image and grasps context and detail information more accurately.

1. Introduction

Images are one of the most important means of expressing and transmitting information for human beings, and they play an irreplaceable role in human communication. However, losses inevitably occur during image transmission and storage, degrading the quality of the presented information. For digital images, the most important manifestation of such defects is pixel loss. Missing information seriously affects the presentation of image content, because an image is nonstructural information and its parts are not independent of one another; even a small defect gives the viewer an intuitive impression of poor overall quality. Clearly, images with content defects cannot be presented to users directly. With the continuing spread of image acquisition equipment, the number of images of all kinds grows exponentially every day, so researching and designing automated methods and technologies that can intelligently repair image defects is imperative [1].

According to the repair strategy, traditional algorithms are divided into structure diffusion methods, texture synthesis methods, and exemplar-based methods [2]. Methods based on structure diffusion imitate the contour-to-detail process used by craftsmen when restoring artworks [3]: boundary continuity and local-area consistency are given priority, and known information in the image to be repaired is diffused into the defect area through algorithms such as partial differential equations. These methods work well on small nontextured regions but struggle to achieve an ideal effect for large-area defects, large object removal, and similar tasks. Texture synthesis and exemplar-based inpainting achieve good results in reconstructing texture detail [4]. However, they still fall short in understanding high-level image semantics and grasping global structure; with limited information, the inpainted region is often inconsistent with the global semantics of the image, and the result is unsatisfactory.

Deep learning has gradually been applied to image processing tasks, in which image inpainting is formulated as conditional image generation. The context encoder (CE) [2] is one of the earliest inpainting methods based on a convolutional neural network (CNN). It trains an encoder-decoder network and designs its loss function following the idea of adversarial networks. Compared with traditional algorithms, it achieves a good inpainting effect. However, limited by its network structure, this method only suits fixed-size, low-resolution images, and the repair traces are obvious [5].

Iizuka et al. [6] used a fully convolutional network as the framework, employing dilated convolutions and a double-discriminator network to improve the context encoder so that the repair network can handle images with arbitrary irregular shapes and large defect areas; however, the results require postprocessing to reach an ideal effect, which increases the cost of inpainting and breaks the integrity of the network. Yu et al. [7] proposed a coarse-to-fine network structure and introduced an attention mechanism, which improved the repair effect to a certain extent, but the results are still unsatisfactory for large-area defects. Wang et al. [8] proposed a generative multicolumn convolutional network that uses convolution kernels of different sizes to fully extract features and designed a novel confidence-driven inpainting algorithm; it applies many techniques to improve inpainting quality and achieves excellent visual effects, but when the dataset contains many kinds of objects or scenes, the repair effect is not ideal, the network parameters are difficult to fit, and the structure and texture of the results are blurred. By expanding the number of input channels, Portenier et al. [9] introduced contour constraints, color constraints, and other artificial intervention information into the network so that preset conditions can influence the inpainting results. Humans are the ultimate perceivers of images, and the most direct and reliable way to evaluate image quality is human subjective evaluation. However, subjective evaluation is easily affected by individual factors such as the observer's experience, mood, age, and esthetic taste. To reduce this influence, a large number of observers are usually invited to score the quality of the same image, and the mean opinion score (MOS) is used as the subjective score of the image.

Aiming at the problems of existing inpainting methods [10, 11] and building on the design ideas of U-Net [12] and the generative adversarial network [13], this study draws on mature and effective techniques from image processing to improve the context encoding network in network structure, loss constraints, and training strategy, and proposes an inpainting model with consistent global and local attributes. The experimental results show that the proposed method makes clear progress on both subjective and objective indicators; the inpainting results are globally and locally consistent, and color reconstruction and local detail are outstanding, better matching human visual perception.

2. The Proposed Algorithm

For large-area damage, an inpainting model with consistent global and local attributes is proposed. The model uses a novel network to repair the input image with its defect mask and uses global and local context discrimination networks as auxiliary networks. During training, the weight learning of the inpainting network is constrained by the adversarial loss so that the network learns to repair images realistically. The model is trained in stages to accelerate the fitting of network parameters and improve inpainting quality. The proposed model is shown in Figure 1.

2.1. The Image Inpainting Network

The image inpainting network is based on a convolutional neural network and constructed as an encoder-decoder. Its input is a four-channel composite image: a binary single-channel mask (1 marks pixels to be repaired) indicating the defect area, concatenated with the three-channel RGB image. Its output is a three-channel RGB inpainting result. To strengthen the network's understanding of global semantics and its grasp of local details, this study introduces dilated convolutions and skip connections, both proven effective in image processing, and improves the inpainting effect through this structural design.
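A minimal sketch (not the authors' released code) of how such a four-channel input can be assembled: the binary mask is concatenated with the RGB image whose defect region has been blanked out. The fill value is an assumption here; the paper fills the defect area with the dataset's mean pixel value.

```python
import torch

def compose_input(image: torch.Tensor, mask: torch.Tensor,
                  fill_value: float = 0.5) -> torch.Tensor:
    """image: (N, 3, H, W) RGB in [0, 1]; mask: (N, 1, H, W) binary, 1 = defect."""
    corrupted = image * (1.0 - mask) + fill_value * mask  # blank the defect area
    return torch.cat([corrupted, mask], dim=1)            # (N, 4, H, W)
```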

To preserve inpainting quality and keep the input and output sizes of the network equal, this study replaces pooling layers, which may lose key information, with stride-2 convolutions and limits the number of times the input image is downsampled; however, this constrains the receptive field available to subsequent operations. We define the receptive field of a convolution kernel as

$$RF_{l} = \left( RF_{l-1} - 1 \right) \times s_{l} + k_{l}, \qquad (1)$$

wherein $RF_{l-1}$ is the receptive field of the upper layer's convolution kernel, $k_{l}$ is the kernel size of this layer, and $s_{l}$ is the convolution stride of this layer.
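As a small illustration (a hypothetical helper, not from the paper's code), formula (1) can be applied from the deepest layer back to the input to obtain the receptive field of one output unit:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, ordered input -> output."""
    rf = 1  # start from one unit of the final feature map
    for k, s in reversed(layers):
        rf = (rf - 1) * s + k  # formula (1), applied toward the input
    return rf

# Example: three 3x3 convolutions, the second with stride 2.
print(receptive_field([(3, 1), (3, 2), (3, 1)]))  # -> 9
```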

If the kernel's receptive field is too small, the network lacks a global view and cannot effectively understand high-level image semantics, which may make the repaired area inconsistent with the global semantics. A large kernel, however, brings an exponential growth in computation and makes model training harder. Dilated convolution injects holes into the kernel and fills them with zeros; the number of effective computation points stays the same, so the kernel size and receptive field increase while the amount of computation is unchanged. The equivalent kernel size of a dilated convolution is

$$k_{e} = k + (k - 1)(d - 1), \qquad (2)$$

where $k$ is the standard convolution kernel size, $d$ is the dilation rate, and $k_{e}$ is the equivalent kernel size of the dilated convolution. The receptive field sizes of ordinary and dilated convolution are compared in Figure 2.

Dilated convolution captures a larger input area than ordinary convolution. By setting the convolution stride to one and choosing the padding accordingly, the feature map output by each dilated convolution layer keeps its size, so the full feature information of that layer's input is preserved. Stacking dilated convolutions with multiplying dilation rates alleviates the problem of missing sampling points inherent in dilated convolution itself. This reduces the loss of feature information, lays a good foundation for generating repair textures, and meets the needs of the inpainting task, as shown in Table 1.
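A minimal PyTorch sketch of this point: with stride 1 and padding $d \times (k-1)/2$, a dilated convolution keeps the feature-map size while its equivalent kernel grows per formula (2). The channel count and dilation schedule are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 64, 64)
for d in (1, 2, 4, 8):                       # multiplying dilation rates
    conv = nn.Conv2d(64, 64, kernel_size=3, stride=1,
                     dilation=d, padding=d)  # padding = d * (3 - 1) // 2 = d
    x = conv(x)
    print(d, x.shape)                        # spatial size stays 64 x 64
```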

Deep learning usually extracts increasingly important features by deepening the network and mapping low-level features layer by layer [13]. However, some important features are lost during this layer-by-layer transmission, which weakens the correlation between deep features, directly limits the expressive power of the model, and can cause vanishing gradients, making the network model harder to train [14].

To solve these problems, He et al. [15] proposed the residual network, which adds short shortcuts, namely, skip connections, between nearby layers to build residual blocks that carry important feature information into the deep layers of the network. Because the residual skip connects only adjacent layers, it is called a short connection [16]. Ronneberger et al. [12] proposed U-Net, a long-connection architecture. Compared with a traditional autoencoder, the U-Net structure is fully symmetrical: long skip connections link the corresponding encoding and decoding stages, and low-level features are passed to the decoder by concatenation, so the decoder can fuse features of different scales [17]. To improve the detail of inpainting results, this method uses long skip connections. The structures of the autoencoder and U-Net [18] are compared in Figure 3.

As mentioned, the proposed model uses no pooling layer in the repair network. To compensate for the structural information otherwise lost, this study transfers low-level features to the decoding stage through long skip connections. The later experiments show that long skip connections effectively reduce training difficulty and improve inpainting quality.
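A toy encoder-decoder with one long skip connection, in the spirit of U-Net; this is a hedged illustration with made-up channel counts, not the paper's actual repair network.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(4, 32, 3, padding=1), nn.ReLU())
        self.down = nn.Conv2d(32, 64, 4, stride=2, padding=1)  # stride-2 instead of pooling
        self.up   = nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1)
        # after concatenating the skip features, channels double: 32 + 32
        self.dec1 = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(32, 3, 3, padding=1))

    def forward(self, x):
        e1 = self.enc1(x)            # low-level features, kept for the skip
        d = self.up(self.down(e1))   # compress, then restore resolution
        d = torch.cat([d, e1], dim=1)  # long skip: reinject structural info
        return self.dec1(d)

out = TinyUNet()(torch.randn(1, 4, 64, 64))  # -> (1, 3, 64, 64)
```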

To reduce information loss while compressing the image and extracting features, the number of compressions in the encoding stage is limited so that the image is reduced to only 1/8 of its original size. On this basis, the feature information is supplemented with dilated convolutions of different dilation rates and with long skip connections, ensuring that the model has enough feature information in the decoding stage to generate a clear and reasonable texture in the defect area and achieve an ideal inpainting effect.

2.2. The Image Discriminant Network

The image discriminant network is a parallel discriminant network composed of a global context and a local context discriminant network, which judges whether an image is real or repaired. A convolutional neural network compresses the image layer by layer into a small feature vector; a fully connected layer then fuses the network output into a continuous value corresponding to the probability that the image is real. Finally, a sigmoid function maps the value into the range [0, 1], representing the probability that the image is a real image rather than a repaired one.
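A hedged sketch of one context discriminator following this description: strided convolutions compress the image, a fully connected layer fuses the features, and a sigmoid maps the output to [0, 1]. Channel counts and depth are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContextDiscriminator(nn.Module):
    def __init__(self, in_size=64):
        super().__init__()
        layers, ch = [], 3
        for out_ch in (32, 64, 128, 256):   # compress layer by layer
            layers += [nn.Conv2d(ch, out_ch, 4, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
            ch = out_ch
        self.features = nn.Sequential(*layers)
        spatial = in_size // 2 ** 4          # halved once per strided conv
        self.fc = nn.Linear(ch * spatial * spatial, 1)

    def forward(self, img):
        f = self.features(img).flatten(1)    # fuse into a feature vector
        return torch.sigmoid(self.fc(f))     # probability the image is real
```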

The input of the global context discrimination network is the whole image scaled to a fixed resolution, and the output is the probability that the input is a real image [19]. Its role is to supervise the image repair network and ensure that the repaired area is consistent in context attributes with the whole image. Its structure is shown in Table 2.

The local context discriminant network has the same structure as the global one, except that, because its input images are smaller, the first convolution layer is removed. Its input is an image patch, one quarter the size of the whole image, containing the defect area; when the image is real, a random quarter-size block of the whole image is selected as the input instead. The role of the local context discrimination network is to enhance the detail of the repaired area and reduce the blurriness of the generated texture. Its structure is shown in Table 3.

The global and local context discrimination networks each perform their own duties and rely on the adversarial loss to help the inpainting network improve repair quality. Unlike Iizuka et al. [6], who concatenate the continuous outputs of the two subdiscriminant networks into a single vector before discriminating, the two subdiscriminant networks of this model each output their own result, and the final discrimination is computed as a weighted combination according to the weight of each subdiscriminant network. This not only helps the discriminant network fit but also improves its accuracy, indirectly improving inpainting quality [14]. Note that the subdiscriminator weights used in this study are empirical values obtained from repeated experiments; in practical applications, the weights can be adjusted to the characteristics of the dataset to achieve the desired training effect.
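The weighted fusion just described, sketched with hypothetical weights (the paper's empirical weights are not reproduced here):

```python
def fuse(p_global, p_local, w_global=0.5, w_local=0.5):
    """Combine the two subdiscriminator probabilities into one decision."""
    return w_global * p_global + w_local * p_local
```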

2.3. Loss Function

To enable the inpainting model to repair defective images and obtain a satisfactory repair effect, we use several loss functions to reduce the gap between the repaired result and the original image and improve the inpainting effect.

2.3.1. Reconstruction Loss

For two images $x$ and $y$, both of size $m \times n$, the mean square error is defined as

$$\mathrm{MSE}(x, y) = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left[ x(i, j) - y(i, j) \right]^{2}. \qquad (3)$$

According to formula (3), the traditional loss based on mean square error simply computes the squared pixelwise difference between the repaired image and the original and averages it over the whole image, which is not enough to express the intuitive response of the human visual system. Figure 4 compares the visual effect of an image after different processing: Figures 4(b) and 4(c) show the image with its gray values scaled to 0.9 of the original and the image after Gaussian blurring, respectively. To the human eye, the gray-adjusted image is clearly sharper than the Gaussian-blurred one and closer to the original. Measured by mean square error, however, the Gaussian-blurred image scores as closer to the original than the gray-adjusted one, which contradicts human visual perception. Therefore, if mean square error is used as the inpainting loss to guide the repair, the result may look good in the numbers but poor to the human eye.
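For reference, the elementwise computation of formula (3) is a one-liner on image tensors:

```python
import torch

def mse(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Mean square error of formula (3) over all pixels."""
    return ((x - y) ** 2).mean()
```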

To solve this problem, this study uses structural similarity (SSIM) as the inpainting loss to guide the repair and seeks a better restoration effect from the perspective of human visual perception. SSIM compares luminance, contrast, and structure, computed as in formulas (4), (5), and (6), respectively:

$$l(x, y) = \frac{2\mu_{x}\mu_{y} + C_{1}}{\mu_{x}^{2} + \mu_{y}^{2} + C_{1}}, \qquad (4)$$

$$c(x, y) = \frac{2\sigma_{x}\sigma_{y} + C_{2}}{\sigma_{x}^{2} + \sigma_{y}^{2} + C_{2}}, \qquad (5)$$

$$s(x, y) = \frac{\sigma_{xy} + C_{3}}{\sigma_{x}\sigma_{y} + C_{3}}, \qquad (6)$$

wherein $\mu_{x}$ and $\mu_{y}$ are the means of images $x$ and $y$, $\sigma_{x}^{2}$ and $\sigma_{y}^{2}$ are their variances, and $\sigma_{xy}$ is their covariance. $C_{1} = (K_{1}L)^{2}$ and $C_{2} = (K_{2}L)^{2}$ are two constants that avoid division by zero, where $L$ is the range of pixel values and $K_{1} = 0.01$ and $K_{2} = 0.03$ are the default values. The similarity between the two images is then defined as

$$\mathrm{SSIM}(x, y) = \left[ l(x, y) \right]^{\alpha} \left[ c(x, y) \right]^{\beta} \left[ s(x, y) \right]^{\gamma}. \qquad (7)$$

Let $\alpha = \beta = \gamma = 1$ and $C_{3} = C_{2}/2$; the contrast and structure terms can then be merged, and the final SSIM formula is

$$\mathrm{SSIM}(x, y) = \frac{\left( 2\mu_{x}\mu_{y} + C_{1} \right)\left( 2\sigma_{xy} + C_{2} \right)}{\left( \mu_{x}^{2} + \mu_{y}^{2} + C_{1} \right)\left( \sigma_{x}^{2} + \sigma_{y}^{2} + C_{2} \right)}. \qquad (8)$$
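A simplified, global-statistics implementation of formula (8), offered as a sketch: practical SSIM is usually computed over a sliding Gaussian window (for example, 11 x 11), and libraries such as scikit-image provide a full version.

```python
import torch

def ssim_global(x: torch.Tensor, y: torch.Tensor,
                data_range: float = 1.0) -> torch.Tensor:
    c1 = (0.01 * data_range) ** 2   # K1 = 0.01 (default)
    c2 = (0.03 * data_range) ** 2   # K2 = 0.03 (default)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(unbiased=False), y.var(unbiased=False)
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```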

Taking Figure 4 as an example, structural similarity is used to measure the similarity between the two adjusted images and the original. The similarity between the Gaussian-blurred image and the original (0.9260) is lower than that between the gray-reduced image and the original (0.9899), which is consistent with human visual perception.

Let the function $C(x, M_{c})$ represent the image inpainting network, where $x$ is the input image and $M_{c}$ is the defect mask of the same size as the input, with the pixels of the defect area in the binary single-channel mask set to one and all other pixels set to zero. The image inpainting loss is defined as

$$L_{\mathrm{SSIM}}(x, M_{c}) = 1 - \mathrm{SSIM}\left( C(x, M_{c}), x \right). \qquad (9)$$
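A hedged sketch of the reconstruction loss of formula (9): since SSIM is a similarity (1 means identical), minimizing 1 - SSIM pushes the repaired output toward the original. It reuses the `ssim_global` sketch defined above.

```python
def ssim_loss(completion, original):
    """Reconstruction loss of formula (9), using the ssim_global sketch above."""
    return 1.0 - ssim_global(completion, original)
```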

2.3.2. Adversarial Loss

In this study, the loss of the image discrimination network plays an important role in model training as the adversarial loss, turning the standard network optimization into a minimax optimization problem. The image discrimination network is represented by a function $D(x, M_{d})$. In the third stage of training, the inpainting network and the discrimination network are trained and optimized together. For the whole inpainting model, the optimization objective is defined as in formula (10):

$$\min_{C} \max_{D} \, \mathbb{E} \left[ \log D(x, M_{d}) + \log \left( 1 - D\left( C(x, M_{c}), M_{c} \right) \right) \right], \qquad (10)$$

wherein $M_{d}$ is a random mask and $\mathbb{E}$ is the expectation, approximated by the average over the images in a training batch.

The final joint optimization objective combines the inpainting loss and the adversarial loss, as shown in formula (11):

$$\min_{C} \max_{D} \, \mathbb{E} \left[ \lambda_{1} L_{\mathrm{SSIM}}(x, M_{c}) + \lambda_{2} \left( \log D(x, M_{d}) + \log \left( 1 - D\left( C(x, M_{c}), M_{c} \right) \right) \right) \right]. \qquad (11)$$

$\lambda_{1}$ and $\lambda_{2}$ are the weights of the inpainting loss and the adversarial loss, respectively; their reference values were determined empirically after many experiments. The joint objective is used in the third stage of training, the joint training of the inpainting network and the discrimination network, in which the inpainting model is fine-tuned to improve the inpainting effect.
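A minimal sketch of the generator's side of the joint objective of formula (11); the lambda values are placeholders, since the paper's empirical weights are determined experimentally, and `ssim_global` is the sketch defined earlier.

```python
import torch

def generator_joint_loss(completion, original, d_fake_prob,
                         lambda_rec=1.0, lambda_adv=0.01):
    rec = 1.0 - ssim_global(completion, original)   # formula (9)
    adv = -torch.log(d_fake_prob + 1e-8).mean()     # push D to call the fake "real"
    return lambda_rec * rec + lambda_adv * adv
```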

2.4. The Training of the Proposed Model

To speed up model fitting and improve inpainting quality, this study trains the image inpainting network and the image discrimination network alternately, in stages. The training process is as follows (a condensed code sketch follows this list).

Input: image with an additional defect mask. Output: repaired image.

(1) According to the batch size, images are randomly drawn from the training set and preprocessed, for example, by scaling and random flipping.
(2) The first stage of training:
(2.1) The training image is covered with a defect mask of random size, and the defect area is filled with the mean pixel value of the dataset.
(2.2) The processed training image is input into the image repair network; the image discrimination network does not participate in training at this stage.
(2.3) From the input and output images, formula (9) is used to calculate the inpainting loss and update the repair network parameters.
(3) The second stage of training:
(3.1) The training image is covered with a defect mask of random size, and the defect area is filled with the mean pixel value of the dataset.
(3.2) The repair network, with fixed parameters, repairs the image; the binary cross-entropy loss (BCE_Loss) is used to calculate the discrimination loss and update the discrimination network parameters.
(4) The third stage of training:
(4.1) The training image is covered with a defect mask of random size, and the defect area is filled with the mean pixel value of the dataset.
(4.2) The repair network and the discrimination network are trained jointly; the inpainting loss is calculated by formula (11), and the overall model is fine-tuned according to this joint loss.
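The condensed, hypothetical sketch of one step of each training stage referenced above. `inpaint_net`, `disc_net`, `opt_g`, `opt_d`, and `mean_pixel` are assumed names standing in for the components described; it reuses the `compose_input`, `ssim_global`, and `generator_joint_loss` sketches from earlier, and the local discriminator crop and weighted fusion are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def train_step(stage, inpaint_net, disc_net, opt_g, opt_d,
               images, masks, mean_pixel):
    inputs = compose_input(images, masks, fill_value=mean_pixel)
    if stage == 1:                              # stage 1: repair network alone
        out = inpaint_net(inputs)
        loss = 1.0 - ssim_global(out, images)   # formula (9)
        opt_g.zero_grad(); loss.backward(); opt_g.step()
    elif stage == 2:                            # stage 2: discriminator alone (BCE_Loss)
        with torch.no_grad():
            fake = inpaint_net(inputs)          # repair network is frozen
        p_real, p_fake = disc_net(images), disc_net(fake)
        loss = F.binary_cross_entropy(p_real, torch.ones_like(p_real)) + \
               F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake))
        opt_d.zero_grad(); loss.backward(); opt_d.step()
    else:                                       # stage 3: joint fine-tuning
        out = inpaint_net(inputs)
        loss = generator_joint_loss(out, images, disc_net(out))  # formula (11)
        opt_g.zero_grad(); loss.backward(); opt_g.step()
```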

The experiments are based on the deep learning framework PyTorch 1.1.0 and run under the Ubuntu 18.04.2 system. The hardware is an Intel Core i7-8700K processor, 32 GB of memory, and an NVIDIA GTX 1080Ti graphics card. One complete training of the inpainting model takes about 21 hours, and the entire training period is about two weeks.

3. The Experimental Results Analysis

To verify the effectiveness of the proposed method on the inpainting task, it is compared horizontally with three excellent works in the field of image inpainting on both subjective effects and objective indicators, and it is evaluated comprehensively through the subjective perception of the human eye and objective data.

The proposed method is mainly trained and tested on the CelebA-HQ dataset [20], which contains 30,000 high-resolution face images obtained by filtering the face dataset CelebA (202,599 images of 10,177 identities), cropping, and super-resolution reconstruction. CelebA-HQ is chosen as the experimental dataset for two reasons: on the one hand, the face is a very particular kind of image, containing a specific facial structure yet allowing infinite variation in detail, which fully tests the repair capability of the network model; on the other hand, most related work in image inpainting uses this dataset, which allows comparison experiments to be carried out in a real and effective manner [21]. To prevent the network model from overfitting during training, operations such as random flipping and data shuffling are applied. To demonstrate the generality of the method, the proposed inpainting model is also trained and tested on the Places2 dataset [22].

To ensure that each algorithm in the comparison achieves its best repair effect, the code and examples used in the comparison experiments are the official versions released by the authors of each algorithm [23]; the inpainting results were obtained by loading the network models with the official pretrained files or by using a live demo. To compare and evaluate the four inpainting algorithms fairly, all inpainting results shown in this study are the raw network outputs, without any postprocessing. The compared state-of-the-art methods are those of Iizuka et al. [6], Yu et al. [7], and Wang et al. [8].

3.1. Subjective Effect Comparison

Figure 5 shows the results of inpainting six images, in three groups, with the four deep-learning-based inpainting algorithms. The three groups represent, respectively, the defect of a single facial feature, partial defects of several facial features, and a large-area facial defect. Figure 5(b) shows the image with the defect mask attached, which is the input of the neural network, and Figures 5(c)-5(f) compare the inpainting results of the three baseline algorithms and the method in this study. To verify the robustness of the algorithm, each group contains one male and one female image, with clear differences in appearance such as skin color and facial features.

The experimental results show that the improved context encoder [6] is limited by its network structure and does not perform well on this high-resolution dataset: although the defective facial features are correctly repaired, obvious pixel noise appears in the repaired area and the repair marks are evident; in particular, for large-area facial defects, the repaired facial features are blurred and the details are poor. The progressive repair network with an attention mechanism [7] performs well on small-area defects, with rich detail, natural transitions at the defect edges, and no obvious repair marks, but for large-area facial defects the repair of the defect area is not uniform and the traces are more obvious. The generative multicolumn convolutional network [8] is the best of the three baselines: it repairs the defect area accurately under various conditions, with clear facial contours and well-handled detail, but the transition at the defect edge is unnatural, there is a degree of color difference between the repaired area and the rest of the image, and postprocessing is required to achieve the best result, which increases the cost of inpainting.

In contrast, the proposed method restores images accurately in all these situations. Thanks to the dilated convolutions in the repair network, the network captures a larger area around the defect, so in the case of partial defects of the facial features it can repair the defect from the known information of the image, and the result is closer to the original than those of the baselines. The long skip connections and the adversarial loss mainly enhance the detail of the image and the consistency of global and local attributes: Figure 5 shows that the results of the proposed method transition naturally at the defect edges, render the facial features in finer detail, and keep the repaired area consistent in attributes with the whole image. Comparing the subjective visual perception of the results, the restorations of the proposed method are more real and natural, thanks to the inpainting loss based on structural similarity (SSIM), which better matches the human visual system. Test results of the method on the Places2 dataset are shown in Figure 6.

Subjective image quality assessment is inherently random: different observers may evaluate the same image differently because of individual factors such as age, gender, knowledge structure, and esthetic taste, and the same observer may evaluate the same image differently at different times. Fortunately, this randomness follows a statistical regularity: subjective scoring tests show that the multiple scores collected for the same image are relatively concentrated, with small variance. The sample mean is therefore taken as the final subjective evaluation value, called the mean opinion score (MOS). The datasets commonly used in image quality evaluation obtain their MOS values in this way, which eliminates the uncertainty caused by randomness to a certain extent.

3.2. Objective Effect Comparison

To quantitatively measure the inpainting effect, four image quality indicators commonly used in inpainting tasks are applied to the results of the different algorithms: the mean $\ell_{1}$ loss, the mean $\ell_{2}$ loss, the peak signal-to-noise ratio (PSNR), and the structural similarity index (SSIM). The PSNR is calculated as

$$\mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{\mathrm{MAX}^{2}}{\mathrm{MSE}} \right). \qquad (12)$$

In formula (12), MAX is the maximum possible pixel value in the image, and the result is in dB; the larger the value, the smaller the distortion of the repair result. A value above 40 dB indicates very good quality, that is, very close to the original image; 30-40 dB usually indicates good quality, with noticeable but acceptable distortion; and 20-30 dB indicates poor quality with an unacceptable degree of distortion.
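The PSNR of formula (12), computed from the mean square error; MAX = 1.0 assumes images normalized to [0, 1] (use 255.0 for 8-bit images).

```python
import math

def psnr(mse_value: float, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio of formula (12), in dB."""
    return 10.0 * math.log10(max_val ** 2 / mse_value)
```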

Among the four image quality indicators, the mean $\ell_{1}$ and $\ell_{2}$ errors focus on the difference between corresponding pixels of the two images, as does the peak signal-to-noise ratio. Since these three do not consider the visual characteristics of the human eye, their results, unlike structural similarity, may not fully agree with human subjective visual perception.

For the test case images, this study simulates different degrees of image damage by covering each test image with masks of four different areas; the mask color is the mean pixel value of the dataset. The test images after the additional defect masks are applied are shown in Figure 7.

From the CelebA-HQ test set, 50 images were randomly selected with a 1 : 1 male-to-female ratio and processed as above. The masked test images were repaired by the different algorithms, the repaired images were compared with the originals under the four evaluation indicators, and the results over the 50 images were averaged to obtain the final scores. The experimental results are shown in Table 4.

Table 4 lists the similarity between the repaired and original images for the four inpainting algorithms under defect masks of different areas. Among the four algorithms, all of which consider the global and local attribute consistency of the image in network structure design and loss constraints, the method in this study achieves relatively satisfactory results on all four evaluation indicators. On the mean $\ell_{1}$ and $\ell_{2}$ error indicators, the improved context encoding network [6] shows better numbers for large-area defects; the reason is that, during training, that work applies the adversarial loss (GAN_loss) jointly when constraining the repair result, so it pays more attention to pixelwise agreement, which favors these indicators. On the peak signal-to-noise ratio and structural similarity indicators, the proposed method achieves clearly favorable results: the method of Wang et al. [8] scores highly on the peak signal-to-noise ratio, but the method in this study performs better on structural similarity, the indicator that best matches human visual perception. A comprehensive comparison of the experimental data across the quality evaluation indicators shows that the method in this study achieves a relatively ideal inpainting effect.

In summary, in both subjective visual perception and objective data, the inpainting effect of the proposed algorithm is significantly better than that of current mainstream deep-learning-based inpainting algorithms: it repairs the defect area reasonably, the transition at the edges of the area is natural, the internal details are rich and realistic, the attributes of the repaired area are consistent with the image as a whole, and the overall structure of the repaired image is coherent and unified. The visual effect conforms to human visual experience, achieving the ultimate goal of the inpainting task.

3.3. Ablation Experiment

To verify the role of the various loss functions in the repair network, this section performs ablation experiments on the proposed inpainting model using the CelebA-HQ dataset. The proposed model serves as the base model, and five different ablation models are obtained by removing one of the loss functions at a time for training. Table 5 quantitatively compares the results of the loss-function ablation experiments. When the pixel reconstruction loss is removed from the joint training loss, every evaluation indicator of the repair results degrades to some degree. When the adversarial loss is removed, the quantitative indicators improve noticeably, although, as shown below, the visual quality of the generated texture suffers.

Figure 8 qualitatively compares the results of the loss-function ablation experiments. Figures 8(c) and 8(e) show that when the pixel reconstruction loss is removed from the joint training loss, the repair model cannot reconstruct the content effectively. Figures 8(d) and 8(e) show that when the adversarial loss is removed from the joint training loss, the repaired content has blurred texture (see the beard area repaired in the second row of Figure 8(d)).

4. Conclusions

This study proposes a novel image inpainting model based on a fully convolutional network combined with the idea of the generative adversarial network. The model starts from the global semantic understanding of the image and the consistency of global and local attributes. Aiming at the problems in existing work and drawing on mature and effective methods in image processing, the model incorporates several improvements so that it can effectively repair images with different degrees of defects, and the inpainting results reach an excellent level in both subjective visual effect and objective data indicators. It should be noted that although deep learning has strong learning and representation capabilities, the scenes involved in inpainting tasks are highly diverse. This study therefore proposes a general model whose training difficulty and time cost are within acceptable limits. When dealing with a specific problem, targeted training on the samples corresponding to that problem can yield a more ideal inpainting result; for more specialized application scenarios, the network model can be adjusted to actual needs, with operations such as transfer learning and fine-tuning, to achieve a better inpainting effect.

This study proposes an end-to-end automatic image inpainting model, in which the inpainting results cannot be influenced by human intervention. In future work, building on this study, while continuing to improve the inpainting effect, we will further explore conditional inpainting tasks so that preset conditions and human-computer interaction can directly or indirectly affect the results of image inpainting, thereby further improving the practical value of the model [24].

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Ruifang Wei is mainly responsible for the proposal of the thesis idea, the realization of the experiment, the adjustment of the model parameters, and the writing of the thesis. Yukun Wu is mainly responsible for the overall conception of the thesis and the design of the experiments, and checking and proofreading the grammar of the article.