Abstract

Despite extensive research on robust image inpainting algorithms in recent years, there are currently almost no objective metrics for the quality assessment of inpainted images. Inspired by the feature coherence of the inpainted image and the human visual perception mechanism, this paper proposes an image inpainting quality assessment (IIQA) metric that takes into account both visual saliency and structural features. First, the quality issues associated with image inpainting are categorized into three aspects: incoherent structure, unreasonable texture, and other results that are inconsistent with human visual perception. These quality problems are further expressed as “regions of interest” and extracted with an improved visual saliency method based on a natural statistics model. Subsequently, the structural features are computed through the nonlinear diffusion of the horizontal and vertical gradient fields of the inpainted image. Finally, an IIQA metric incorporating brightness, gradient similarity, structural similarity, and visual saliency is established. The quality evaluation is conducted by comparing each patch within the inpainted region with its best match from the known region. Quantitative experimental results demonstrate the effectiveness of the proposed method, especially for images with structural discontinuity. A comparative study also shows that the Spearman rank order correlation coefficient of our method reaches 0.875 on the tested database, outperforming existing IIQA metrics.

1. Introduction

The primary goal of image inpainting is to select appropriate patches from the known region and use them to fill in the missing region, thereby restoring the entire image; it involves estimating and inferring the missing information. With the development of modern technology, methods such as super-resolution [1], attention mechanisms [2], and deep learning [3] have been widely applied to image inpainting and have achieved satisfactory results. However, image inpainting quality assessment (IIQA) currently depends on observers’ subjective expertise: as long as there are no noticeable visual imperfections in the repaired image, the inpainting task is considered successful. Despite the extensive research on developing robust image inpainting algorithms, little attention has been paid to effective quality assessment metrics for evaluating the performance of image inpainting techniques. Due to the absence of reference images for comparison, commonly used image quality metrics, such as mean squared error and peak signal-to-noise ratio, cannot be applied directly.

In recent years, scholars have made efforts to tackle these concerns using techniques such as feature analysis and machine learning. Wang et al. [4] observed that human eyes are highly adaptive to changes in the structural information of images and therefore proposed the structural similarity index measure (SSIM) to quantify the similarity between a reference image and a distorted image. Feie [5] proposed an image evaluation indicator, BorSal, to estimate the density of visual attention in a narrow band around the target boundary, as well as the StructBorSal index, which combines SSIM with visual saliency to evaluate the quality of structural regions in the inpainted image. Liu et al. [6] also proposed an image quality assessment metric based on a variant of SSIM. For the quality evaluation of inpainted images, however, SSIM may not be reliable when images contain large inpainted regions, as these areas can deviate significantly from the actual content. Qureshi et al. [7] classified IIQA methods into three categories: structure-based, saliency-based, and machine learning-based methods, and also tested the performance of typical inpainting approaches. Hu et al. [8] proposed a no-reference quality evaluation metric for Thangka images: the structural features of the Thangka image were extracted, and the differences between the original region and the repaired region were further analyzed; however, this metric is applicable only to Thangka images. Liu et al. [9] introduced preattention theory to emulate visual perception by refining luminance-channel data and developed a quality assessment metric for distorted images. The study in [10] used sample-based image completion methods to generate a set of 100 inpainted images, and an evaluation metric based on the saliency map and the repaired region was proposed and discussed. Unfortunately, only part of the image features were considered in [9, 10], and these metrics perform well only in special cases. Additionally, a significant limitation shared by the aforementioned metrics is the need for a reference image, which is not available in image inpainting.

Image quality evaluation approaches based on machine learning offer effective functional approximations of image features and evaluation scores [11]. Voronin et al. [12] proposed extracting local binary pattern features from images and collecting quality rating scores through subjective experiments; these data were used to train a support vector regression network for image quality prediction. The database consisted of 300 images, and a total of 10 participants took part in the subjective assessment. The findings demonstrated that the objective evaluation results correlate strongly with human perception. Isogawa et al. [13] proposed a learning-based sorting method to automatically estimate the optimal parameters of image inpainting algorithms, determining the relative ranking of different inpainted images with a learning-to-rank approach. Meng et al. [14] introduced a novel approach for assessing the performance of no-reference image inpainting algorithms using deep rank learning: a pair of inpainted images is taken as input, and their ranking order is predicted. This method can be applied to the evaluation of both inpainted images and inpainting algorithms. Madhusudana et al. [15] addressed image quality assessment with a deep convolutional neural network, taking images containing synthetic and realistic distortions as the database for predicting distortion type and degree; this method is most applicable to the quality assessment of images affected by synthetic or authentic distortions. Chen et al. [16] first applied unlabeled data to conduct self-supervised pretraining for blind image quality assessment: distorted images are generated from high-quality samples to form the database, and a contrastive loss function is introduced to capture quality-sensitive information. For machine learning-based image quality assessment methods, subjectively and manually annotated rating scores are crucial for training the regression models. In addition, current no-reference image quality assessment algorithms rely on artificially created distortion features, and the learning phase requires a substantial number of labeled samples; however, no public databases are currently available for the different machine learning-based IIQA methods.

The challenge in assessing the quality of image inpainting lies in the fact that inpainting procedures can produce visible anomalies within and around the inpainted regions. IIQA is essentially based on the consistency between the repaired region and the known region. Although reference images do not exist, usable information can still be extracted from the known region for quality evaluation, which makes feature extraction and feature comparison the key factors in image quality assessment. Hence, we propose a new inpainted image quality assessment metric that reformulates visual inconsistency problems as alterations of human attention and quantifies the similarity between inpainted regions and their best-match patches in the known region. The main contributions of the present work are as follows:
(i) We propose to extract the “regions of interest” in the inpainted image with an improved visual saliency method using a natural statistics model (ISUN).
(ii) The structure map of the inpainted image is extracted through the nonlinear diffusion of the horizontal and vertical gradient fields.
(iii) An IIQA metric involving brightness, gradient similarity, structural similarity, and visual saliency is proposed and tested on different inpainted images.

The rest of this paper is organized as follows: the visual saliency of inpainted images is calculated in Section 2, and the IIQA metric is established in Section 3. Experimental results and discussion are presented in Section 4, and the conclusion is drawn in Section 5.

2. The Extraction of “Regions of Interest” in the Inpainted Image

Human eyes can recognize various image quality problems because the inpainted regions often contain features that are dissimilar to their surroundings or even to the entire image. As shown in Figure 1, we summarize some typical problems in inpainted images. The main factors that affect the quality of image inpainting are structural inconsistencies, unreasonable textures, and other issues that do not align with human visual perception.

In the inpainted image, the inpainted region is expected to exhibit structure and texture coherent with the known region. Two structural inconsistency problems are highlighted by the red squares in Figure 1. The edge of the water is discontinuous in Figure 1(a), and an unexpected structure, a spurious circle, appears in Figure 1(b). According to human perception, the interlacing of different textures, the repetition of the same texture over a large area, and blocking effects should be avoided. Examples of typical texture problems are given in Figures 1(c) and 1(d). Although texture irrationality is not as obvious to human perception as structural problems, it still has a great impact on image quality. In addition, there are other problems, as shown in Figures 1(e) and 1(f): although these images show no structural discontinuity or texture irrationality, the plants in Figure 1(e) should not multiply endlessly in the water, and no object could cast a shadow at the given position in Figure 1(f).

It is easy to see that the areas marked in Figure 1 are “regions of interest” that readily attract human attention. Therefore, a human visual perception model is applied to extract the “regions of interest” in the inpainted image. In this way, traditional image evaluation procedures, which include feature extraction, feature comparison, and other processes, can be reformulated as alterations of human visual attention in the inpainted region. Image features are used to measure whether a region attracts human attention; typically, regions with sparser or more distinctive features than their surroundings exhibit higher saliency. In this paper, we introduce the ISUN statistics model [17], which utilizes the inherent statistical properties of images to extract the distorted regions in inpainted images. In ISUN, a large number of natural images from a specific image library are taken as the sample set. Unsupervised learning on the sample images then yields basic independent component analysis (ICA) filters, which represent image characteristics. Based on the probability distribution of the ICA feature set, the saliency map of an image can be obtained.

In ISUN, image features are acquired by applying principal component analysis (PCA) and ICA to a group of preprocessed images from a database [17]. For each image, 100 patches of size b × b × 3 are extracted at random locations, and each patch is treated as a 3b²-dimensional vector. The dimensionality of the patch collection is first reduced by PCA, limiting the number of image features to d, and ICA is then applied to these reduced patches to obtain the learned filters. In this way, features can be extracted from an image by filtering it with each of the learned ICA filters. As discussed in the ISUN model, the visual saliency of an image is defined as [9]

$$s(x) = \frac{1}{p(X = x)},$$

where X represents the obtained ICA feature set and x its observed value. The components of X are statistically independent after ICA processing, so p(X = x) can be written as a product of 1D distributions, as shown in Equation (1):

$$p(X = x) = \prod_{i=1}^{d} p(x_i), \qquad (1)$$

where x_i is the ith element of the vector x, i.e., the ith feature. The generalized Gaussian distribution (GGD) given in Equation (2) is used to model each 1D distribution in Equation (1):

$$p(x_i) = \frac{\theta_i}{2\sigma_i \Gamma(1/\theta_i)} \exp\!\left[-\left(\frac{|x_i|}{\sigma_i}\right)^{\theta_i}\right], \qquad (2)$$

where θ_i is the shape parameter, σ_i is the scale parameter, and Γ(·) is the gamma function. For each ICA filter, a suitable 1D GGD is selected and its parameters are estimated using the algorithm described in [5]. Each feature is assigned a weight based on its frequency of occurrence, which enhances the visual saliency of rare features. The flowchart of saliency map extraction is shown in Figure 2.
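To make the pipeline concrete, the following Python sketch mirrors the steps above: random patches are vectorized, reduced by PCA, filtered by learned ICA components, and each component is fitted with a 1D GGD (SciPy's gennorm implements exactly the density of Equation (2)); saliency is then the negative log-probability of the features. The patch size, feature count, and non-overlapping scan are illustrative assumptions, not the paper's exact settings.

```python
# A minimal sketch of ISUN-style saliency, assuming a list of RGB training
# images with values in [0, 1]. Parameter values are illustrative.
import numpy as np
from sklearn.decomposition import PCA, FastICA
from scipy.stats import gennorm

def learn_ica_filters(train_images, b=8, d=50, n_patches=100, seed=0):
    """Learn ICA filters from random b x b x 3 patches (PCA first, then ICA)."""
    rng = np.random.default_rng(seed)
    patches = []
    for img in train_images:                       # img: H x W x 3
        h, w, _ = img.shape
        for _ in range(n_patches):
            y, x = rng.integers(0, h - b), rng.integers(0, w - b)
            patches.append(img[y:y + b, x:x + b].ravel())   # 3b^2-dim vector
    patches = np.asarray(patches)
    pca = PCA(n_components=d).fit(patches)          # limit features to d
    ica = FastICA(n_components=d, whiten="unit-variance").fit(pca.transform(patches))
    return pca, ica

def saliency_map(img, pca, ica, b=8):
    """Saliency = -log p(x), p(x) = prod_i p(x_i), each p(x_i) a GGD (Eq. 2)."""
    h, w, _ = img.shape
    coords, feats = [], []
    for y in range(0, h - b + 1, b):               # non-overlapping scan for brevity
        for x in range(0, w - b + 1, b):
            coords.append((y, x))
            feats.append(img[y:y + b, x:x + b].ravel())
    X = ica.transform(pca.transform(np.asarray(feats)))
    log_p = np.zeros(len(feats))
    for i in range(X.shape[1]):                    # fit one 1D GGD per ICA component
        beta, loc, scale = gennorm.fit(X[:, i], floc=0)
        log_p += gennorm.logpdf(X[:, i], beta, loc=loc, scale=scale)
    sal = np.zeros((h, w))
    for (y, x), lp in zip(coords, log_p):
        sal[y:y + b, x:x + b] = -lp                # rare features -> high saliency
    return (sal - sal.min()) / (np.ptp(sal) + 1e-12)   # normalize to [0, 1]
```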

Taking the image to be inpainted in Figure 3(a) as an example, the methods proposed in [18, 19] are applied to repair this image, and the results are shown in Figure 3(b). The visual saliency map of each inpainted image is extracted using the ISUN model, and the results are shown in Figure 3(c). Obviously, the incoherent structure and unreasonable contents in the inpainted region exhibit notable saliency. The visual saliency map of the inpainted region is directly related to image quality: both the size of the salient region and the corresponding saliency degree indicate where image inpainting has not been performed effectively. In addition, according to visual cognitive theory, not all regions in an image are equally important to human eyes. Taking Figure 3(c) as an example, human visual perception is more sensitive to structural breaks, occlusions, and discontinuities, and relatively less sensitive to texture details. Thus, structural features should be emphasized in image quality assessment.

3. IIQA Metric Using Structural Features and Visual Saliency

The SSIM proposed by Wang et al. [4] provides an effective way to evaluate image quality from the perspective of structural distortion. In SSIM, the mean, standard deviation, and covariance of an image are utilized as estimates of brightness, contrast, and structural similarity, respectively. The SSIM metric is defined in Equation (3):

$$\mathrm{SSIM}(f, f') = [l(f, f')]^{\alpha} \, [c(f, f')]^{\beta} \, [s(f, f')]^{\gamma}, \qquad (3)$$

where f and f' represent the reference image and the image under evaluation. The parameters α, β, and γ calibrate the weights of the different indicators; in general, α = β = γ = 1. The brightness function l(f, f'), contrast function c(f, f'), and structure function s(f, f') are given in Equations (4)–(6):

$$l(f, f') = \frac{2\mu_f \mu_{f'} + C_1}{\mu_f^2 + \mu_{f'}^2 + C_1}, \qquad (4)$$

$$c(f, f') = \frac{2\sigma_f \sigma_{f'} + C_2}{\sigma_f^2 + \sigma_{f'}^2 + C_2}, \qquad (5)$$

$$s(f, f') = \frac{\sigma_{ff'} + C_3}{\sigma_f \sigma_{f'} + C_3}, \qquad (6)$$

where C_1, C_2, and C_3 are small constants, μ_f and μ_{f'} represent the average brightness of the two images, σ_f and σ_{f'} denote their standard deviations, and σ_{ff'} is their covariance. According to Equations (3)–(6), structural similarity remains invariant to changes in contrast and brightness. However, SSIM computes the brightness, contrast, and structural similarity indiscriminately over all image regions. As discussed in Section 2, the subjective sensitivity of human eyes to texture regions and structure regions is quite different. In this paper, the spatial variation of image gradient features extracted by the structure tensor method is used to replace the contrast and structure functions of SSIM.
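For reference, here is a minimal sketch of the SSIM index of Equations (3)–(6), computed globally on grayscale images for brevity (Wang et al. [4] evaluate it over sliding windows and average the local values); the stabilizing constants are illustrative values.

```python
# Global SSIM of Equations (3)-(6), assuming alpha = beta = gamma = 1 and
# grayscale float inputs in [0, 1]; constants c1-c3 are illustrative.
import numpy as np

def ssim_global(f, f2, c1=1e-4, c2=9e-4, c3=None):
    """Return l * c * s from the means, standard deviations, and covariance."""
    if c3 is None:
        c3 = c2 / 2.0                          # common convention; an assumption here
    mu_f, mu_g = f.mean(), f2.mean()
    sd_f, sd_g = f.std(), f2.std()
    cov = ((f - mu_f) * (f2 - mu_g)).mean()
    l = (2 * mu_f * mu_g + c1) / (mu_f**2 + mu_g**2 + c1)   # brightness, Eq. (4)
    c = (2 * sd_f * sd_g + c2) / (sd_f**2 + sd_g**2 + c2)   # contrast,  Eq. (5)
    s = (cov + c3) / (sd_f * sd_g + c3)                     # structure, Eq. (6)
    return l * c * s                                        # Eq. (3)
```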

The anisotropic diffusion-based structure tensor method [20] effectively enhances structure contours by employing nonlinear diffusion in color space as well as the horizontal, vertical, and 45° gradient spaces, thereby reducing feature redundancy and computational complexity. For an image I(x, y), the feature space extracted by the structure tensor method is given in Equation (7):

$$J_{\sigma} = K_{\sigma} * \left(\nabla I \, \nabla I^{T}\right) = \begin{pmatrix} K_{\sigma} * I_x^2 & K_{\sigma} * (I_x I_y) \\ K_{\sigma} * (I_x I_y) & K_{\sigma} * I_y^2 \end{pmatrix}, \qquad (7)$$

where K_σ is a Gaussian kernel with variance σ, and I_x and I_y are the horizontal and vertical gradients of the image. The convolution of K_σ with the different gradient fields smooths the gradient features in different directions. Unfortunately, convolution with a Gaussian kernel often leads to the misalignment of target boundaries in the feature space. To solve this problem caused by linear filtering, the Gaussian convolution can be replaced by nonlinear diffusion [21], as given in Equation (8):

$$\frac{\partial u_k}{\partial t} = \mathrm{div}\!\left(g\!\left(\sum_{l=1}^{N_1} \lvert \nabla u_l \rvert^2\right) \nabla u_k\right), \quad k = 1, \ldots, N_1, \qquad (8)$$

where N_1 is the number of gradient channels; we set N_1 = 2 in this paper, which means only the gradient features in the horizontal and vertical directions are considered. g is the diffusion coefficient with g(s) = 1/\sqrt{s + \xi^2}, where ξ is a small positive number. Additive operator splitting [22] is used to solve Equation (8). At the initial moment, t = 0, u_1^0 and u_2^0 represent the horizontal gradient I_x and the vertical gradient I_y. When t = K, the iteration stops, and the features are recorded as u_1^K and u_2^K. The gradient feature U of image I(x, y) is obtained by adding the horizontal and vertical gradients, as given in Equation (9):

$$U = \lambda \left(u_1^K + u_2^K\right), \qquad (9)$$

where λ is a coefficient used to adjust the value range of U. Based on the gradient feature, the gradient similarity of images f and f' is defined in Equation (10):

$$g(f, f') = \frac{2\varphi_f \varphi_{f'} + C_4}{\varphi_f^2 + \varphi_{f'}^2 + C_4}, \qquad (10)$$

where φ_f and φ_{f'} represent the averages of the gradient features U_f and U_{f'}, and C_4 is a small constant. The structural similarity function based on gradient features is defined in Equation (11):

$$s_g(f, f') = \frac{\sigma_{U_f U_{f'}} + C_5}{\sigma_{U_f} \sigma_{U_{f'}} + C_5}, \qquad (11)$$

where σ_{U_f U_{f'}} is the covariance between U_f and U_{f'}, σ_{U_f} and σ_{U_{f'}} are their standard deviations, and C_5 is a small constant. Accordingly, the proposed IIQA metric is defined as follows:

$$\mathrm{IIQA}(f, f') = [l(f, f')]^{\alpha} \, [g(f, f')]^{\beta} \, [s_g(f, f')]^{\gamma}. \qquad (12)$$
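The sketch below illustrates the coupled nonlinear diffusion of Equation (8) and the gradient feature of Equation (9). For brevity it uses a simple explicit update rather than the additive operator splitting of [22], and the diffusivity form, iteration count, time step, and ξ are illustrative assumptions.

```python
# Coupled nonlinear diffusion of the horizontal/vertical gradient fields
# (Eq. 8), followed by the gradient feature U (Eq. 9). Explicit scheme and
# parameter values are illustrative, not the paper's AOS implementation.
import numpy as np

def gradient_feature(img, K=20, dt=0.2, xi=1e-2, lam=1.0):
    """Diffuse u1 = Ix and u2 = Iy jointly for K steps, then return U."""
    u = [np.gradient(img, axis=1), np.gradient(img, axis=0)]   # u1^0, u2^0
    for _ in range(K):
        # Shared diffusivity g(|grad u1|^2 + |grad u2|^2) couples the channels.
        s = sum(np.gradient(c, axis=1)**2 + np.gradient(c, axis=0)**2 for c in u)
        g = 1.0 / np.sqrt(s + xi**2)
        g /= g.max()            # normalize so the explicit step stays stable
        for k in range(2):
            gx = np.gradient(u[k], axis=1)
            gy = np.gradient(u[k], axis=0)
            div = np.gradient(g * gx, axis=1) + np.gradient(g * gy, axis=0)
            u[k] = u[k] + dt * div                 # one explicit step of Eq. (8)
    return lam * (u[0] + u[1])                     # gradient feature U, Eq. (9)
```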

It is worth noting that in image inpainting, the reference image f does not exist for quality evaluation. We therefore propose to measure the quality of image inpainting by comparing the inpainted region with the known region. For each pixel p in the repaired region, a patch M(p) of size w × w centered at p is selected. Subsequently, the proposed IIQA is employed to search for the best match M′ of patch M(p) within the known region. The similarity between M(p) and the known region is defined as follows:

$$\mathrm{IIQA}_p = \max_{M' \subset \Phi_C} \mathrm{IIQA}\left(M(p), M'\right), \qquad (13)$$

where Φ_C represents the known region and Φ_S represents the inpainted region. As discussed in Section 2, human eyes are more sensitive to fractures, occlusions, and discontinuities of structure in the image than to texture or other details. The objective quality of image inpainting is thus defined as the saliency-weighted sum of IIQA_p over all patches within the inpainted region. Let S_p be the normalized saliency degree of pixel p in the inpainted region; the quality Q of an inpainted image, considering visual saliency and structural features, is given in Equation (14):

$$Q = \frac{1}{P} \sum_{p \in \Phi_S} S_p \cdot \mathrm{IIQA}_p, \qquad (14)$$

where P is the number of pixels in the inpainted region. The IIQA algorithm is described in detail as follows:

Input: image I(x, y), known region ΦC, inpainted region ΦS, iteration number K
Output: quality score Q
  u1^0 ← Ix, u2^0 ← Iy, k ← 0
  while k < K do
    update u1^k and u2^k by one AOS step of Equation (8); k ← k + 1
  end while
  compute the gradient feature U by Equation (9)
  compute the normalized saliency map S with the ISUN model (Section 2)
  for all M(p) ∈ ΦS do
    IIQAp ← max over M′ ⊂ ΦC of IIQA(M(p), M′) by Equation (13)
  end for
  Q ← saliency-weighted pooling of IIQAp by Equation (14)
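A minimal Python sketch of this procedure is given below. The callable iiqa_patch stands in for the patch-level metric of Equation (12), the stride-based exhaustive search is only one possible strategy (the paper does not mandate one), and the pooling follows the saliency-weighted form reconstructed in Equation (14).

```python
# Patch-matching quality score of Equations (13)-(14). Inputs: grayscale
# image, boolean mask (True inside the inpainted region), normalized
# saliency map, and a patch-level metric. Window size and stride are
# illustrative assumptions.
import numpy as np

def inpainting_quality(img, mask, saliency, iiqa_patch, w=9, stride=4):
    h, wd = img.shape
    r = w // 2
    # Candidate matches M' drawn entirely from the known region.
    known = []
    for y in range(r, h - r, stride):
        for x in range(r, wd - r, stride):
            if not mask[y - r:y + r + 1, x - r:x + r + 1].any():
                known.append(img[y - r:y + r + 1, x - r:x + r + 1])
    total, count = 0.0, 0
    for y in range(r, h - r, stride):
        for x in range(r, wd - r, stride):
            if not mask[y, x]:
                continue                       # only patches centered in Phi_S
            m_p = img[y - r:y + r + 1, x - r:x + r + 1]
            iiqa_p = max(iiqa_patch(m_p, m) for m in known)   # Eq. (13)
            total += saliency[y, x] * iiqa_p                  # Eq. (14) pooling
            count += 1
    return total / max(count, 1)
```

In practice, iiqa_patch could combine the brightness term of Equation (4) with the gradient-based terms of Equations (10) and (11), and the exhaustive search could be replaced by any approximate nearest-neighbor scheme.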

4. Results and Discussion

Typical images shown in Figure 4(a) are taken as examples and repaired by the Criminisi algorithm [23], the PAMSRIC algorithm [24], the priority-BP algorithm [25], and the DLIC algorithm [26], respectively. Sixteen inpainted images are thereby obtained for quality evaluation, as shown in Figures 4(b) and 4(c). For subjective rating, 18 observers were selected to score the inpainted images according to the criteria given in Table 1. The proposed approach was employed for objective evaluation, and the Pearson linear correlation coefficient (PLCC) and the Spearman rank order correlation coefficient (SROCC) [27] were applied to test the performance of IIQA.
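Concretely, both coefficients can be obtained with SciPy; the score vectors below are illustrative placeholders, not the study's data.

```python
# PLCC and SROCC between subjective and objective scores via scipy [27].
from scipy.stats import pearsonr, spearmanr

subjective = [4.2, 3.8, 2.5, 4.6, 3.1]       # illustrative placeholder scores
objective = [0.81, 0.74, 0.52, 0.88, 0.60]   # illustrative Q values

plcc, _ = pearsonr(subjective, objective)    # linear correlation
srocc, _ = spearmanr(subjective, objective)  # rank-order correlation
print(f"PLCC = {plcc:.3f}, SROCC = {srocc:.3f}")
```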

As shown in Figure 4, the inpainted images are classified into group (b) and group (c). The average subjective rating scores from the 18 observers are shown in Figure 5(a), and the objective scores obtained with the proposed IIQA method are given in Figure 5(b). The PLCC and SROCC values between the subjective and objective scores are shown in Table 2.

As shown in Table 2, the correlation between the subjective and objective evaluation scores is very high, indicating that the objective evaluation results are largely consistent with the subjective ratings. These results verify the effectiveness of the proposed IIQA method. It is also worth noting that the PLCC and SROCC coefficients of the images in group (c) are lower than those of the images in group (b). Some inpainted images in group (b) clearly conflict with the subjective perception of human eyes and are easier for observers to notice, whereas the inpainted images in group (c) show no obvious structure or texture problems, which leads to a certain difference between the subjective rating and the objective evaluation: our approach is more sensitive to structural defects, which capture human attention more easily, than to texture issues.

The results in Figure 5 also indicate that the proposed approach does not directly depend on the size of the inpainted regions. For example, the inpainted regions in image 5 and image 7 are large compared with the others, yet both the subjective and objective rating scores of these two images are relatively high. In contrast, the inpainted regions in image 1 and image 4 are smaller, but the rating scores of images 1(b) and 4(b) are much lower. The reason is that our IIQA metric incorporates both visual saliency and structural features, thereby yielding results consistent with subjective evaluation, particularly for images with structural problems.

To further verify the performance of the proposed approach, we compared our method with related works. Unfortunately, IQA methods based on machine learning were trained on locally generated datasets, which makes direct comparison with those metrics difficult. Therefore, we chose StructBorSal [5], OIM [10], and B-IIQA [28] for the comparative study. All these methods were applied to the public database TUM-IID, which contains 272 inpainted images. The obtained SROCC values of these approaches are summarized in Table 3. The SROCC value of our method is higher than those of the other metrics, indicating the effectiveness and superior performance of the proposed IIQA metric.

5. Conclusions

The absence of a reference image has consistently posed a significant challenge for IIQA. In this paper, we proposed to address the IIQA problem by employing image feature extraction techniques and visual saliency analysis. The texture and structure defects in the inpainted image were converted into “regions of interest” and analyzed through the extraction of a visual saliency map. Based on the structure features obtained by the nonlinear diffusion of the gradient space, an IIQA metric including brightness, gradient similarity, and structural similarity was proposed and tested on different types of images. Experimental results showed that the rating scores of our method are consistent with the subjective rating results, especially for images with structural defects, and the proposed metric also performed better than existing ones. The limitation of our method is that when there are no obvious structure or texture problems in the inpainted image, the objective results show a lower correlation with the subjective evaluation. In the future, our focus will be on IIQA metrics that integrate both human perception and deep learning to further enhance performance.

Data Availability

The data used to support the study are openly available in a public repository.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was funded by the Qin Xin Talents Cultivation Program, Beijing Information Science & Technology University (QXTCP C202106).