Abstract

Blind image quality assessment (BIQA) has made significant progress, but it remains a challenging problem due to the wide variation in image content and the diverse nature of distortions. To address these challenges and improve the adaptability of BIQA algorithms to different image contents and distortions, we propose a novel model that incorporates multiperspective consistency. Our approach introduces a multiperspective strategy to extract features from various viewpoints, enabling us to capture more beneficial cues from the image content. To map the extracted features to a scalar score, we employ a content-aware hypernetwork architecture. Additionally, we integrate all perspectives by introducing a consistency supervision strategy, which leverages cues from each perspective and enforces a learning consistency constraint between them. To evaluate the effectiveness of our proposed approach, we conducted extensive experiments on five representative datasets. The results demonstrate that our method outperforms state-of-the-art techniques on both authentic and synthetic distortion image databases. Furthermore, our approach exhibits excellent generalization ability. The source code is publicly available at https://github.com/gn-share/multi-perspective.

1. Introduction

Nowadays, digital images have become a crucial media format in people's daily life, thanks to the widespread use of intelligent devices. However, various distortions can occur during image capture, processing, and transmission, making image quality assessment (IQA) an urgent need. IQA methods can generally be categorized into subjective image quality assessment and objective image quality assessment [1]. Subjective image quality assessment is reliable and accurate as it relies on human participation. However, it is also time-consuming and laborious. Therefore, considerable effort has been dedicated to objective image quality assessment in the past decades [2–6]. The goal of objective image quality assessment is to explore image quality perception models that conform to the human visual system (HVS). Based on the availability of reference images, objective image quality assessment methods can be further divided into three categories, namely, full-reference IQA (FR-IQA) [7, 8], reduced-reference IQA (RR-IQA) [9], and no-reference IQA (NR-IQA) [10]. FR-IQA and RR-IQA models utilize either the entire or part of the pristine images to predict the quality scores and usually perform well [11]. However, their application scenarios are very limited, as reference images are not available in most cases. In contrast, NR-IQA predicts image quality without any pristine image information. Despite being the most challenging problem in IQA, blind image quality assessment (BIQA), i.e., NR-IQA, continues to attract significant attention due to its wide range of applications [12].

In addition to the absence of reference images, the existing datasets for blind image quality assessment (BIQA) exhibit diverse image contents and distortions. Figure 1 shows several sample images from the LIVE Challenge (LIVEC) and LIVE datasets. It is evident that the LIVEC dataset, comprising authentically distorted images, encompasses a wide range of content, including indoor and outdoor scenes, day and night scenarios, as well as natural and artificial landscapes. Similarly, the synthetically distorted images in the LIVE dataset demonstrate significant differences compared to authentic distortions and cover various categories. The diversity in both distortion and image content further amplifies the challenge associated with the BIQA problem. Firstly, it necessitates a more robust representation capability to effectively capture the nuances of images with diverse content. Secondly, adapting the model to encompass a broad spectrum of authentic and synthetic distortions poses significant difficulties. Over the past few decades, extensive research has focused on identifying effective quality-aware features that accurately represent image content and distortion. Early studies predominantly employed handcrafted features such as natural scene statistics (NSS) [13] and the generalized Gaussian distribution (GGD) [13]. In recent years, learning-based approaches, particularly convolutional neural network (CNN) methods [4, 14–16], have gained significant attention in BIQA research. While these studies have achieved promising results, further efforts are required to bridge the gap between BIQA methods and the human visual system (HVS) for enhanced performance.

An exemplary method that demonstrates the advantages of utilizing powerful feature learning and content-aware hyperparameter generation is HyperIQA [4]. HyperIQA leverages the ResNet-50 architecture [17], known for its robust feature learning capabilities, and incorporates a content-aware hyperparameter generation mechanism based on hypernetworks [18]. This approach surpasses the performance of state-of-the-art methods when evaluated on databases containing authentic distorted images. However, it is worth noting that HyperIQA’s performance on synthetic distorted image databases is comparatively weaker. This observation further highlights the challenge of adapting the model to handle a wide range of distortion types and characteristics. The difficulty in achieving consistent performance across various distorted images underscores the need for further advancements in BIQA research.

As the ancient Chinese poem describes, "It's a range viewed in face and peaks viewed from the side," the idea of perceiving different aspects through various perspectives serves as inspiration for our proposed approach. We aim to enhance the adaptability of our algorithm to the content variation and diverse distortions present in images. Interestingly, similar ideas can be observed in contrastive self-supervised learning algorithms [19, 20], where two augmented views of an input image are processed by two encoders to generate similarity features in an embedding space. In our approach, we deviate from contrastive self-supervised learning by employing distinct architectures to simulate different perspectives specifically tailored for the blind image quality assessment (BIQA) task. By leveraging multiple perspectives, we aim to capture a more comprehensive understanding of image quality, effectively addressing the challenges posed by content variation and diverse distortions. More specifically, we apply two different ResNet architectures to extract information from two different perspectives. Figure 2 illustrates the visualization results of partial feature maps produced by the different networks. For the same image, different perspectives learn different cues. When using multiple perspectives, we must solve the problem of how to integrate these perspectives into the model. To effectively account for both the multiperspective cues and the network complexity, we propose a consistency supervision strategy to integrate multiple perspectives. This strategy allows us to merge and harmonize the information from multiple perspectives. The proposed training strategy is similar to knowledge distillation [21], which is utilized in a recently proposed dual-branch semisupervised framework named SSLIQA [22]. The main difference between our model and the knowledge distillation-based method is that the subnetworks in our proposed model promote each other during training instead of using single-direction supervision. Moreover, to take advantage of the content-aware ability of hypernetworks, we employ HyperIQA as the backbone for the two different perspectives.

In this paper, we present a novel approach to address the BIQA problem in a multiperspective way. The main contributions of our paper are outlined as follows:
(1) We introduce a multiperspective approach for BIQA that enhances the adaptability of the algorithm to content variation and diverse distortions. By incorporating multiple perspectives, we capture a more comprehensive understanding of image quality. To simulate these perspectives, we employ different ResNet architectures, each representing a distinct viewpoint.
(2) We devise a training strategy based on multiperspective consistency to effectively integrate the perspectives. This strategy leverages the specificity of individual perspectives and the generality achieved by considering multiple perspectives. The integration of these perspectives leads to improved assessment accuracy.
Extensive experiments conducted on five representative IQA datasets validate the effectiveness and generalization ability of our proposed method. The results demonstrate significant improvements in blind image quality assessment, highlighting the advantages of our multiperspective approach.

2. Related Work

In the past two decades, many algorithms have been proposed to address the BIQA problem. These approaches can be broadly categorized into two main groups: handcrafted feature-based methods and learning feature-based methods.

Early studies in BIQA focused on representing distorted images by designing artificial features. One commonly used approach was to leverage natural scene statistics (NSS) as a basis for handcrafted feature design. For example, the blind/referenceless image spatial quality evaluator (BRISQUE) [13] used the NSS of locally normalized luminance coefficients to measure the unnaturalness of an image. These extracted features were then fed into a support vector regression (SVR) model to predict the image score. NIQE [23] constructed a "quality-aware" collection of statistical features based on the NSS model. Zhang et al. [24] developed the integrated local natural image quality evaluator (ILNIQE) by incorporating more local quality-aware information into NIQE; it measures the distance between statistical features of the NSS model learned from pristine images and statistical features of the distorted image. Additionally, Jain et al. proposed a model that combined NSS with a CNN and achieved promising results [25]. Multiple distributions such as the generalized Gaussian distribution (GGD) [13], the asymmetric generalized Gaussian distribution (AGGD) [13, 24], and histogram counting [26] were also used to capture the statistics of distorted images. Yue et al. combined statistical properties, NSS-based features, and structure and texture features to predict the quality of transparently encrypted images [27]. Moreover, some works used corner descriptors (e.g., SIFT [28] and Harris [29]) to predict image quality.

Learning feature-based methods seek to automatically learn quality-aware features from images. For example, Xu et al. [30] proposed an efficient and robust BIQA model based on high order statistics aggregation (HOSA). It was a codebook-based approach, which utilized locally normalized image patches as local features and constructed the codebook using K-means. More recently, convolutional neural networks (CNNs) have been adopted for BIQA and have made great progress. Kang et al. [14] addressed the BIQA problem using a simple end-to-end CNN model consisting of one convolutional layer, one max pooling layer, one min pooling layer, and three fully connected layers, which is considered to be the earliest CNN-based approach for BIQA. Kim and Lee [15] proposed a blind image evaluator based on a convolutional neural network (BIECON), which imitated FR-IQA behavior by generating a local quality map using a deep convolutional neural network. To simultaneously handle both synthetic and authentic distortions, Zhang et al. [16] proposed a deep bilinear CNN (DB-CNN) model for BIQA. They adopted a specific CNN architecture inspired by VGGNet [31] to extract features for synthetic distortions and a tailored VGGNet for authentic distortions. VGGNet was also used to construct the feature extractors in [12], in which the authors proposed the weighted average deep image quality measure (WaDIQaM) for both FR-IQA and BIQA. In addition to VGGNet, AlexNet [32] and ResNet [3, 32, 33] were also adopted as backbones in typical learning feature-based approaches for BIQA.

Recent studies in blind image quality assessment (BIQA) have aimed to construct more powerful architectures to tackle the challenges in this task. Similar to DB-CNN, Yue et al. proposed a dual-branch network for the quality assessment of screen content images [34]. The original image was first decomposed into predicted and unpredicted portions, which were then fed into two branches for feature extraction. Su et al. [4] proposed a model that learns multiscale features of distorted images and estimates the quality score in a self-adaptive manner through a hypernetwork. In contrast to previous supervised learning methods, Madhusudana et al. [35] considered the image quality prediction problem in a self-supervised manner. They used an unlabelled image dataset containing both synthetic and authentic distortions to train a CNN model. Furthermore, Zhang et al. [36] proposed a unified BIQA model optimized by a pairwise learning-to-rank training strategy to overcome the cross-distortion-scenario challenge. Moreover, Golestaneh et al. [37] extracted both local and nonlocal features for BIQA using a hybrid approach that benefits from CNNs and the self-attention mechanism in transformers. Zhang et al. [38] proposed a continual learning approach that incorporates the concept of distillation learning to address the catastrophic forgetting caused by the growth of new IQA databases. Similarly, Liu et al. [39] proposed a lifelong blind image quality assessment (LIQA) approach to effectively mitigate catastrophic forgetting in cases of continually emerging distortion types and even dataset shifts.

3. The Proposed Method

In this paper, we present a novel approach for image quality assessment that leverages multiple perspectives to better represent image content and distortion. The proposed method utilizes information from different aspects of an image to better capture its quality characteristics. Figure 3 shows the overall architecture of our proposed method. Our model includes two content-aware subnetworks, namely, the Master Network and the Assistant Network, which learn quality prediction from two different perspectives. To construct the Master Network and the Assistant Network, we adopt the hypernetwork architecture of HyperIQA [4], which demonstrates powerful feature learning capability and content awareness.

We use two subnetworks because we design the two networks to collaborate on image quality prediction, with each network associated with a different perspective. We name these two subnetworks the Master Network and the Assistant Network based on their roles in the test phase. More specifically, during the training stage, the Master Network and the Assistant Network interact and provide each other with valuable cues from different perspectives. This collaboration allows them to assist each other in learning more cues effectively. However, in the test phase, only the Master Network is used for image quality prediction. To achieve this goal, we introduce a perspective consistency training strategy to integrate the two perspectives learned by the two networks for the BIQA problem. We discuss more details of the proposed model in the following subsections.

3.1. Multiperspective BIQA Model

A multiperspective strategy is applied to BIQA so that more beneficial cues are taken into consideration for the prediction task. To capture different perspectives for image quality assessment, we employ different feature extraction architectures that simulate distinct viewpoints. Specifically, we use two different ResNet modules (ResNet-50 and ResNet-18) as feature extractors to extract features from two different perspectives, which form the Master Network and the Assistant Network. The architecture details of the Master Network and the Assistant Network are shown in Figure 3. It can be seen that the architectural difference between the two networks lies only in the backbone network used for feature extraction (ResNet-50 for the Master Network and ResNet-18 for the Assistant Network). It needs to be clarified again that the only role difference between the Master Network and the Assistant Network is that the Master Network is used for quality prediction in the test phase. We denote the proposed network by $\phi = \{\phi^{M}, \phi^{A}\}$, where $\phi^{M}$ and $\phi^{A}$ represent the Master Network and the Assistant Network, and the superscripts $M$ and $A$ of $\phi^{M}$ and $\phi^{A}$ stand for "Master" and "Assistant," respectively. As shown in Figure 3, the Master Network and the Assistant Network are structurally independent of each other. The two networks are integrated through perspective consistency constraints. Since $\phi^{M}$ and $\phi^{A}$ share similar structures, we apply the unified notation $\phi^{x}$, where $x$ represents $M$ or $A$. Given an input image $I$, we learn the two subnetworks $\phi^{M}$ and $\phi^{A}$ so as to map the input image to a scalar score as

$$q^{x} = \phi^{x}(I; \theta^{x}), \quad x \in \{M, A\}, \tag{1}$$

where $\theta^{x}$ denotes the parameters of $\phi^{x}$ (the Master Network when $x$ is $M$ and the Assistant Network when $x$ is $A$) and $q^{x}$ represents the scalar quality score generated by $\phi^{x}$.

Next, we present more details about the two subnetworks $\phi^{x}$. From Figure 3, we can see that both the Master Network and the Assistant Network are composed of a feature extractor $FE^{x}$, a hypernetwork $H^{x}$, and a target network $T^{x}$, with $\theta_{fe}^{x}$, $\theta_{h}^{x}$, and $\theta_{t}^{x}$ as their parameters, where $x \in \{M, A\}$. Therefore, we rewrite $\phi^{x}$ as $\phi^{x} = \{FE^{x}, H^{x}, T^{x}\}$. To extract representative features from different perspectives, ResNet-50 and ResNet-18 are adopted to construct the feature extractors for the Master Network and the Assistant Network, respectively. Suppose that, for an input image $I$, the outputs of the four stages of ResNet (conv2_10, conv3_12, conv4_18, and conv5_9 in ResNet-50 and conv2_5, conv3_4, conv4_4, and conv5_4 in ResNet-18) are denoted as $s_{1}^{x}$, $s_{2}^{x}$, $s_{3}^{x}$, and $s_{4}^{x}$, where $x = M$ for the ResNet-50 used in the Master Network and $x = A$ for the ResNet-18 used in the Assistant Network. Then, we can extract multiscale features $F^{x}$ for the input image using the feature extractor $FE^{x}$ as

$$F^{x} = \mathrm{Concat}\big(\mathrm{GAP}(\mathrm{LDA}(s_{1}^{x})), \mathrm{GAP}(\mathrm{LDA}(s_{2}^{x})), \mathrm{GAP}(\mathrm{LDA}(s_{3}^{x})), \mathrm{GAP}(\mathrm{LDA}(s_{4}^{x}))\big), \tag{2}$$

where $\mathrm{Concat}$ represents the concatenation operation, and LDA and GAP are the local distortion aware module and global average pooling, respectively.
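To make equation (2) concrete, the following PyTorch-style sketch assembles the multiscale feature $F^{x}$ from the four ResNet stage outputs. It is a minimal illustration rather than our released implementation: the class name, the use of a 1x1 convolution in place of the full local distortion aware (LDA) module, and the number of LDA output channels are assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class MultiScaleExtractor(nn.Module):
    """Sketch of the feature extractor FE^x in equation (2): four ResNet stage
    outputs -> LDA (approximated here by a 1x1 conv) -> GAP -> concatenation."""
    def __init__(self, arch="resnet50", lda_channels=16):
        super().__init__()
        backbone = getattr(models, arch)(weights="IMAGENET1K_V1")
        # Stem and the four residual stages of ResNet.
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        expansion = 4 if arch == "resnet50" else 1
        stage_dims = [c * expansion for c in (64, 128, 256, 512)]
        # A 1x1 convolution stands in for the LDA module of each stage.
        self.lda = nn.ModuleList([nn.Conv2d(d, lda_channels, 1) for d in stage_dims])
        self.gap = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        feats = []
        x = self.stem(x)
        for stage, lda in zip(self.stages, self.lda):
            x = stage(x)
            feats.append(self.gap(lda(x)).flatten(1))   # GAP(LDA(s_j^x))
        # F^x is the concatenation of the pooled stage features; the raw last
        # stage output s_4^x is also returned because it feeds the hypernetwork.
        return torch.cat(feats, dim=1), x
```

The Master Network and the Assistant Network would each own one such extractor, differing only in the ResNet depth passed as `arch`.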

To cover wide image content variation, a dynamically generated parameter strategy is adopted to adaptively learn the quality perception rule according to the perceived content. A network that dynamically generates the parameters of another network is known as a hypernetwork [18]. In our model, the hypernetwork $H^{x}$ and the target network $T^{x}$ together form the quality regression module. The hypernetwork takes the output of the last stage of ResNet as input. For an input image $I$, the output of the hypernetwork is

$$\theta_{t}^{x} = H^{x}(s_{4}^{x}; \theta_{h}^{x}). \tag{3}$$

As the hypernetwork consists of three convolution layers and four hyperparameter modules (HPMs), the computing procedure of $\theta_{t}^{x}$ is as follows:

$$\theta_{t}^{x} = \big\{\mathrm{HPM}_{j}^{x}\big(\mathrm{Conv}^{x}(s_{4}^{x})\big)\big\}_{j=1}^{4}, \tag{4}$$

where $\mathrm{Conv}^{x}$ denotes the three convolution layers and the four HPMs produce the parameters of the four fully connected layers of the target network described below.

The target network $T^{x}$ takes the multiscale feature $F^{x}$ extracted by $FE^{x}$ as input and consists of four fully connected layers. The target network maps the multiscale feature extracted from image $I$ to a scalar quality score as

$$q^{x} = T^{x}(F^{x}; \theta_{t}^{x}). \tag{5}$$

As introduced previously, the target network and the hypernetwork together form the hypernetwork-based regression module. We replace the parameters $\theta_{t}^{x}$ of $T^{x}$ with the output of $H^{x}$, that is, we use $H^{x}(s_{4}^{x}; \theta_{h}^{x})$ instead of $\theta_{t}^{x}$ in equation (5). Therefore, equation (5) can be rewritten as

$$q^{x} = T^{x}\big(F^{x}; H^{x}(s_{4}^{x}; \theta_{h}^{x})\big). \tag{6}$$

Thus, given an input image $I$, equation (6) computes the scalar quality scores $q^{M}$ and $q^{A}$ of the Master Network $\phi^{M}$ and the Assistant Network $\phi^{A}$, respectively.
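The interplay of equations (3)-(6) can be sketched as follows: the hypernetwork reduces the last-stage feature $s_{4}^{x}$ with three convolution layers and then uses four hyperparameter modules to emit the weights and biases of a four-layer fully connected target network, which scores the multiscale feature $F^{x}$. The layer widths used below (512, 256, and 112 in the reduction, and the hidden sizes of the target network) are illustrative assumptions, not the exact values of HyperIQA or of our released code.

```python
import torch
import torch.nn as nn

class HyperRegressor(nn.Module):
    """Sketch of H^x and T^x in equations (3)-(6): the hypernetwork maps the
    last ResNet stage s_4^x to the parameters of a small fully connected
    target network, which then maps the multiscale feature F^x to a score."""
    def __init__(self, s4_channels, feat_dim, hidden=(112, 56, 28)):
        super().__init__()
        dims = [feat_dim, *hidden, 1]                  # target-network layer sizes
        self.reduce = nn.Sequential(                   # the "three convolution layers"
            nn.Conv2d(s4_channels, 512, 1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 256, 1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 112, 1), nn.AdaptiveAvgPool2d(1))
        # One hyperparameter module (HPM) per fully connected layer of T^x.
        self.hpm_w = nn.ModuleList([nn.Linear(112, dims[i] * dims[i + 1])
                                    for i in range(4)])
        self.hpm_b = nn.ModuleList([nn.Linear(112, dims[i + 1]) for i in range(4)])
        self.dims = dims

    def forward(self, s4, feat):
        z = self.reduce(s4).flatten(1)                 # equations (3)-(4): generate theta_t^x
        out = feat
        for i in range(4):                             # target network T^x, equation (6)
            w = self.hpm_w[i](z).view(-1, self.dims[i + 1], self.dims[i])
            b = self.hpm_b[i](z)
            out = torch.bmm(w, out.unsqueeze(-1)).squeeze(-1) + b
            if i < 3:
                out = torch.relu(out)
        return out.squeeze(-1)                         # scalar quality score q^x
```

Note that the generated weights are specific to each input image, which is what makes the quality regression content aware.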

3.2. Multiperspective Consistency-Based Model Training

The main idea behind our proposed method is to use cues learned from different perspectives for image quality assessment. To meet this goal, we need to not only consider the specific features of each individual perspective but also effectively use the generality cues shared by different perspectives. Based on these requirements, we design a training objective function for the proposed network $\phi$ as follows.

Let $D = \{(I_{i}, y_{i})\}_{i=1}^{N}$ be a training set, where $I_{i}$ is the $i$-th training image and $y_{i}$ represents the ground truth (mean opinion score (MOS) or difference mean opinion score (DMOS)) for $I_{i}$. Then, to use the specificity features of each perspective, we train both the Master Network and the Assistant Network so that the predicted scalar scores are as close to the ground-truth scores as possible. We use the 1-norm to evaluate the distance between the predicted score and the ground truth and obtain the perspective specificity loss term $\mathcal{L}_{s}^{x}$ for the individual subnetworks as

$$\mathcal{L}_{s}^{x} = \frac{1}{N}\sum_{i=1}^{N}\big\|\phi^{x}(I_{i}; \theta^{x}) - y_{i}\big\|_{1}, \quad x \in \{M, A\}. \tag{7}$$

The subscript $s$ of $\mathcal{L}_{s}^{x}$ indicates specificity. Note that when $x = M$, the specificity loss term is for the Master Network $\phi^{M}$, and when $x = A$, the specificity loss term is for the Assistant Network $\phi^{A}$.

The generality of the two perspectives enables the two subnetworks to learn a unified representation of image content and distortion from different aspects. This integration strategy is a consistency constraint between perspectives. We propose a multiperspective consistency loss term $\mathcal{L}_{c}$, where the subscript $c$ refers to perspective consistency, to constrain each subnetwork to learn under the supervision of the other. The $\ell_{1}$ loss is used to measure the difference between the outputs of the two perspectives. Thus, the multiperspective consistency loss term can be defined as

$$\mathcal{L}_{c} = \frac{1}{N}\sum_{i=1}^{N}\big\|\phi^{M}(I_{i}; \theta^{M}) - \phi^{A}(I_{i}; \theta^{A})\big\|_{1}. \tag{8}$$

Note that the Master Network and the Assistant Network share the same consistency loss term, that is,

$$\mathcal{L}_{c}^{M} = \mathcal{L}_{c}^{A} = \mathcal{L}_{c}. \tag{9}$$

After defining the specificity loss term $\mathcal{L}_{s}^{x}$ and the perspective consistency loss term $\mathcal{L}_{c}$, we finally obtain the optimization loss functions for $\phi^{M}$ and $\phi^{A}$ as equations (10) and (11), respectively:

$$\mathcal{L}^{M} = \mathcal{L}_{s}^{M} + \lambda^{M}\mathcal{L}_{c}, \tag{10}$$

$$\mathcal{L}^{A} = \mathcal{L}_{s}^{A} + \lambda^{A}\mathcal{L}_{c}, \tag{11}$$

where $\lambda^{M}$ and $\lambda^{A}$ are the weights of the perspective consistency term for the two subnetworks.
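As a concrete reading of equations (7)-(11), the sketch below computes the two loss functions from the predicted scores of a minibatch. The $\ell_{1}$ form of the consistency term and the use of detach() on the opposite network's output (so that each subnetwork is supervised by, but does not back-propagate through, the other) are assumptions about one possible implementation, not a description of the released code.

```python
import torch.nn.functional as F

def multiperspective_losses(q_m, q_a, y, lam_m, lam_a):
    """Sketch of equations (7)-(11). q_m, q_a: predicted scores of the Master
    and Assistant Networks for a minibatch; y: MOS/DMOS ground truth."""
    spec_m = F.l1_loss(q_m, y)              # L_s^M, equation (7) with x = M
    spec_a = F.l1_loss(q_a, y)              # L_s^A, equation (7) with x = A
    cons_m = F.l1_loss(q_m, q_a.detach())   # consistency seen from the Master side
    cons_a = F.l1_loss(q_a, q_m.detach())   # consistency seen from the Assistant side
    loss_m = spec_m + lam_m * cons_m        # L^M, equation (10)
    loss_a = spec_a + lam_a * cons_a        # L^A, equation (11)
    return loss_m, loss_a
```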

With the loss functions defined above for the Master Network $\phi^{M}$ and the Assistant Network $\phi^{A}$, the Adam algorithm is used as the optimizer for the parameters of the proposed network $\phi$. The training procedure is described in Algorithm 1, in which $\mathrm{Adam}(\cdot)$ adjusts the original gradient using adaptive moment estimation, and $\theta^{M}$ and $\theta^{A}$ are updated with learning rates $\eta^{M}$ and $\eta^{A}$, respectively; the outputs $q^{M}$ and $q^{A}$ of $\phi^{M}$ and $\phi^{A}$ are computed with equation (6). In the procedure of Algorithm 1, the Master Network and the Assistant Network play the same role during training. However, once we have obtained the trained network $\phi$, only the subnetwork $\phi^{M}$ (Master Network) is used for image quality score prediction in the test phase. This is the reason we name $\phi^{M}$ the "Master Network."

Input: training set $D = \{(I_{i}, y_{i})\}_{i=1}^{N}$, learning rates $\eta^{M}$ and $\eta^{A}$, number of epochs $Epochs$, weights $\lambda^{M}$ and $\lambda^{A}$
Output: trained network $\phi = \{\phi^{M}, \phi^{A}\}$
Initialize the parameters $\theta^{M}$ of $\phi^{M}$ and $\theta^{A}$ of $\phi^{A}$: initialize the feature extractors with parameters pretrained on ImageNet and initialize the remaining parameters randomly; $e \leftarrow 1$;
while $e \le Epochs$ do
 while fetching a minibatch $B$ from $D$ do
  Compute the target network parameters $\theta_{t}^{M}$ and $\theta_{t}^{A}$ using equation (3);
  Compute $q^{M}$ and $q^{A}$ for the images in the minibatch using equation (6);
  Compute $\mathcal{L}^{M}$ and $\mathcal{L}^{A}$ using equations (10) and (11), respectively;
  Compute the gradients and update $\theta^{M}$ and $\theta^{A}$ using the Adam optimizer as
   $\theta^{M} \leftarrow \theta^{M} - \eta^{M}\,\mathrm{Adam}(\nabla_{\theta^{M}}\mathcal{L}^{M})$
   $\theta^{A} \leftarrow \theta^{A} - \eta^{A}\,\mathrm{Adam}(\nabla_{\theta^{A}}\mathcal{L}^{A})$
 end
 $e \leftarrow e + 1$;
end
return $\phi = \{\phi^{M}, \phi^{A}\}$;
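A Python rendering of Algorithm 1, reusing the `multiperspective_losses` helper sketched above, could look as follows. The optimizer settings mirror Section 4.3, while the single joint backward pass is an implementation convenience: because the consistency targets are detached, the gradient of each loss only reaches its own subnetwork's parameters.

```python
import torch

def train(master, assistant, loader, lam_m, lam_a,
          epochs=15, lr=2e-5, weight_decay=5e-4):
    """Sketch of Algorithm 1: both subnetworks are optimized jointly, each with
    its own Adam optimizer; only the Master Network is kept for testing."""
    opt_m = torch.optim.Adam(master.parameters(), lr=lr, weight_decay=weight_decay)
    opt_a = torch.optim.Adam(assistant.parameters(), lr=lr, weight_decay=weight_decay)
    for _ in range(epochs):
        for images, mos in loader:
            q_m = master(images)             # equations (3) and (6) inside the forward pass
            q_a = assistant(images)
            loss_m, loss_a = multiperspective_losses(q_m, q_a, mos, lam_m, lam_a)
            opt_m.zero_grad(); opt_a.zero_grad()
            (loss_m + loss_a).backward()     # gradients of L^M and L^A, equations (10)-(11)
            opt_m.step(); opt_a.step()       # Adam updates of theta^M and theta^A
    return master                            # the Assistant Network is discarded at test time
```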

4. Experimental Results and Discussion

4.1. Datasets

To test our proposed model on both authentically and synthetically distorted images, three authentic distortion image databases, namely, LIVE Challenge (LIVEC) [40], KonIQ-10k [41], and BID [42], and two synthetic distortion databases, namely, LIVE [43] and CSIQ [44], are used for evaluation. The score type of the three authentic distortion image databases LIVEC, KonIQ-10k, and BID is MOS, while the score type of LIVE and CSIQ is DMOS. The synthetic distortion dataset LIVE contains five different types of distortion, including JP2K (JPEG2000) compression, JPEG compression, White Gaussian Noise (WN), Gaussian Blurring (GB), and Fast Fading (FF). Similarly, CSIQ contains six types of distortions, including JP2K compression, JPEG compression, additive White Gaussian Noise (WN), additive Pink Gaussian Noise (PN), global Contrast Decrements (CD), and Gaussian Blurring (GB). More details about the image number, score range, etc., of each dataset are shown in Table 1.

4.2. Comparison Methods and Evaluation Metrics

To evaluate the performance of our proposed model, thirteen state-of-the-art BIQA methods are selected for comparison. Among the comparison methods, ILNIQE [24] and BRISQUE [13] are handcrafted feature-based approaches. The other approaches including HOSA [30], BIECON [15], WaDIQaM [12], SFA [33], PQR [32], DB-CNN [16], HyperIQA [4], CONTRIQUE [35], UNIQUE [36], GraphIQA [45], and TReS [37] are learning feature or deep learning-based methods.

We employ two commonly used criteria, namely, Spearman's rank-order correlation coefficient (SRCC) and Pearson's linear correlation coefficient (PLCC), to evaluate the performance of the proposed method and the compared methods. Before computing PLCC, the predicted quality scores are first processed by a four-parameter logistic regression to remove the nonlinearity of human visual ratings, following the report of the Video Quality Expert Group [46]. A better method produces higher SRCC and PLCC values, both of which range between −1 and 1. The definitions of SRCC and PLCC are as follows:

$$\mathrm{SRCC} = 1 - \frac{6\sum_{i=1}^{N} d_{i}^{2}}{N(N^{2}-1)}, \tag{12}$$

$$\mathrm{PLCC} = \frac{\sum_{i=1}^{N}(y_{i}-\bar{y})(\hat{y}_{i}-\bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N}(y_{i}-\bar{y})^{2}}\sqrt{\sum_{i=1}^{N}(\hat{y}_{i}-\bar{\hat{y}})^{2}}}, \tag{13}$$

where $d_{i}$ is the rank difference between the MOS and the predicted score of the $i$-th image and $N$ represents the number of images. $y_{i}$ and $\hat{y}_{i}$ refer to the MOS and the predicted score of the $i$-th image, respectively, and $\bar{y}$ and $\bar{\hat{y}}$ are the corresponding mean values over all images.
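Both criteria map directly onto standard library routines. The following sketch computes them from arrays of predicted scores and subjective scores; the exact four-parameter logistic form is an assumption consistent with common IQA practice, not necessarily the one used in our experiments.

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr
from scipy.optimize import curve_fit

def evaluate(pred, mos):
    """SRCC on raw predictions (equation (12)); PLCC after a four-parameter
    logistic mapping (equation (13)), following the VQEG recommendation."""
    srcc = spearmanr(pred, mos).correlation

    def logistic(x, b1, b2, b3, b4):
        return (b1 - b2) / (1 + np.exp(-(x - b3) / np.abs(b4))) + b2

    p0 = [np.max(mos), np.min(mos), np.mean(pred), np.std(pred) + 1e-6]
    params, _ = curve_fit(logistic, pred, mos, p0=p0, maxfev=10000)
    plcc, _ = pearsonr(logistic(pred, *params), mos)
    return srcc, plcc
```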

4.3. Implementation Details

We train and test our model on an NVIDIA Tesla K40 graphics card with 12 GB of video memory. The Adam optimizer with a learning rate of 2e-5 and a weight decay of 5e-4 is employed to train the network for 15 epochs with a batch size of 48. In addition, we set the consistency weights $\lambda^{M}$ and $\lambda^{A}$ to the same value throughout the experiments; the selection of this value is discussed in Section 4.5.1. We employ the same experimental protocol as HyperIQA [4]. Specifically, we split each dataset into a training set and a test set at a ratio of 4:1. Note that the synthetic distortion image datasets LIVE and CSIQ are split into training and test sets according to reference images to avoid content overlapping. In the test phase, we randomly select $K$ cropped subimages from each test image $I$. The final quality score of the test image is defined as the mean of the scores of all subimages predicted by the Master Network as follows:

$$q(I) = \frac{1}{K}\sum_{k=1}^{K}\phi^{M}(I_{k}; \theta^{M}), \tag{14}$$

where $I_{k}$ is the $k$-th cropped subimage of $I$, and $K$ is set to 25 for all the test datasets. We repeat the experiment 10 times, implementing a random train-test split each time. The median SRCC and PLCC values are used as the final results. For more details, refer to our released source code at https://github.com/gn-share/multi-perspective.
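Equation (14) corresponds to the following test-time routine. The crop size and the use of torchvision transforms are assumptions made for illustration, since the protocol above only fixes the number of crops.

```python
import torch
import torchvision.transforms as T

def predict_quality(master, pil_image, num_crops=25, crop_size=224):
    """Average the Master Network's predictions over random crops of a test
    image, as in equation (14). The crop size is an assumed value."""
    crop = T.Compose([T.RandomCrop(crop_size), T.ToTensor()])
    master.eval()
    with torch.no_grad():
        scores = [master(crop(pil_image).unsqueeze(0)).item()
                  for _ in range(num_crops)]
    return sum(scores) / num_crops
```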

4.4. Performance Evaluation
4.4.1. Quantitative Evaluation

First, we conduct experiments on single datasets and summarize the results in Table 2. The colors red, blue, and green refer to the highest, second highest, and third highest scores among all compared methods. For the three authentic distortion datasets, the results indicate that our proposed method outperforms the others on LIVEC and BID; on KonIQ-10k, both the SRCC and PLCC of our method are second only to those of TReS. For the two synthetic distortion datasets, our proposed model achieves the best results on LIVE for both SRCC and PLCC, and its SRCC and PLCC are the second and third largest on CSIQ. We highlight that (1) our method outperforms HyperIQA on all five test datasets; (2) compared with the state-of-the-art HyperIQA [4], UNIQUE [36], CONTRIQUE [35], GraphIQA [45], and TReS [37], our proposed model achieves the best results on all authentic distortion datasets except KonIQ-10k, on which our method is only weaker than TReS; (3) our proposed method also achieves competitive results on the two synthetic distortion datasets (i.e., LIVE and CSIQ), and in particular, our model shows a significant performance improvement over HyperIQA; (4) the average SRCC and PLCC of our method are larger than those of all compared approaches, which indicates that the overall performance of our proposed model is better than that of the compared methods.

Then, we further conduct experiments to evaluate the performance of our approach on different distortion types. As not all comparison methods reported SRCC values for individual distortions, we only show the results of the methods reported in [4] in Table 3. Table 3 presents the SRCC values of each method on the individual distortions of the LIVE and CSIQ datasets. Based on the experimental results, our approach outperforms the compared methods on four of the five distortion types of the LIVE dataset. For the Gaussian blurring (GB) distortion of the LIVE dataset, our approach obtains the second largest SRCC (0.956), which is only lower than the result of BRISQUE (0.964). For the CSIQ dataset, our method achieves the best results on three of the six distortion types, while WaDIQaM obtains the best results on two of the six distortion types, as shown in Table 3. Note that WaDIQaM also has the best performance on the CSIQ dataset in Table 2. Overall, our method is more effective than the other methods on the individual distortion test.

To validate the generalization ability of our proposed method, we run cross-database tests for performance evaluation. Due to the lack of source code and reported results, three competitive methods, PQR, DB-CNN, and HyperIQA, are selected for comparison. We use four test protocols, which are (1) train on LIVEC and test on BID, (2) train on BID and test on LIVEC, (3) train on LIVE and test on CSIQ, and (4) train on CSIQ and test on LIVE. The first two test protocols are for authentic distortion, and the last two are for synthetic distortion. The SRCC values of each comparison method under each test protocol are summarized in Table 4. The results show that our approach significantly outperforms the compared methods in all four cross-database test cases. Specifically, when using LIVEC for training and BID for testing, the SRCC value of our approach is 0.882, which is much higher than the second largest SRCC of 0.762 (DB-CNN).

4.4.2. Qualitative Evaluation

In addition, to intuitively evaluate the performance of our proposed method, we present the scoring results on the authentic distortion dataset LIVEC and the synthetic distortion dataset LIVE in this section. Figure 4 shows scoring results for images of LIVEC, from which we can see that our proposed method produces remarkable results in the 1st to 4th columns although the content of the images in LIVEC varies. Some failure cases are observed, as shown in the last column of Figure 4; these correspond to two images, one with severe distortion and the other with very high quality.

Moreover, Figures 5 and 6 show scoring results for distorted images from the LIVE dataset. Figure 5 shows the predicted scores and the corresponding standard deviations of the plane images distorted by five different types of distortion. The results indicate that our method produces prediction scores close to the DMOS for images with different distortions, with satisfactory standard deviations (a maximum std of 7.06 for the FF-distorted image and a minimum std of 1.09 for the WN-distorted image). Figure 6 presents the prediction scores for images with the same distortion (GB) but different distortion intensities. The distortion intensity increases from left to right, and our model generates prediction scores that are consistent with expectations.

Figure 7 shows the scatter plots of DMOS/MOS versus predicted scores on the test sets of LIVE and LIVEC. The blue solid line represents the fitted line over all scatter points, while the red dashed line represents the desired fitting line. From the scatter plots we conclude that, on the one hand, the scatter points are distributed along the desired line. On the other hand, the result on LIVE is better than that on LIVEC, which is consistent with the quantitative results shown in Table 2.

4.5. Effect of Different Experimental Settings

In this section, more experiments are conducted to explore the effect of different hyperparameters, backbone architectures, and training set size.

4.5.1. Selection of Weight Parameter Value for Perspective Consistency Term

As the loss functions for $\phi^{M}$ and $\phi^{A}$ are composed of two terms, namely, the perspective specificity loss term $\mathcal{L}_{s}^{x}$ and the multiperspective consistency loss term $\mathcal{L}_{c}$, we further conduct experiments to verify the performance of our proposed model under different weight values. During the implementation, we always set $\lambda^{M}$ and $\lambda^{A}$ to the same value. Table 5 presents the SRCC and PLCC values on the LIVE and LIVEC datasets for $\lambda^{M}$ and $\lambda^{A}$ ranging from 0.0 to 0.9 with a step length of 0.3. The results demonstrate that both SRCC and PLCC reach their maximum at the same nonzero weight value on both the LIVE and LIVEC datasets. Note that $\lambda^{M} = \lambda^{A} = 0$ means that the model does not use the perspective consistency constraint during training. The results on both LIVE and LIVEC indicate that the perspective consistency term can improve the performance of the algorithm. However, an improper weight value may lead to degradation of the model's performance (e.g., for some weight settings, both SRCC and PLCC values are lower than those obtained without the perspective consistency term).

4.5.2. Evaluation of Architectures of Master Network and Assistant Network

The network architecture significantly affects the performance of the algorithm. We therefore experimentally compare different network structures to test their effectiveness. We test ResNet-18 and ResNet-50 for both the Master Network and the Assistant Network and obtain the results of the four combinations shown in Table 6. The results indicate that, on one hand, the model performs better when ResNet-50 is used as the architecture of $\phi^{M}$. On the other hand, using different architectures for $\phi^{M}$ and $\phi^{A}$ is more advantageous than using the same architecture. As a result, we apply ResNet-50 and ResNet-18 for $\phi^{M}$ and $\phi^{A}$, respectively, which yields the best test performance on both LIVE and LIVEC, as shown in Table 6. We further implement t-tests between the average SRCC values of the above $\phi^{M}$ and $\phi^{A}$ combinations to ascertain whether the differences are significant. The test results indicate that the architecture $\phi^{M}$ = ResNet-50 and $\phi^{A}$ = ResNet-18 is significantly better than the combination $\phi^{M}$ = ResNet-18 and $\phi^{A}$ = ResNet-50 (p = 0.0496 on LIVE, with a similarly significant p value on LIVEC) and the combination $\phi^{M}$ = ResNet-18 and $\phi^{A}$ = ResNet-18 (a similarly significant p value on LIVE and p = 0.0202 on LIVEC). However, the performance advantage of the architecture $\phi^{M}$ = ResNet-50 and $\phi^{A}$ = ResNet-18 over the combination $\phi^{M}$ = ResNet-50 and $\phi^{A}$ = ResNet-50 is not significant, with p = 0.2946 on LIVE and a similarly insignificant p value on LIVEC.

4.5.3. Performance of the Proposed Method Using Different Sizes of the Training Set

In this section, we conduct experiments to verify the performance of our proposed model with different sizes of the training set. The relationship between the performance of the proposed method and the proportion of the training set is shown in Figure 8. It is obvious that both SRCC and PLCC on LIVE and LIVEC gradually decrease as the proportion of the training set decreases, which indicates the importance of training data. Nevertheless, the model still performs well even when trained with only 20% of the samples of the whole dataset, which further verifies the effectiveness of our proposed method.

4.6. More Results for the Multiperspective Strategy on Different Backbone Networks

To further verify the proposed method, we conduct more experiments using different backbone networks, including VGGNet [31], DenseNet [47], ResNet [17], and GoogleNet [48]. Additionally, to better study the performance impact of different backbone networks, we do not use the hypernetwork architecture in the experiments of this section. Specifically, we remove the last softmax layer of each backbone network and combine it with a three-layer MLP to form a baseline. It should be noted that we use the same MLP in the following experiments. The layers of the MLP contain 512, 32, and 1 neurons, respectively, and ReLU is used as the activation function for the first two layers. The weights for the perspective consistency terms follow the same setting as in the previous experiments. The SRCC and PLCC values of these baseline methods on LIVE and LIVEC are listed in Table 7. The results show that (1) ResNet-50 and DenseNet-169 achieve better performance than the other baselines, especially on the LIVEC dataset, and (2) for models of the same type, deeper networks (i.e., VGG-16, DenseNet-169, and ResNet-50) perform better.
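For reference, such a baseline can be assembled as below: the backbone's classification head is replaced by an identity mapping and followed by the shared three-layer MLP described above. The function name and the way the classifier is stripped (via the `fc` attribute, which applies to the ResNet and GoogLeNet families) are illustrative; VGG and DenseNet expose their classifiers under different attributes.

```python
import torch.nn as nn
import torchvision.models as models

def make_baseline(arch="resnet50"):
    """Baseline of Section 4.6: backbone features (classification layer removed)
    followed by a 3-layer MLP with 512, 32, and 1 neurons; ReLU follows the
    first two layers."""
    backbone = getattr(models, arch)(weights="IMAGENET1K_V1")
    feat_dim = backbone.fc.in_features    # ResNet/GoogLeNet expose the classifier as `fc`
    backbone.fc = nn.Identity()           # drop the last classification (softmax) layer
    head = nn.Sequential(
        nn.Linear(feat_dim, 512), nn.ReLU(inplace=True),
        nn.Linear(512, 32), nn.ReLU(inplace=True),
        nn.Linear(32, 1))
    return nn.Sequential(backbone, head)
```

Two such baselines with different backbones can then be trained with the same multiperspective consistency strategy to produce the combinations reported in Table 8.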

To verify the effectiveness of the proposed multiperspective strategy, we conduct experiments to test the performance of different combinations of the baseline networks. Table 8 shows the SRCC and PLCC values of different combinations of backbone networks. From the results in Tables 7 and 8, we conclude that the proposed multiperspective strategy brings a significant improvement for all baselines (i.e., DenseNet-169, ResNet-50, GoogleNet, and VGG-16). For example, the SRCC value on LIVEC increases from 0.833 (line 6 of Table 7) to 0.857 (line 4 of Table 8) if we use ResNet-50 to assist DenseNet-169, and the SRCC value on LIVE increases from 0.959 (line 9 of Table 7) to 0.974 (lines 12 and 13 of Table 8) if we use ResNet-50 or VGG-16 to assist GoogleNet.

4.7. Discussion

The main idea of this study is to improve the performance of BIQA with valuable cues learned from different perspectives. To achieve this, we utilize different architectures to construct different perspectives and propose a multiperspective consistency-based training strategy. Specifically, the architecture of HyperIQA is used to construct the different perspectives. As a BIQA model designed for real-world images, HyperIQA achieves competitive performance on authentic distortion datasets. However, its performance on synthetically distorted images is relatively limited. This discrepancy can potentially be attributed to the limited content diversity of synthetic distortion datasets, which poses challenges for effective model learning. The proposed multiperspective strategy can alleviate this challenge, as it takes advantage of both the specific features and the generality cues of different perspectives. By considering information from different perspectives, the proposed strategy enables a mutual exchange of valuable insights while also constraining each perspective to reduce the risk of overfitting.

Moreover, the selection of the Master Network has a great impact on the performance of our proposed model. Based on the experimental results, we adopt the ResNet-50-based HyperIQA as the Master Network. Deeper networks generally have stronger representation abilities, so intuitively it seems reasonable to select the deeper network among the perspectives as the Master Network. In fact, deeper networks usually achieve better performance with proper training configurations. The experimental results in Tables 7 and 8 support this view to some extent: the results in Table 8 show that using a deeper network as the Master Network always performs better.

Lastly, while our proposed model achieves better overall performance than other state-of-the-art methods, it is important to acknowledge the notable advancements made by some recent models. For instance, some of the latest models (e.g., the transformer-based TReS [37] and the graph convolutional network (GCN)-based GraphIQA [45]) show great promise, especially on large-scale data. In addition, some early deep models still show competitive performance. For example, the overall performance of the dual-stream network DB-CNN is very close to that of some recent models, which is still instructive for BIQA model design.

5. Conclusions

In this paper, we propose a novel model for the BIQA problem to increase the adaptability of the BIQA model to image content variation and a diverse range of distortions. To represent the image from different aspects, we employ a multiperspective strategy that incorporates more cues. Specifically, we present a perspective consistency constrained training strategy to integrate the different perspectives effectively, considering both the multiperspective cues and the complexity of the network. Extensive experimental results show that our proposed approach achieves promising performance on both authentic and synthetic distortion image databases compared with state-of-the-art methods. Moreover, the generalization ability of the proposed method is also remarkable, further enhancing its practical applicability.

Despite the competitive performance of the proposed multiperspective method and recent models, there is still considerable room for improvement in effectively handling authentic images with diverse content and distortion. Moving forward, our future research will focus on exploring architectures with enhanced representational capabilities, leveraging transformers, GCNs, and other advanced techniques. Additionally, we aim to conduct further investigations into the multiperspective integration strategy to enhance the model's adaptability to a wide range of content and distortion variations.

Data Availability

All datasets supporting this study are publicly available.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under grant no. 61866031.