Abstract

Measuring image similarity, also known as metric learning (ML) in computer vision, is a key step in many advanced image tasks. Nevertheless, existing well-performing approaches to image similarity measurement focus only on the image itself without utilizing information from other modalities, even though images frequently appear together with descriptive text. Furthermore, those methods require human supervision, yet most images in the real world are unlabeled. Considering the above problems comprehensively, we present a novel visual similarity metric model named PTF-SimCM. It adopts a self-supervised contrastive structure like SimSiam and incorporates a multimodal fusion module to exploit the textual modality correlated with the image. We apply a cross-modal model to the text modality, rather than a standard unimodal text encoder, to improve the efficiency of late fusion. In addition, the proposed model employs Sentence PIE-Net to address the issue caused by polysemous sentences. For simplicity and efficiency, our model learns a specific embedding space in which distances directly correspond to similarity. Experimental results on the MSCOCO, Flickr 30k, and Pascal Sentence datasets show that our model overall outperforms all the compared methods in this work, which illustrates that the model can effectively address these issues and improve the performance of unsupervised visual similarity measurement.

1. Introduction

During the past decades, metric learning (ML) in computer vision (CV), also known as image similarity measurement, has been a fundamental problem in a variety of image applications, including image retrieval [1–3], face recognition [4–7], and visual search [8–11]. The goal of metric learning is to learn an embedding space in which the mapped feature vectors of similar instances are encouraged to be closer, while samples of different categories are pushed apart from each other [12–14]. In other words, the image similarity metric aims to estimate whether a given pair of images is similar or not.

Deep metric learning (DML) is a measurement technique that combines deep learning (DL) with metric learning. With the recent rapid development of deep neural networks (DNN) in computer vision, DML has drawn growing attention and has become a mainstream metric learning method. Most previous deep metric approaches rely on supervised learning, meaning that labels must be provided to the model along with the input data. However, most data in the real world are unlabeled, and annotating a large-scale dataset is time-consuming and expensive. Consequently, learning effective visual-metric representations without human supervision is a crucial problem. Research on this problem is also known as Unsupervised Learning (UL). Self-Supervised Learning (SSL) can be regarded as a particular type of UL with a supervised form, where supervision is induced by self-supervised tasks rather than preset prior knowledge. SSL has delivered promising results in recent years, and most mainstream approaches fall into one of two classes: generative or discriminative [15]. Generative techniques learn to reconstruct the original input [16, 17], while discriminative methods learn representations using objective functions. Discriminative approaches are similar to supervised learning but require specific pretext tasks [18, 19] in which both inputs and labels are derived from an unlabeled dataset. Recently, discriminative approaches based on contrastive learning have achieved outstanding results in unsupervised visual representation. These approaches [20–22] show significant potential to close the performance gap with supervised methods, or even to match them on specific downstream tasks and datasets. Therefore, introducing SSL into image similarity measurement is of great research value.

In addition, current image similarity measurement techniques based on deep metric learning frequently focus on the image alone without further utilizing the information of other modalities correlated with the image. Nevertheless, pictures usually do not appear independently in the real world but come with other modal information such as text and sound, especially in social media. For example, when people browse images in social media applications like Twitter and Tumblr, the pictures are almost always accompanied by text descriptions. Figures 1(a) and 1(b) are images selected randomly from Tumblr and Twitter, respectively, and both images have related sentences. Thus, when searching for relevant images in social apps, the results would be better if we made good use of the text information that appears with the pictures. Besides, online shopping malls like Amazon and eBay often provide a text description of a product when listing it. Figure 1(c) shows an image of a sweater jacket randomly picked from Amazon Marketplace, and a detailed product description can be found below the image. In this way, we can use the relevant description information to obtain more accurate product candidates when exploring similar products in online shopping. From the above two application scenarios, it can be concluded that the textual information that makes the image more specific will be lost if we focus only on the image itself when measuring image similarity.

However, text sometimes appears ambiguous in real life. For example, “he left the bank” has two interpretations. One is that he is now far away from the bank, as in “he left the bank five minutes ago.” Another is that he no longer works at the bank, as in “he left the bank five years ago.” In other words, each of these ambiguous sentences can map to more than one point in the embedding space. Nevertheless, an injective model [23], a one-to-one mapper, can embed only a single point from an ambiguous text input. That is to say, the semantics produced by an injective model will likely deviate from the ones we expect. Therefore, if these ambiguous sentences are used directly in the image similarity metric, they will not only fail to improve the accuracy of the algorithm but may also have a negative impact on the results.

To address the above issues, we propose a Simple Contrastive Model with Polysemous Text Fusion (PTF-SimCM) for the visual similarity metric in this work. The architecture of PTF-SimCM consists of two branches and a middle cross-modal encoder. Of all the current best contrastive learning models, SimSiam [24] is the simplest, requiring neither negative samples nor a large batch size. Hence, the two branches in our model adopt the same asymmetric contrastive structure as SimSiam to learn visual-metric representations without human supervision. It contains upper and lower subnets with shared weights. Besides, we use a cross-modal model for encoding input sentences rather than a unimodal text model to reduce the computation of late feature fusion and improve training efficiency. For the problem of polysemy in text fusion, we employ the pretrained Sentence PIE-Net [23] as our cross-modal encoder and fuse the output embeddings with image features. Moreover, to make image similarity measurement simpler and more efficient, PTF-SimCM directly learns an embedding space where distances correspond to a measure of image similarity. In short, our contributions are summarized as follows:

(i) We propose a novel method based on the contrastive learning structure for unsupervised visual similarity measuring.

(ii) For better model performance, we further pay attention to information from the textual modality correlated with the image by effectively combining multimodal fusion with the contrastive structure.

(iii) To address the problem caused by polysemous sentences and improve the efficiency of late feature fusion, we exploit a specific cross-modal net to process the input text modality rather than a standard unimodal text sequence model.

(iv) We obtain competitive results on the MSCOCO [25], Flickr 30k [26], and Pascal Sentence [27] datasets compared with methods that do not consider textual information in image similarity measurement or cannot process unlabeled training data.

The rest of this paper is organized as follows. Section 2 reviews and discusses the related works in this area. Section 3 characterizes the proposed PTF-SimCM model in detail, followed by the implementation details, performance comparison experimental results, and ablation study in Section 4. Conclusions are in Section 5.

2. Related Work

This section briefly reviews some of the research most relevant to our work. The structure of this section is as follows. Section 2.1 reviews recent research on deep metric learning, followed by recent studies on contrastive learning in Section 2.2. The SimSiam model [24], which uses the same asymmetric contrastive structure as our approach, is described in Section 2.3. Recent work on cross-modal representation is covered in Section 2.4.

2.1. Deep Metric Learning

Deep metric learning (DML) is a specific type of metric learning that aims to measure the similarity between input data samples by mapping data to an embedding space in which similar instances are close together and dissimilar data are far apart. One of the fundamental models in which explicit metric learning is performed is the Siamese network [28, 29]. The Siamese network, with a contrastive loss [30], is a symmetric neural network architecture consisting of two identical subnetworks that share the same parameters. The contrastive loss, shown in equation (1), is a classic loss function and one of the most straightforward and intuitive training objectives for metric learning:

$$\mathcal{L}_{\text{con}} = Y\, d(x_1, x_2)^2 + (1 - Y)\max\big(0,\, m - d(x_1, x_2)\big)^2, \tag{1}$$

where $x_1, x_2$ are input samples, $y_1, y_2$ are their corresponding labels, and $Y = \mathbb{1}[y_1 = y_2]$ denotes the indicator function, which is equal to 1 when the data pair is similar (positive) and 0 when the pair is dissimilar (negative). The parameter $m$ is the margin threshold between positive and negative pairs, and $d(\cdot,\cdot)$ denotes the distance between the two embeddings.
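As a concrete illustration, equation (1) can be written in a few lines of PyTorch. The following is a generic sketch rather than the exact implementation used in [29, 30]; the function name and the default margin value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, y, margin=1.0):
    """Equation (1): y is a float tensor with 1 for similar pairs, 0 for dissimilar pairs.

    emb1, emb2: (N, D) embedding batches; margin is a hyperparameter (assumed 1.0 here).
    """
    d = F.pairwise_distance(emb1, emb2)              # Euclidean distance per pair
    loss_pos = y * d.pow(2)                          # pull similar pairs together
    loss_neg = (1 - y) * F.relu(margin - d).pow(2)   # push dissimilar pairs beyond the margin
    return (loss_pos + loss_neg).mean()
```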

Triplet loss [5] is another popular and widely used loss function for metric learning. Let $x_a, x_p, x_n$ be samples from the dataset and $y_a, y_p, y_n$ be their corresponding labels, with $y_a = y_p \neq y_n$. Usually, $x_a$ is called the anchor, $x_p$ the positive sample, and $x_n$ the negative sample. The loss function is defined as

$$\mathcal{L}_{\text{tri}} = \max\big(0,\, d(x_a, x_p)^2 - d(x_a, x_n)^2 + m\big), \tag{2}$$

where $m$ represents the margin threshold between $d(x_a, x_p)$ and $d(x_a, x_n)$. It should be pointed out that negative sample mining, in which we sample triplets satisfying $d(x_a, x_n)^2 - d(x_a, x_p)^2 < m$, plays a key role in the effectiveness of this loss function.
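A minimal sketch of equation (2) in PyTorch is shown below; the margin of 0.2 is an assumed value, not taken from [5], and PyTorch's built-in `nn.TripletMarginLoss` offers an equivalent off-the-shelf alternative.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Equation (2): hinge on the gap between positive and negative squared distances."""
    d_ap = F.pairwise_distance(anchor, positive).pow(2)
    d_an = F.pairwise_distance(anchor, negative).pow(2)
    return F.relu(d_ap - d_an + margin).mean()

# Semi-hard negative mining keeps triplets that still violate the margin,
# i.e. d_ap < d_an < d_ap + margin, so the loss remains informative during training.
```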

Based on contrastive loss and triplet loss, many improved methods have emerged. N-Pair Loss [31] generalizes triplet loss by allowing joint comparison among more than one negative example and reduces the computation by an efficient batch construction strategy. MS Loss [32] can fully consider three similarities for pair weighting.

Other commonly used DML methods are noncontrastive. The main idea of SphereFace [4] is enforcing the class centers to be at the same distance from the center by mapping them to a hypersphere. Furthermore, it employs angular distance with angular margin to measure distance. CosFace [7] proposes a more straightforward yet more effective method to define the margin. ArcFace [6] is very similar to CosFace. However, instead of defining the margin in the cosine space, it defines the margin directly in the angle space. The SoftTriple Loss [33] takes a different strategy by expanding the weight matrix to have multiple columns per class, providing more flexibility for modeling class variances.

2.2. Contrastive Learning

Self-Supervised Learning (SSL) is a technique to derive information from unlabeled data itself directly by formulating specific predictive tasks, and contrastive learning, based on contrastive constraints, is one of the most popular methods of self-supervision. Deep InfoMax [34] was presented to learn representations of images by leveraging the local structure present in an image. Structurally, SimCLR [15] is a simple contrastive learning method that utilizes negative samples and a sizeable mini-batch up to 4096 to work better. To get around the problem of an enormous number of negative examples, MoCo [20] maintained a large queue of them instead of updating the negative encoder by backpropagation. BYOL [35] assumed that it is still possible to train an SSL model with outstanding results without negative samples. It adopted an asymmetric network structure with momentum updates and a stop-gradient operation to avoid collapse. SimSiam [24] removed the target encoder updated by the momentum based on BYOL. Barlow Twins [36] started from the perspective of embedding rather than from samples and proposed improving the representation ability of features by allowing features of various dimensions to represent different information as much as possible.

2.3. SimSiam

SimSiam [24] has recently been one of the most popular contrastive learning methods. Its architecture, shown in Figure 2, takes two views $x_1$ and $x_2$, randomly augmented from the same image $x$, as input. An encoder module $f$, consisting of a backbone and a projection MLP [15], extracts image features from the two views, and the encoder shares weights between them. Only the left branch of the model feeds the encoder output to a predictor module $h$; the output of the right branch does not require any further treatment. The cosine similarity of the outputs of the two branches is then calculated.

“Stop-grad” denotes that the right branch does not propagate gradients during training. The strategy of stopping parameter updates in the right branch and the asymmetric structure of the model both serve to prevent model collapse, and the experiments in [24] show that these methods, especially stop-gradient, can effectively avoid collapsing solutions.
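The SimSiam objective with stop-gradient can be sketched as follows; `.detach()` plays the role of “Stop-grad,” and `f` and `h` are placeholders for the backbone-plus-projector and the prediction MLP rather than the authors' exact modules.

```python
import torch.nn.functional as F

def simsiam_loss(f, h, x1, x2):
    """Symmetrized negative cosine similarity with stop-gradient, in the spirit of [24].

    f: encoder (backbone + projector), h: predictor; x1, x2: two augmented views.
    """
    z1, z2 = f(x1), f(x2)          # projections
    p1, p2 = h(z1), h(z2)          # predictions
    # .detach() implements the stop-gradient on the target branch
    loss = -(F.cosine_similarity(p1, z2.detach(), dim=-1).mean()
             + F.cosine_similarity(p2, z1.detach(), dim=-1).mean()) / 2
    return loss
```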

2.4. Cross-Modal Representation

Cross-modal representation has played an indispensable role in representation learning in recent years; the goal is to build embeddings using information from multiple modalities. Fusing information from different domains into unified embeddings is one widely used class of cross-modal representation approaches. In this field, the work in [37] learned better visually grounded word embeddings by unifying local document context and global visual context. Based on word2vec [38], Kottur et al. [39] proposed adding images as inputs to its training so that the resulting word vectors carry visual semantic information.

Building embeddings for different modalities in a common semantic space has been another popular approach over the past few years. It allows the model to compute cross-modal similarity, which can be further used for downstream tasks such as cross-media retrieval [40–42]. Ba et al. [43] presented a model that can classify unseen categories from their textual descriptions by cross-modal similarity in Zero-Shot Learning (ZSL). Huang et al. [44] presented a cross-media model to transfer valuable knowledge in existing data to new data. On the basis of cross-modal representation, Song et al. [23] further considered the ambiguity of instances, thereby improving the accuracy of visual semantics.

3. Proposed Method

In this section, we propose a Simple Contrastive Model with Polysemous Text Fusion (PTF-SimCM) for visual similarity metric. The structure of this section is organized as follows. In Section 3.1, we describe the problems. We provide an overall description of the model in Section 3.2. Section 3.3 characterizes the cross-modal encoder, followed by the multimodal fusion and projector in Section 3.4. Predictor and inference are in Section 3.5 and Section 3.6, respectively. Section 3.7 presents the algorithm details.

3.1. Problem Formulation

The image similarity metric aims to estimate whether a given pair of images is similar or not, and the problem to be addressed in this study is how to measure image similarity and improve the performance of the algorithm in scenarios where only a large number of unlabeled images is available. Firstly, learning effective visual-metric representations without human supervision is a crucial problem, since most images in daily life are not human-labeled. Moreover, images on social media almost always appear with correlated text descriptions, so further considering and utilizing the textual modality in the image similarity metric is a meaningful study. Unfortunately, the textual description is sometimes ambiguous, so resolving this polysemy is another critical problem.

The proposed model combines self-supervised contrastive learning, cross-modal representation, and modalities fusion to address the above issues and advance the performance. Furthermore, our model learns the metric embeddings directly, the distance of which corresponds to image similarity. The notation list of the proposed method is shown in Table 1.

3.2. Overview of PTF-SimCM

PTF-SimCM employs the same contrastive learning structure as SimSiam. It aims to learn an embedding space where the cosine similarity of embeddings directly corresponds to image similarity: images with similar content have small distances, and images with distinct content have large distances. In contrast, previous contrastive learning approaches [15, 24, 35] learn a representation that is then used for downstream tasks. The architecture of PTF-SimCM, composed of two branches of neural networks and a middle cross-modal encoder (CME), is presented in Figure 3. The upper branch is defined by a set of weights $\theta$ and is composed of three stages: an encoder $f$, a multimodal projector $g$, and a predictor $h$. The lower branch is the same as the upper branch, except that it has no final predictor. The CME is a pretrained cross-modal network with frozen parameters to reduce computation. Alternatively, if computing resources are not a concern, fine-tuning its parameters during training may be a better option.

Given a set of image-text pairs $D$, a pair $(x, s)$ sampled uniformly from $D$, and two image augmentations $t_1$ and $t_2$ (image augmentation is a technique that generates similar but distinct views through a series of random changes to the input image), the model generates two augmented views $v_1 = t_1(x)$ and $v_2 = t_2(x)$ of image $x$ as the upper-branch and lower-branch inputs, respectively, and takes the description $s$ of the image as the input of the CME. The encoder network $f$ is actually a backbone (e.g., ResNet [45]) used to extract image features. The CME is used to process the text description $s$. To let the text learn the semantic information of the image before being fused with it, so that the subsequent fusion and projection become simpler, and to improve the efficiency of model training, we directly use a pretrained cross-modal representation model instead of a unimodal text sequence net as our CME module, which can embed the sentence into a shared visual-textual space. In addition, considering the ambiguity of the sentence, the CME transforms the input into $K$ embeddings with different semantics. Each embedding output from the middle tube is a visually grounded word representation, and we call it a cross-modal embedding. The image feature output from the encoder and each of the embeddings are fused through simple concatenation, as seen in the following equation:

$$z_i = \mathrm{concat}\big(f(v),\, e_i\big), \tag{3}$$

where $z_i$ denotes the $i$th fused feature and $e_i$ is the $i$th cross-modal embedding. The multimodal projector $g$ is a four-layer multilayer perceptron (MLP) that maps fused representations to the specific metric space. Then, $g$ is followed by an MLP predictor $h$ in the upper path. The predictor transforms the output of the upper-side multimodal projector and matches it to the lower-branch output. Stop-gradient, as in [24], is adopted in the lower branch; that is, the parameters of the network on the lower path are not updated by backpropagation. Various experiments in [35] have shown that the predictor and stop-gradient are the keys to preventing the model from falling into a collapsed solution. Negative cosine similarity, shown in (4), is used as the loss between the output embeddings of the upper and lower branch networks:

$$\mathcal{D}(p, q) = -\frac{p}{\|p\|_2} \cdot \frac{q}{\|q\|_2}, \tag{4}$$

where $\|\cdot\|_2$ is the $\ell_2$-norm, $p = h(g(z_1))$ is the upper-branch output, $q = g(z_2)$ is the lower-branch output, and $z_1, z_2$ are the fused features of the two views for one of the cross-modal embeddings. Following [24], we define a symmetrized loss as

$$\mathcal{L} = \frac{1}{2}\mathcal{D}\big(p_1,\, \mathrm{stopgrad}(q_2)\big) + \frac{1}{2}\mathcal{D}\big(p_2,\, \mathrm{stopgrad}(q_1)\big), \tag{5}$$

where $\mathrm{stopgrad}(\cdot)$ denotes the stop-gradient operation and $q_2$ denotes the final output of the lower branch network. The loss function measures the similarities between $v_1$ and $v_2$, two augmented views with the textual content description. The function is defined for each view-text pair, and the total loss is the average over all pairs.
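The following schematic sketch renders equations (3)–(5) for one training step. The helper names (`f`, `g`, `h`, `cme`, `ptf_simcm_step`) and the assumption that the cross-modal encoder returns a list of K per-sentence embeddings are illustrative, not the authors' code; a real implementation might batch over K instead of looping.

```python
import torch
import torch.nn.functional as F

def D(p, q):
    """Equation (4): negative cosine similarity between l2-normalized vectors."""
    return -F.cosine_similarity(p, q, dim=-1).mean()

def ptf_simcm_step(f, g, h, cme, v1, v2, sentences):
    """Symmetrized loss (equation (5)) for a batch of two augmented views.

    v1, v2: two augmented views of the images; cme(sentences) is assumed to
    return K cross-modal embeddings e_1..e_K, each of shape (N, d_c).
    """
    feats1, feats2 = f(v1), f(v2)                 # image features from the shared encoder
    losses = []
    for e in cme(sentences):                      # one loss term per cross-modal embedding
        z1 = g(torch.cat([feats1, e], dim=1))     # equation (3): concatenate, then project
        z2 = g(torch.cat([feats2, e], dim=1))
        p1, p2 = h(z1), h(z2)
        # stop-gradient on the target branch via .detach()
        losses.append(0.5 * D(p1, z2.detach()) + 0.5 * D(p2, z1.detach()))
    return torch.stack(losses).mean()
```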

3.3. Cross-Modal Encoder Module

For the model to perform feature fusion and projection more efficiently, we use a cross-modal model to handle the sentence input rather than a unimodal text model. Moreover, a sentence sometimes appears ambiguous in real life; that is to say, such a caption can map to more than one point in the embedding space. Unfortunately, a one-to-one mapper can only find a single point, which undoubtedly has a negative impact on subsequent applications.

PVSE (Polysemous Visual-Semantic Embedding) [23] is a visual-semantic embedding model which could more effectively address the partial cross-domain association issue and ambiguous instances issue compared with other approaches. The architecture of PVSE is composed of two submodels called PIE-Net (Polysemous Instance Embedding Network), which can extract embeddings of each sentence instance based on the different meanings. To address the issue of ambiguity in input sentences, we adopt Sentence PIE-Net pretrained through PVSE architecture as the cross-modal encoder.

The sentence encoder is composed of GloVe [46] and a bidirectional GRU (Bi-GRU) [47]. A sentence with $L$ words is converted by the pretrained GloVe into 300-dimensional vectors, giving the local features $X \in \mathbb{R}^{L \times 300}$. They are then fed into a Bi-GRU with $H$ hidden units, and the output of the final hidden layer serves as the global feature $g \in \mathbb{R}^{H}$. The local feature transformer transforms $X$ into locally guided representations. More specifically, $X$ is first fed into a multihead self-attention [48, 49] module implemented by a two-layer perceptron, and attention maps are obtained. Given the local features $X$, the attention maps $A$ are computed as follows:

$$A = \mathrm{softmax}\big(W_2 \tanh(W_1 X^{\top})\big), \tag{6}$$

where $W_1 \in \mathbb{R}^{d \times 300}$, $W_2 \in \mathbb{R}^{K \times d}$, and $A \in \mathbb{R}^{K \times L}$. Then, the local features are multiplied by the attention maps, and the result is passed through a nonlinear module to obtain the locally guided features $L_g$:

$$L_g = \sigma\big(W_3 (A X) + b_3\big), \tag{7}$$

where $W_3$ and $b_3$ are the weight matrix and bias of the nonlinear module and $\sigma$ denotes the sigmoid function.

Finally, the fusion block obtains the final embeddings by combining the global feature and the locally guided features. The embedding vectors are computed as

$$E = \mathrm{LN}\big(\mathrm{repeat}_K(g) + L_g\big), \tag{8}$$

where $\mathrm{repeat}_K(g)$ represents $K$ repetitions of the global feature $g$ and $\mathrm{LN}$ denotes layer normalization [50].
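A compact sketch of the local-feature transformer and fusion block in equations (6)–(8) is given below. The layer sizes and module name are illustrative assumptions; the real Sentence PIE-Net [23] may differ in width and implementation details.

```python
import torch
import torch.nn as nn

class SentencePIEHead(nn.Module):
    """Sketch of equations (6)-(8): self-attention, nonlinear mapping, residual fusion."""
    def __init__(self, local_dim=300, global_dim=1024, attn_dim=256, K=2):
        super().__init__()
        self.attn = nn.Sequential(                  # two-layer self-attention perceptron (eq. 6)
            nn.Linear(local_dim, attn_dim), nn.Tanh(),
            nn.Linear(attn_dim, K))
        self.fc = nn.Linear(local_dim, global_dim)  # nonlinear module mapping to embedding space
        self.ln = nn.LayerNorm(global_dim)

    def forward(self, local_feats, global_feat):
        # local_feats: (N, L, 300) GloVe vectors, global_feat: (N, global_dim) Bi-GRU state
        A = torch.softmax(self.attn(local_feats), dim=1)          # (N, L, K) attention maps
        attended = torch.einsum('nlk,nld->nkd', A, local_feats)   # A X, shape (N, K, 300)
        local_guided = torch.sigmoid(self.fc(attended))           # equation (7)
        return self.ln(global_feat.unsqueeze(1) + local_guided)   # equation (8): (N, K, global_dim)
```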

3.4. Multimodal Fusion and Projector

As the cross-modal module is used to process the text modality in the previous step, the obtained embeddings have already learned a certain amount of image semantic information. Consequently, the fusion method adopts a simple concatenation operation. Given the feature $f(v)$ of a view $v$ augmented from the input image and the $i$th embedding $e_i$ output from the cross-modal block, it computes the fused vector $z_i$:

$$z_i = \mathrm{concat}\big(f(v),\, e_i\big). \tag{9}$$

This simple concatenation operation does not establish a special semantic connection between different features, so it relies on subsequent network layers to adapt it. That is to say, the multimodal projector is not only a mapper but also a fusion adaptor.

The multimodal projector is a four-layer perceptron with batch normalization [51] applied to each fully connected layer, including the final output layer. The first three layers have no biases and use ReLU as the activation function. One of the multimodal metric embeddings is computed as

$$h^{(l)} = \mathrm{ReLU}\big(\mathrm{BN}(W^{(l)} h^{(l-1)})\big),\quad l = 1, 2, 3, \qquad q_i = \mathrm{BN}\big(W^{(4)} h^{(3)} + b^{(4)}\big), \tag{10}$$

where $h^{(l)}$, $W^{(l)}$, and $b^{(l)}$ represent the hidden state vector, weight matrix, and bias, respectively, of the $l$th layer, $h^{(0)} = z_i$, and $\mathrm{BN}$ denotes batch normalization.
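A minimal PyTorch sketch of the projector described by equation (10) follows; the hidden and output widths are assumed values, not the paper's settings.

```python
import torch.nn as nn

def make_projector(in_dim, hidden_dim=2048, out_dim=512):
    """Four-layer MLP projector g: BN on every layer, ReLU and no bias on the first three."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim, bias=False), nn.BatchNorm1d(hidden_dim), nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, hidden_dim, bias=False), nn.BatchNorm1d(hidden_dim), nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, hidden_dim, bias=False), nn.BatchNorm1d(hidden_dim), nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim), nn.BatchNorm1d(out_dim))
```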

3.5. Predictor

The predictor module is only applied to the upper branch to make the architecture asymmetric, as in [24, 35], and this structure can prevent collapse to a certain extent. Like the multimodal projector, the predictor block is also a multilayer perceptron, but with two layers. Batch normalization is applied to the hidden layer, which has no bias. The output fully connected layer has neither batch normalization nor an activation function. One of the final outputs is given by

$$p_i = W_2'\,\mathrm{ReLU}\big(\mathrm{BN}(W_1' q_i)\big) + b_2', \tag{11}$$

where $W_1'$ is the weight matrix of the hidden layer and $W_2'$ and $b_2'$ are the parameters of the output layer.
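The corresponding sketch for equation (11) is shown below; as with the projector, the layer widths are assumptions for illustration only.

```python
import torch.nn as nn

def make_predictor(dim=512, hidden_dim=128):
    """Two-layer predictor h: bias-free hidden layer with BN and ReLU, plain linear output."""
    return nn.Sequential(
        nn.Linear(dim, hidden_dim, bias=False), nn.BatchNorm1d(hidden_dim), nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, dim))
```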

3.6. Inference

At test time, we use the inference model obtained by removing the upper branch and the loss function from PTF-SimCM to measure similarity. Two image-text pairs $(x_1, s_1)$ and $(x_2, s_2)$ are given to measure their similarity. They are fed into the inference model, and two groups of embedding vectors $Q_1$ and $Q_2$, respectively, are obtained. Then we find the best-matching pair by comparing the cosine distances between all combinations of embeddings from $Q_1$ and $Q_2$. This best-matching distance serves as the similarity between $(x_1, s_1)$ and $(x_2, s_2)$. The architecture and processing flow at inference time are shown in Figure 4.
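The inference step can be sketched as follows, assuming a callable `infer_model` that maps an image-text pair to its K metric embeddings; the names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def pair_similarity(infer_model, pair_a, pair_b):
    """Similarity of two image-text pairs at test time.

    infer_model(image, sentence) is assumed to return the K metric embeddings of a
    pair (shape (K, d)); the similarity is the best cosine match over all K x K
    combinations, as described above.
    """
    qa = F.normalize(infer_model(*pair_a), dim=-1)   # (K, d)
    qb = F.normalize(infer_model(*pair_b), dim=-1)   # (K, d)
    return (qa @ qb.t()).max().item()                # best-matching cosine similarity
```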

3.7. Optimization Algorithm

The model adopts mini-batch gradient descent to update its parameters; the specific steps and details of model training and optimization are shown in Algorithm 1.

Input:
$D$: set of images with descriptions; $T$: distribution of image transformations;
$\theta$: initial parameters of the encoder $f$, multimodal projector $g$, and predictor $h$;
$c$: pretrained cross-modal encoder;
$K$: the number of cross-modal embeddings;
optimizer: updates parameters using the loss gradient;
$S$, $N$: total number of optimization steps and batch size;
$\eta(s)$: learning rate schedule;
(1)for $s = 1$ to $S$ do
(2)  $\{(x_j, s_j)\}_{j=1}^{N} \sim D$ //sample a batch of N image-text pairs
(3)  for $j = 1$ to $N$ do
(4)   $t_1 \sim T$, $t_2 \sim T$ //sample image transformations
(5)   $v_1 \leftarrow t_1(x_j)$, $v_2 \leftarrow t_2(x_j)$
(6)   for $i = 1$ to $K$ do
(7)    $e_i \leftarrow c(s_j)_i$
(8)    $z_1^i \leftarrow \mathrm{concat}(f(v_1), e_i)$, $z_2^i \leftarrow \mathrm{concat}(f(v_2), e_i)$
(9)    $\ell_i \leftarrow \frac{1}{2}\mathcal{D}\big(h(g(z_1^i)), \mathrm{stopgrad}(g(z_2^i))\big) + \frac{1}{2}\mathcal{D}\big(h(g(z_2^i)), \mathrm{stopgrad}(g(z_1^i))\big)$
(10)   end
(11)   $\mathcal{L}_j \leftarrow \frac{1}{K}\sum_{i=1}^{K} \ell_i$ //compute the total loss
(12)  end
(13)  $\nabla_\theta \leftarrow \partial\big(\frac{1}{N}\sum_{j=1}^{N}\mathcal{L}_j\big)/\partial\theta$ //compute the total loss gradient
(14)  $\theta \leftarrow \mathrm{optimizer}(\theta, \nabla_\theta, \eta(s))$ //update parameters
(15)end
(16)return $\theta$
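A compressed PyTorch rendering of Algorithm 1's outer loop is sketched below, reusing the hypothetical `ptf_simcm_step` from the sketch in Section 3.2 (any callable returning the batch loss would do). The optimizer hyperparameters and the per-epoch schedule are assumptions, not the paper's exact settings (see Table 2).

```python
import torch

def train(loader, params, step_fn, epochs, base_lr=0.05, momentum=0.9, weight_decay=1e-4):
    """Outer loop of Algorithm 1: sample a batch, compute the loss, update parameters."""
    opt = torch.optim.SGD(params, lr=base_lr, momentum=momentum, weight_decay=weight_decay)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    for _ in range(epochs):
        for images, sentences in loader:       # batches of N image-text pairs
            loss = step_fn(images, sentences)  # forward pass and symmetrized loss (lines 3-12)
            opt.zero_grad()
            loss.backward()                    # total loss gradient
            opt.step()                         # parameter update
        sched.step()                           # cosine learning-rate decay per epoch
    return params
```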

4. Experiment

4.1. Datasets

We first introduce the three image-text retrieval datasets used in this work: MSCOCO [25], Flickr 30k [26], and Pascal Sentence [27]. Figure 5 shows sample data from the three datasets.

MSCOCO is one of the most widely used cross-modal retrieval datasets in recent years. It contains 123,287 images and 616,767 descriptions, with roughly 5 text descriptions per image on average. The dataset is officially split into 113,287 training images, 5,000 validation images, and 5,000 testing images. As our proposed model requires input that includes both images and sentences, we do not use the official testing set. To enable evaluation, we divided the validation set into 100 classes using the K-means algorithm [52].

Flickr 30k is another popular large-scale image captioning dataset. It contains 31,783 images collected from Flickr, each annotated with five textual descriptions. The average sentence length is 10.5 words after removing rare words. We adopt the protocol in [53] to split the dataset into 1,000 test images, 1,000 validation images, and 29,783 training images. To have more training data, we merged the validation set into the training set, obtaining a final training set of 30,783 images. Again, we split the test set into 20 classes using the K-means algorithm.
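For reference, pseudo-classes of this kind can be derived by clustering image feature vectors with scikit-learn's K-means; the sketch below is a generic illustration of the idea, with the feature extraction step and function name left as assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_classes(features, n_classes):
    """Assign pseudo-class labels to evaluation images by clustering their feature
    vectors, e.g. 100 classes for MSCOCO and 20 classes for Flickr 30k."""
    km = KMeans(n_clusters=n_classes, random_state=0, n_init=10)
    return km.fit_predict(np.asarray(features))   # one cluster id per image
```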

The Pascal Sentence dataset is also widely used for cross-media retrieval. It is a subset of Pascal VOC and contains 1,000 image-text pairs from 20 categories, so each category includes 50 pairs, and each pair consists of an image and several sentences. The number of samples in this dataset is not large enough to serve as training data for our model; consequently, we use it only as a test set.

4.2. Evaluation Metrics

We use three metrics, Recall at k (Recall@k), R-Precision, and Mean Average Precision at R (MAP@R), to evaluate the performance of the models. According to [54], R-Precision and MAP@R are the most suitable metrics for measuring model performance in metric learning. As Recall has been the most widely used measure in metric learning over the past few years, we also use it for comprehensive comparisons with other methods.

Recall@k in metric learning is defined as follows: for each sample in the dataset, retrieve its $k$ nearest neighbors. If at least one of those nearest neighbors is a match, the sample gets a score of 1; otherwise, it scores 0. Recall@k is the average score over all query samples:

$$\mathrm{Recall@}k = \frac{1}{|Q|}\sum_{q \in Q} \mathbb{1}\big[\text{at least one match among the } k \text{ nearest neighbors of } q\big]. \tag{12}$$

Given a query, let $R$ be the total number of references that belong to the same class as the query. R-Precision is computed as

$$\text{R-Precision} = \frac{r}{R}, \tag{13}$$

where $r$ is the number of the $R$ nearest references that are of the same class as the query.

MAP@R is Mean Average Precision with the number of nearest neighbors for each query set to $R$. For a single query,

$$\mathrm{MAP@}R = \frac{1}{R}\sum_{i=1}^{R} P(i), \qquad P(i) = \begin{cases} \text{precision at } i, & \text{if the } i\text{th retrieval is a match},\\ 0, & \text{otherwise}. \end{cases} \tag{14}$$
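The three metrics in equations (12)–(14) can be computed directly from an embedding matrix with a brute-force NumPy sketch like the one below; this is a generic evaluation helper (it assumes l2-normalized embeddings, at least two samples per class, and the Recall@k cutoffs reported in the tables), not the authors' evaluation code.

```python
import numpy as np

def metrics_at_r(embeddings, labels):
    """Recall@k, R-Precision, and MAP@R from (N, D) l2-normalized embeddings and 1-D labels."""
    embeddings, labels = np.asarray(embeddings), np.asarray(labels)
    sims = embeddings @ embeddings.T
    np.fill_diagonal(sims, -np.inf)                     # exclude the query itself
    order = np.argsort(-sims, axis=1)                   # neighbors sorted by similarity
    recall_at = {k: [] for k in (1, 2, 4, 8)}
    r_prec, map_r = [], []
    for i, lab in enumerate(labels):
        matches = labels[order[i]] == lab               # boolean match flags in ranked order
        R = int((labels == lab).sum()) - 1              # same-class references (query excluded)
        for k in recall_at:
            recall_at[k].append(float(matches[:k].any()))
        r_prec.append(matches[:R].sum() / R)            # equation (13)
        prec_at_i = np.cumsum(matches[:R]) / np.arange(1, R + 1)
        map_r.append((prec_at_i * matches[:R]).mean())  # equation (14): P(i) only at matches
    return ({k: np.mean(v) for k, v in recall_at.items()},
            float(np.mean(r_prec)), float(np.mean(map_r)))
```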

4.3. Implementation Details

The model proposed in this study employs the same contrastive learning structure as SimSiam [24]. We used momentum SGD as the optimizer with an initial learning rate and a cosine decay schedule [55]; the specific parameter settings are shown in Table 2. In addition, to obtain a significant computational speedup and lower GPU memory consumption, we utilize mixed-precision training [56].
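Mixed-precision training of this kind maps onto PyTorch's automatic mixed precision utilities; the following is a generic sketch of one such update step under assumed names, not the authors' exact training code.

```python
import torch

def amp_train_step(step_fn, opt, scaler, images, sentences):
    """One mixed-precision update [56] using torch.cuda.amp."""
    opt.zero_grad()
    with torch.cuda.amp.autocast():          # run the forward pass in float16 where safe
        loss = step_fn(images, sentences)
    scaler.scale(loss).backward()            # scale the loss to avoid float16 underflow
    scaler.step(opt)                         # unscale gradients and update parameters
    scaler.update()
    return loss.item()

# usage: scaler = torch.cuda.amp.GradScaler(); call amp_train_step inside the training loop
```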

4.4. Baseline Methods

To verify the performance and validity of the proposed model, we select models that are popular in the relevant fields and have shown good performance in past studies as baselines. For supervised metric learning, we compare against the following models: Siamese Network [29], Triplet Network [5], SoftTriple [33], MemVir [57], and Label Relaxation [58]. The selected Siamese Network is SimNet [29], which is trained on pairs of positive and negative images like the conventional Siamese model. Its improvement is that it uses a novel online pair mining strategy based on curriculum learning and adopts a multiscale convolutional neural network (CNN) as the feature extractor. FaceNet [5] was initially used to distinguish face images and directly learns a mapping from input face images to a metric space where distances correspond to a measure of face similarity. It uses a triplet architecture with triplet loss and also performs well on similarity detection for nonface images. SimNet and FaceNet are contrastive-based DML approaches that use specific loss functions to directly pull together the embeddings of samples with the same label and push away the embeddings of dissimilar samples. SoftTriple is representative of noncontrastive approaches and can learn the embeddings without a sampling stage by slightly increasing the size of the last fully connected layer. MemVir and Label Relaxation are two state-of-the-art methods in metric learning. MemVir exploits virtual classes by maintaining memory queues to utilize augmented information for training and alleviates a strong focus on seen classes for better generalization. To improve metric learning performance, Label Relaxation employs a relaxed contrastive loss based on embedding transfer to enable more crucial pairs to contribute more to training. All five methods perform extremely well when large labeled datasets are available. With regard to unsupervised visual-metric representation, we compare our method with SimSiam [24], the simplest SSL model, which requires neither negative samples nor a large batch size. It is one of the best SSL models, and the contrastive structure of our model is based on it.

4.5. Performance Comparison

In this section, we conduct a series of performance comparison experiments. To simulate the unsupervised scenario in which no relevant labeled dataset is available, we use an additional dataset, Mini ImageNet (a subset of ImageNet [59]), to train the above models. For unsupervised metric learning, as the contrastive structure of our model is based on SimSiam, we employ it as the baseline. SimSiam generally uses the feature extraction module without the projector block for downstream tasks. As our model has an additional mapping module in its structure, in order to make the comparison fairer, we add a variant "SimSiam + proj" that retains the projector for comparison. The performances of the proposed model and the baselines on the MSCOCO dataset are reported in Table 3.

From Table 3, we can observe that our model achieves positive results. The proposed model with a metric embedding dimension of 64 achieves a significant improvement in terms of MAP@R of at least 0.2, and another PTF-SimCM setting obtains the best overall recall. SoftTriple has the best R-Precision of 0.307 and the highest Recall@1. MemVir gets the second-highest MAP@R value. It is noteworthy that the performance of Label Relaxation is inferior to that of SoftTriple. In our judgment, the primary cause is that Label Relaxation is an embedding transfer model that relies on a well-trained source embedding model, which can hardly be obtained in an unsupervised scenario. We can only train on a labeled dataset from another domain and then test on data from the target domain. Consequently, Label Relaxation lost its advantage in this particular scenario and did not achieve remarkable performance in the above experiments. In the same way, it is also difficult for MemVir to exert its advantages when the training and test data are not from the same domain. Compared to SimSiam and "SimSiam + proj," our model achieves better performance on all three metrics. Furthermore, it is worth noting that the overall results are not particularly high. One of the key factors is that MSCOCO is not a dataset designed for classification or image retrieval; consequently, there is relatively large intraclass variation among the positive samples. Besides, the results also depend on the effect of the clustering method.

To make the experimental comparison more comprehensive, we conducted experiments on the Flickr 30k dataset, with the results summarized in Table 4. Our proposed method achieves the best MAP@R score, 0.3 percentage points higher than the second place, and another PTF-SimCM setting achieves the best Recall@8 value. SoftTriple obtains the best R-Precision of 0.296, the highest Recall@1 and Recall@4, and the second-highest MAP@R value. MemVir gets 69.1% on Recall@2, the highest value. In general, the performance of our method on Flickr 30k is close to those of the supervised methods SoftTriple and MemVir without exceeding them by much. The main reason is that the training data is insufficient; moreover, the classes of the test set were still derived by K-means, which also affects the measured accuracy.

In order to evaluate the universality of the proposed model, we also conducted experiments on the Pascal Sentence dataset. In this set of experiments, MSCOCO is still used as training data, and Pascal Sentence is merely used as the test data for models. The results are shown in Table 5.

The above results, in which the proposed model has better overall performance, are similar to those shown in Tables 3 and 4. The evaluation values in Table 5 have improved, yet they are still not excellent. We believe the reason is that there is not enough training data for a generic image metric model; accordingly, it did not learn a distribution close enough to the ground truth. Overall, our model performs favorably in the experimental results, outperforming recent popular supervised learning methods under certain conditions, and, especially compared to SimSiam with the same contrastive structure, our model improves significantly.

4.6. Ablation Study

To demonstrate the effectiveness of some settings of the proposed model, we conduct a series of ablation experiments on the Pascal Sentence dataset. In these experiments, we denote the number and the dimensions of the cross-modal embeddings as $K$ and $d_c$, respectively, and the dimensions of the metric embeddings as $d_m$.

The Number of Embeddings. Figure 6 shows the effect of varying $K$, the number of cross-modal embeddings, on the Pascal Sentence dataset. According to the "Proposed Method" section, $K$ represents the number of different sentence semantics. We fix $d_c$ and $d_m$. To better comprehend the effect of $K$, we vary it from 1 to 4 and compare the Recall@1 (also known as "Precision@1") results under different epochs. When $K = 1$, the cross-modal encoder is a typical one-to-one model. From Figure 6, we can see a significant improvement once $K > 1$, which shows the necessity of considering ambiguity. At the best value of $K$, the performance at almost all epochs reaches its peak.

The Dimensions of Cross-Modal Embeddings. We run experiments on $d_c$ on the Pascal Sentence dataset with $K$ and $d_m$ fixed. Figure 7 shows the results for different $d_c$ based on the Recall@k metric. From the results, we find that Recall increases as $d_c$ grows, and the larger settings perform significantly better than the smallest one. Considering all Recall values, a moderate $d_c$ is the best. This shows that the dimension of the cross-modal embedding should neither be too small, which may lose important information, nor too large, which may cause redundancy.

The Dimensions of Metric Embedding. Figure 8 shows the Recall values on the Pascal Sentence dataset when we vary $d_m$ while keeping $K$ and $d_c$ fixed. The results show a noticeable difference between the performances of the different settings of $d_m$: one setting obtains the best Recall, and another obtains the worst. This shows that the dimension of the metric embedding plays an important role in our model.

5. Conclusions

In this paper, we present a Simple Contrastive Model with Polysemous Text Fusion (PTF-SimCM) for the visual similarity metric. PTF-SimCM is composed of two branches of neural networks and a middle cross-modal encoder. The two branches adopt an asymmetric contrastive structure with shared weights to address the unsupervised visual representation issue. The pretrained cross-modal encoder is used for polysemous expression embedding, and a multimodal fusion operation is designed for feature fusion. To make image similarity measurement simpler and more efficient, PTF-SimCM directly learns an embedding space where distances correspond to a measure of image similarity.

Experimental results on the MSCOCO, Flickr 30k, and Pascal Sentence datasets show that PTF-SimCM utilizes the information of the text modality more comprehensively, fully considers sentence polysemy, and obtains better results than the baselines. Following the observational evidence, although Recall@k is not always the best, the MAP@R value reaches the highest, surpassing the supervised learning models. Moreover, we conducted ablation studies on the number and dimensions of the cross-modal embeddings, and the results indicate that a moderate number of cross-modal embeddings gives the best overall Recall@k. For the effect of the metric embedding dimensions, the setting with the best Recall@k differs from the one at which MAP@R peaks.

Nevertheless, a limitation of our study is that the proposed method only works if there exists a correlation between the two modalities of image and text. When the two modalities are uncorrelated or less correlated, our approach will be ineffective and even negatively affect the similarity metric of images. Accordingly, in future work, we will explore a new strategy of multicomplicated modalities fusion, such as considering both temporal information and spatial information [60], thereby further improving the robustness and performance of the model. Furthermore, we can also explore the use of semisupervised approaches [61, 62] to address the problem of model training without labeled data.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was funded by the Hunan Provincial Natural Science Foundation of China (Grant nos. 2022JJ30231 and 2022JJ50051), the National Natural Science Foundation of China (Grant no. 62072166), and the Innovation Foundation for Postgraduate of Hunan University of Technology (Grant no. CX2213).