Abstract

Recently, benefiting from the storage and retrieval efficiency of hashing and the powerful discriminative feature extraction capability of deep neural networks, deep cross-modal hashing retrieval has attracted increasing attention. To preserve the semantic similarities of cross-modal instances during the hash mapping procedure, most existing deep cross-modal hashing methods learn deep hashing networks with a pairwise loss or a triplet loss. However, these methods may not fully explore the similarity relations across modalities. To solve this problem, in this paper, we introduce a quadruplet loss into deep cross-modal hashing and propose a quadruplet-based deep cross-modal hashing (termed QDCMH) method. Extensive experiments on two benchmark cross-modal retrieval datasets show that our proposed method achieves state-of-the-art performance and demonstrate the effectiveness of the quadruplet loss in cross-modal hashing.

1. Introduction

With the advent of the big data era, massive multimedia data, such as images, videos, and texts, are surging onto the Internet. These data usually exist in diverse modalities; for example, a video or an image may be accompanied by textual and audio data describing it. As data from different modalities may have close semantic relevance, cross-modal retrieval [1, 2] has been proposed to retrieve semantically similar data in one modality given a query from a different modality. Benefiting from its high efficiency and low storage cost, hashing-based cross-modal retrieval (cross-modal hashing) [3–6] has drawn extensive attention. The goal of cross-modal hashing is to map heterogeneous modal data into a common binary space and ensure that semantically similar/dissimilar cross-modal data have similar/dissimilar hash codes. Cross-modal hashing methods can usually achieve superior performance; nonetheless, most existing cross-modal hashing methods (such as cross-modal similarity sensitive hashing (CMSSH) [7], semantic correlation maximization (SCM) [8], semantics-preserving hashing (SePH) [9], and generalized semantic preserving hashing (GSPH) [10]) are based on handcrafted features, which cannot effectively capture the heterogeneous relevance between different modalities and thus may result in inferior performance.

In the last decade, deep convolutional neural networks [11, 12] have been successfully utilized in many computer vision tasks, and therefore, some researchers have also deployed them in cross-modal hashing, such as deep cross-modal hashing (DCMH) [13], pairwise relationship guided deep hashing (PRDH) [14], self-supervised adversarial hashing (SSAH) [15], and triplet-based deep hashing (TDH) [16]. Cross-modal hashing methods based on deep neural networks integrate hash representation learning and hash function learning into an end-to-end framework, which can capture heterogeneous cross-modal relevance more effectively and thus achieve better cross-modal retrieval performance.

To date, most deep cross-modal hashing methods utilize the pairwise loss (such as [13–15]) or the triplet loss (such as [16]) to preserve semantic relevance during the hash representation learning procedure. Nevertheless, pairwise loss- and triplet loss-based hashing methods suffer from a weak generalization capacity from the training set to the testing set [17, 18], as shown in Figure 1(a). By contrast, the quadruplet loss has been proposed and utilized in image hashing retrieval [17] and person reidentification [18], where it has been shown that quadruplet loss-based models enjoy an enhanced generalization capability. Therefore, combining cross-modal hashing with the quadruplet loss is a natural solution to enhance the performance of cross-modal hashing, as shown in Figure 1(b).

To this end, in this paper, we introduce the quadruplet loss into cross-modal hashing and propose a quadruplet-based deep cross-modal hashing method (QDCMH). Specifically, QDCMH firstly defines a quadruplet-based cross-modal semantic preserving module. Afterwards, QDCMH integrates this module, hash representation learning, and hash code generation into an end-to-end framework. Finally, experiments on two benchmark cross-modal retrieval datasets are conducted to validate the performance of the proposed method. The main contributions of our proposed QDCMH include the following:
(i) We introduce the quadruplet loss into cross-modal retrieval and propose a novel deep cross-modal hashing method. To the best of our knowledge, this is the first work to introduce the quadruplet loss into cross-modal hashing retrieval.
(ii) We conduct extensive experiments on benchmark cross-modal retrieval datasets to investigate the performance of our proposed QDCMH.

The remainder of this paper is organized as follows. Section 2 elaborates our proposed quadruplet-based deep cross-modal hashing method. Section 3 presents the learning algorithm of QDCMH. Section 4 presents the experimental results and the corresponding analysis. Section 5 concludes our work.

2. Proposed Method

In this section, we elaborate our proposed quadruplet-based deep cross-modal hashing (QDCMH) method in the following subsections: notations, the quadruplet-based cross-modal semantic preserving module, feature learning networks, and hash function learning. Figure 2 presents the flowchart of our proposed QDCMH, which incorporates the quadruplet-based cross-modal semantic preserving module, hash representation learning, and hash code generation into an end-to-end framework. In our proposed QDCMH method, we assume that each instance has two modalities, i.e., an image modality and a text modality, but the method can easily be extended to more than two modalities.

2.1. Notations

Assume that the training set comprises image-text pairs, i.e., the original image features and the corresponding original text features, each modality with its own original feature dimension. Besides, a label vector is associated with each image-text pair, and the label vectors of all training instances constitute a label matrix whose number of columns equals the total number of class categories. If an image-text pair belongs to a given category, the corresponding entry of its label vector is 1; otherwise, it is 0. A quadruplet consists of a query instance from the image modality and three retrieval instances from the text modality, where the query and the first retrieval instance (the positive instance) share at least one category, while the query and the second retrieval instance, the query and the third retrieval instance, and the second and third retrieval instances form three pairs in which the two instances have no common label.
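
Since the quadruplet construction is described only in words above, the following is a minimal sketch of how such quadruplets might be sampled from a multi-label matrix; the function name sample_quadruplet, the use of NumPy, and the rejection-sampling strategy are our assumptions rather than the paper's exact procedure.

import numpy as np

def sample_quadruplet(L, rng=None):
    """Return indices (q, p, n1, n2) of image-text pairs such that pairs q and p
    share at least one label, while (q, n1), (q, n2), and (n1, n2) share none.
    L is an n x c multi-hot label matrix with entries in {0, 1}; the image of
    pair q is used as the query and the texts of p, n1, n2 as retrieval items."""
    rng = rng or np.random.default_rng()
    n = L.shape[0]
    sim = (L @ L.T) > 0                        # n x n boolean semantic similarity
    while True:
        q = rng.integers(n)
        pos = np.flatnonzero(sim[q] & (np.arange(n) != q))
        neg = np.flatnonzero(~sim[q])
        if pos.size == 0 or neg.size < 2:
            continue                           # resample the query instance
        p = rng.choice(pos)
        n1, n2 = rng.choice(neg, size=2, replace=False)
        if not sim[n1, n2]:                    # the two negatives must also be dissimilar
            return q, p, n1, n2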

Given a quadruplet whose query comes from the image modality, the target of our proposed QDCMH is to learn the corresponding hash codes of the four instances in the quadruplet. To learn these hash codes, we first learn hash representations from the quadruplet with deep neural networks: an image-modality network, with its own parameters, produces the hash representation of the query image, and a text-modality network, with its own parameters, produces the hash representations of the three retrieval texts. Secondly, we utilize the sign function in equation (1) to approximately map the hash representations into the corresponding hash codes. In the same way, we can learn the hash codes of a quadruplet whose query comes from the text modality. For convenience, we denote the hash codes of all training image-text pairs, the hash representations of all training image instances, and the hash representations of all training text instances as the hash code matrix, the image hash representation matrix, and the text hash representation matrix, respectively, each having the hash code length as its number of columns. The sign function binarizes a real-valued hash representation elementwise:

\[ \operatorname{sign}(h) = \begin{cases} +1, & h \geq 0, \\ -1, & h < 0. \end{cases} \qquad (1) \]

2.2. Quadruplet-Based Cross-Modal Semantic Preserving Module

In cross-modal hashing retrieval, given an image instance and a text instance, it is difficult to preserve their semantic relevance during the hash code learning procedure because of the huge semantic gap across modalities. To address this, DCMH [13] defines a pairwise loss to map similar/dissimilar image-text pairs into similar/dissimilar hash codes. TDH [16] utilizes a triplet loss to learn similar hash codes for similar cross-modal instances and distinct hash codes for semantically irrelevant cross-modal instances. Both the pairwise loss and the triplet loss can preserve the relevance of the original instance space; however, pairwise loss- and triplet loss-based hashing methods often suffer from a weak generalization capability from the training set to the testing set [17, 18]. To solve this problem, in this section, a quadruplet-based cross-modal semantic preserving module is proposed to boost the generalization capability and better preserve the semantic relevance for cross-modal hashing.

For a quadruplet whose query comes from the image modality, the semantic relevance should be kept unchanged during hash representation learning, i.e., the hash representation of the query should be similar to that of the positive text instance, distinct from those of the two negative text instances, and the hash representations of the two negative text instances should also be dissimilar from each other. Thus, we can define the quadruplet loss for cross-modal hashing in equation (2), in which the query is an image instance, the three retrieval instances come from the text modality, the query and the positive instance are semantically similar, and the three pairs formed among the query and the two negative instances each consist of instances with distinct semantics. Equation (2) requires that the distance between the hash representations of a similar cross-modal pair be smaller, by a positive margin, than that of a dissimilar pair (both intermodality and intramodality pairs). This ensures that similar cross-modal instances have similar hash representations while dissimilar instances have distinct hash representations. Through this quadruplet loss, the cross-modal semantic relevance can be preserved during the hash representation learning stage.
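
Under our assumed notation, where \(F_q\) denotes the hash representation of the query image and \(G_p\), \(G_{n_1}\), \(G_{n_2}\) denote the hash representations of the positive and the two negative text instances, one plausible form of this loss, following the quadruplet losses of [17, 18], is the following sketch (the symbols and the squared Euclidean distance \(d(\cdot,\cdot)\) are our assumptions, not necessarily the paper's exact formulation):

\[ \mathcal{L}_{I \rightarrow T} = \big[\, d(F_q, G_p) - d(F_q, G_{n_1}) + \alpha_1 \,\big]_+ + \big[\, d(F_q, G_p) - d(G_{n_1}, G_{n_2}) + \alpha_2 \,\big]_+ , \]

where \([x]_+ = \max(x, 0)\) and \(\alpha_1\), \(\alpha_2\) are positive margins; the first term separates the positive pair from the inter-modality negative pair, and the second term separates it from the intra-modality negative pair.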

Similarly, given a quadruplet whose query comes from the text modality, we can define the cross-modal quadruplet loss in equation (3), in which the query is a text instance, the three retrieval instances come from the image modality, the hash representations of the four instances are produced by the corresponding networks, and two positive margins are used. Equation (3) differs from equation (2) in that the modality of the query instance and the modality of the retrieval instances are exchanged.
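
As a concrete illustration, the following is a minimal PyTorch sketch of the image-to-text quadruplet loss under the assumed form above; the function name, the squared Euclidean distances, and the default margin values are our assumptions, and the text-to-image loss of equation (3) is obtained by swapping the roles of the two modalities.

import torch
import torch.nn.functional as F

def cross_modal_quadruplet_loss(f_q, g_p, g_n1, g_n2, alpha1=0.5, alpha2=0.3):
    """f_q: hash representations of the query images, shape (batch, code_length).
    g_p, g_n1, g_n2: hash representations of the positive and the two negative
    text instances, each of shape (batch, code_length)."""
    d_pos = (f_q - g_p).pow(2).sum(dim=1)          # similar cross-modal pair
    d_neg_inter = (f_q - g_n1).pow(2).sum(dim=1)   # dissimilar inter-modality pair
    d_neg_intra = (g_n1 - g_n2).pow(2).sum(dim=1)  # dissimilar intra-modality pair
    # similar pairs must be closer than dissimilar pairs by the given margins
    loss = F.relu(d_pos - d_neg_inter + alpha1) + F.relu(d_pos - d_neg_intra + alpha2)
    return loss.mean()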

2.3. Hash Representation Learning and Hash Code Learning

For each quadruplet sampled from the training set, the above quadruplet-based cross-modal semantic relevance preserving module allows us to learn hash representations that preserve the semantic similarity. The hash representation learning loss therefore sums the image-to-text quadruplet losses over all quadruplets whose queries are images and the text-to-image quadruplet losses over all quadruplets whose queries are texts, with a hyperparameter balancing the two parts.
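
Under the same assumed notation, with \(N_1\) quadruplets whose queries are images, \(N_2\) quadruplets whose queries are texts, and a balance hyperparameter \(\gamma\) (all symbol names are ours), this representation loss can be sketched as

\[ \mathcal{L}_{rep} = \frac{1}{N_1} \sum_{i=1}^{N_1} \mathcal{L}_{I \rightarrow T}^{(i)} + \gamma \, \frac{1}{N_2} \sum_{j=1}^{N_2} \mathcal{L}_{T \rightarrow I}^{(j)} . \]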

Additionally, to learn high-quality hash codes, we generate hash codes from the learned hash representations with the sign function in equation (1); the final hash code matrix for all training image-text pairs is generated as in equation (5).

As the hash representations of the two modalities are real-valued, to decrease the information loss incurred when they are mapped to the binary hash codes in equation (5), it is necessary to force the hash representations of both modalities to be as close as possible to the hash codes; thus, we introduce the quantization loss in equation (6).

Integrating the hash representation loss and the quantization loss together yields the whole loss function in equation (7), in which a hyperparameter balances the hash representation loss and the quantization loss.
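
For concreteness, a plausible instantiation of the code generation, quantization, and overall objective, in the spirit of DCMH [13] and under our assumed notation (hash code matrix \(B\), image and text hash representation matrices \(F\) and \(G\), and a quantization weight \(\mu\)), is the following sketch; the exact forms used by QDCMH may differ:

\[ B = \operatorname{sign}(F + G), \qquad \mathcal{L}_{quant} = \lVert B - F \rVert_F^2 + \lVert B - G \rVert_F^2, \qquad \mathcal{L} = \mathcal{L}_{rep} + \mu \, \mathcal{L}_{quant} . \]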

2.4. Feature Extraction Networks

In QDCMH, feature extraction involves two deep neural networks: a classical convolutional neural network is used to extract features from images, and a multiscale fusion model is utilized to learn features from texts. Specifically, for the image modality, we deploy AlexNet [11] pretrained on the ImageNet [19] dataset and replace its last layer with a new fully connected hash layer whose number of hidden nodes equals the hash code length, so that the learned deep features are embedded into a Hamming space of that dimension. For the text modality, the TxtNet in SSAH [15] is used, which comprises a three-layer feedforward neural network and a multiscale (MS) fusion model.
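
A minimal PyTorch sketch of such an image branch is given below, assuming torchvision's pretrained AlexNet; the class name, the tanh activation on the hash layer, and the 64-bit default code length are our assumptions rather than details given in the paper.

import torch
import torch.nn as nn
from torchvision import models

class ImageHashNet(nn.Module):
    """AlexNet backbone with its 1000-way classifier replaced by a hash layer."""
    def __init__(self, code_length=64):
        super().__init__()
        alexnet = models.alexnet(pretrained=True)       # pretrained on ImageNet
        self.features = alexnet.features
        self.avgpool = alexnet.avgpool
        # keep all fully connected layers except the final classification layer
        self.fc = nn.Sequential(*list(alexnet.classifier.children())[:-1])
        self.hash_layer = nn.Linear(4096, code_length)  # new fully connected hash layer

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return torch.tanh(self.hash_layer(x))           # real-valued hash representation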

3. Learning Algorithm of QDCMH

For QDCMH, we utilize an alternating strategy to learn the parameters of the image-modality deep neural network, the parameters of the text-modality deep neural network, and the hash code matrix of all training image-text pairs. When we learn one of the three, we keep the other two fixed. The specific algorithm for QDCMH is depicted in Algorithm 1.

Input:
   Training data set of image-text pairs and their label matrix; the maximal number of epochs; the mini-batch size.
Output:
  Parameters of the two deep neural networks and the corresponding hash code matrix.
(1) Generate the quadruplets whose queries are images from the training set, and generate the quadruplets whose queries are texts from the training set.
(2) Initialize the deep neural network parameters of both modalities, the hash representations of all training images, the hash representations of all training texts, the hash code matrix, and the epoch counter.
(3) repeat
(4)  for each iteration in the current epoch do
(5)   Randomly sample images from the image-query quadruplet set to construct a mini-batch of images.
(6)   For each image in the mini-batch, calculate its hash representation by forward propagation.
(7)   Update the hash representation matrix of the training images.
(8)   Calculate the derivative of the loss in equation (7) with respect to the image-modality network parameters.
(9)   Update the image-modality network parameters by utilizing backpropagation.
(10)  end for
(11)  for each iteration in the current epoch do
(12)   Randomly sample texts from the text-query quadruplet set to construct a mini-batch of texts.
(13)   For each text in the mini-batch, calculate its hash representation by forward propagation.
(14)   Update the hash representation matrix of the training texts.
(15)   Calculate the derivative of the loss in equation (7) with respect to the text-modality network parameters.
(16)   Update the text-modality network parameters by using backpropagation.
(17)  end for
(18)  Update the hash code matrix using equation (5).
(19) until the maximal number of epochs is reached.
3.1. Update the Image-Modality Network Parameters with the Text-Modality Network Parameters and the Hash Code Matrix Fixed

When the text-modality network parameters and the hash code matrix are kept fixed, we utilize stochastic gradient descent and backpropagation to optimize the image-modality deep neural network parameters.

3.2. Update the Text-Modality Network Parameters with the Image-Modality Network Parameters and the Hash Code Matrix Fixed

When we fix the image-modality network parameters and the hash code matrix, we use stochastic gradient descent and backpropagation to learn the text-modality deep neural network parameters.

3.3. Update the Hash Code Matrix with the Network Parameters of Both Modalities Fixed

When the parameters of both deep neural networks are kept unchanged, the hash code matrix can be optimized with equation (5).
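
The following PyTorch sketch condenses one epoch of this alternating optimization, reusing the cross_modal_quadruplet_loss and the network interfaces sketched in Section 2; the data pipeline, the quantization weight mu, and the closed-form update B = sign(F + G) are our assumptions based on the description above, not the paper's exact implementation.

import torch

def train_one_epoch(image_net, text_net, quads_i2t, quads_t2i, F_all, G_all, B,
                    mu=1.0, lr=1e-2):
    """quads_i2t yields (idx, img, txt_p, txt_n1, txt_n2) mini-batches, quads_t2i
    yields (idx, txt, img_p, img_n1, img_n2) mini-batches; idx indexes rows of
    the buffers F_all, G_all, and B."""
    opt_img = torch.optim.SGD(image_net.parameters(), lr=lr)
    opt_txt = torch.optim.SGD(text_net.parameters(), lr=lr)

    # Step 1: update the image network with the text network and B fixed.
    for idx, img, txt_p, txt_n1, txt_n2 in quads_i2t:
        f_q = image_net(img)
        F_all[idx] = f_q.detach()
        with torch.no_grad():                      # text network is fixed in this step
            g_p, g_n1, g_n2 = text_net(txt_p), text_net(txt_n1), text_net(txt_n2)
        loss = cross_modal_quadruplet_loss(f_q, g_p, g_n1, g_n2) \
               + mu * (B[idx] - f_q).pow(2).mean()  # quantization term
        opt_img.zero_grad(); loss.backward(); opt_img.step()

    # Step 2: update the text network with the image network and B fixed.
    for idx, txt, img_p, img_n1, img_n2 in quads_t2i:
        g_q = text_net(txt)
        G_all[idx] = g_q.detach()
        with torch.no_grad():                      # image network is fixed in this step
            f_p, f_n1, f_n2 = image_net(img_p), image_net(img_n1), image_net(img_n2)
        loss = cross_modal_quadruplet_loss(g_q, f_p, f_n1, f_n2) \
               + mu * (B[idx] - g_q).pow(2).mean()
        opt_txt.zero_grad(); loss.backward(); opt_txt.step()

    # Step 3: update the hash code matrix with both networks fixed.
    B = torch.sign(F_all + G_all)
    return B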

4. Experiments

4.1. Datasets

To investigate the performance of QDCMH, we conduct experiments on two benchmark cross-modal retrieval datasets, MIRFLICKR-25K [20] and Microsoft COCO2014 [21]; brief descriptions of the datasets are listed in Table 1.

4.2. Evaluation Metrics

In our experiments, we utilize mean average precision (MAP), top-N precision curves, and precision-recall (PR) curves as evaluation metrics; for detailed descriptions of these evaluation metrics, refer to [22, 23].
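
For reference, a minimal sketch of MAP computation for Hamming-ranking-based retrieval is given below; the function name, the use of full Hamming ranking, and the multi-hot relevance criterion (sharing at least one label) are our assumptions about how the metric is evaluated.

import torch

def mean_average_precision(query_codes, retrieval_codes, query_labels,
                           retrieval_labels, topk=None):
    """Codes are {-1, +1} float tensors of shape (num, code_length); labels are
    multi-hot {0, 1} float tensors of shape (num, num_classes)."""
    num_query = query_codes.size(0)
    ap_sum = 0.0
    for i in range(num_query):
        # ground truth: a retrieval item is relevant if it shares any category
        relevant = (retrieval_labels @ query_labels[i] > 0).float()
        # Hamming distance ranking (smaller distance ranks higher)
        hamming = 0.5 * (query_codes.size(1) - retrieval_codes @ query_codes[i])
        order = torch.argsort(hamming)
        relevant = relevant[order] if topk is None else relevant[order][:topk]
        num_relevant = relevant.sum()
        if num_relevant == 0:
            continue
        ranks = torch.arange(1, relevant.numel() + 1, dtype=torch.float)
        precision_at_hit = torch.cumsum(relevant, dim=0) / ranks
        ap_sum += (precision_at_hit * relevant).sum() / num_relevant
    return ap_sum / num_query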

4.3. Baselines and Implementation Details

We compare our proposed QDCMH method with eight state-of-the-art cross-modal hashing methods, including four handcrafted feature-based ones, i.e., the cross-modal similarity sensitive hashing (CMSSH) [7], semantics-preserving hashing (SePH) [9], semantic correlation maximization (SCM) [8], and generalized semantic preserving hashing (GSPH) [10] methods, and four deep feature-based ones, i.e., the deep cross-modal hashing (DCMH) [13], pairwise relationship guided deep hashing (PRDH) [14], self-supervised adversarial hashing (SSAH) [15], and triplet-based deep hashing (TDH) [16] methods. Most baseline methods are carefully implemented based on the codes provided by the authors; a few are implemented by us following the suggestions and descriptions of the original papers.

All the experiments are executed using the open-source deep learning framework PyTorch and run on an NVIDIA GTX Titan XP GPU server. In our experiments, the balance hyperparameters and the margins are set to fixed values, and the learning rate is initialized to a small value and gradually decreased over 500 epochs. For the handcrafted feature-based baselines, each image in the two datasets is represented by a bag-of-words (BoW) histogram feature vector with 512 dimensions. Throughout the experiments, image-to-text retrieval denotes using an image query to return texts, and text-to-image retrieval denotes using a text query to return images.

4.4. Performance Evaluation and Discussion

Firstly, we investigate the performance of QDCMH under different values of the two balance hyperparameters. To this end, we experiment on MIRFLICKR-25K with a fixed hash code length and record the corresponding MAPs under different hyperparameter values, as shown in Figure 3. We find that high performance is acquired for the hyperparameter values indicated in Figure 3.

Secondly, to validate the performance of QDCMH, we compare QDCMH with the baseline methods in terms of MAP on the MIRFLICKR-25K and MS-COCO2014 datasets. Table 2 presents the MAPs of each method for different hash code lengths, i.e., 16, 32, and 64. DSePH denotes the SePH method whose original image features are extracted by CNN-F. From Table 2, we can observe the following. (1) The MAPs of our proposed QDCMH are higher than those of most baseline methods in most cases, which demonstrates the superiority of QDCMH. We can also observe that SSAH outperforms our proposed QDCMH in most cases, which is partly because SSAH takes self-supervised learning and generative adversarial networks into account during the hash representation learning procedure. (2) The MAPs of QDCMH are always higher than those of TDH, which shows that the quadruplet loss can preserve semantic relevance better than the triplet loss in cross-modal hashing retrieval. (3) The MAPs of DSePH are always higher than those of SePH, which demonstrates that deep neural networks have a powerful feature learning capacity. (4) Our proposed QDCMH achieves better performance on the MS-COCO 2014 dataset than on the MIRFlickr-25K dataset, which is partly because the instances in MS-COCO 2014 belong to 80 categories while the instances in MIRFlickr-25K belong to 24 categories, and this makes the quadruplets generated from the MS-COCO 2014 dataset have better generalization ability than those generated from the MIRFlickr-25K dataset.

Thirdly, to further investigate the performance of QDCMH, we plot the precision-recall curves and top-N precision curves of QDCMH and the baseline methods with a hash code length of 64 on the MIRFLICKR-25K and Microsoft COCO2014 datasets, as presented in Figures 4 and 5. From these figures, we can see that the precision-recall curves and top-N precision curves are nearly consistent with the MAP results in Table 2.

5. Conclusions

In this paper, we introduce a quadruplet loss into deep cross-modal hashing to fully preserve the semantic relevance of the original cross-modal quadruplet instances and propose a quadruplet-based deep cross-modal hashing method (QDCMH). QDCMH integrates a quadruplet-based cross-modal semantic relevance preserving module, hash representation learning, and hash code generation into an end-to-end framework. Experiments on two benchmark cross-modal retrieval datasets demonstrate the effectiveness of our proposed QDCMH.

Data Availability

The experimental datasets and the related settings can be found at https://github.com/SWU-CS-MediaLab/MLSPH. The experimental codes used to support the findings of this study will be deposited in a GitHub repository after the publication of this paper or can be provided by xitaozou@sanxiau.edu.cn.

Conflicts of Interest

The authors declare that they have no conflicts of interest.