Abstract

The heterogeneity gap between different modalities is the key problem in cross-modal retrieval. To overcome this gap, the potential correlations of different modalities need to be mined. At the same time, the semantic information of class labels can be used to reduce the semantic gap between data of different modalities and to realize the interdependence and interoperability of heterogeneous data. To fully exploit the potential correlations of different modalities, we propose a cross-modal retrieval framework based on graph regularization and modality dependence (GRMD). Firstly, considering potential feature correlation and semantic correlation, different projection matrices are learned for different retrieval tasks, such as image query text (I2T) and text query image (T2I). Secondly, the internal structure of the original feature space is utilized to construct an adjacency graph with semantic information constraints, which draws heterogeneous data of different labels closer to the corresponding semantic information. The experimental results on three widely used datasets demonstrate the effectiveness of our method.

1. Introduction

With the rapid growth of multimedia information, the forms in which information is represented become richer by the day in the era of big data. The ways people obtain information have also evolved to include newspapers, websites, Weibo, and WeChat. The rapid development of mobile networks provides a convenient resource platform, and people can search for a wealth of information using the search engines of various websites on mobile devices according to their own needs. However, the structures of the modal data available on mobile networks vary, making it difficult to display the needed information accurately on mobile devices. Most retrieval methods, such as text [1–3], image [4–7], and video [8–11] retrieval, focus on single-modality retrieval [12–15], in which the query sample and the retrieved samples must be of the same data type. Nowadays, the same thing can be expressed in different ways, and there is a growing demand for diversified forms of information expression. For example, when tourists are sightseeing, they record a wonderful journey by taking photos or recording videos; these photos and videos present the same content although they are different types of media objects. Similarly, information about singers and album images can be used to search for the corresponding songs and thus obtain more information about them. People retrieve image or video data related to given semantic information through text data, but the different dimensions and attributes of multimedia data lead to obvious feature heterogeneity between modalities. The practical application of large-scale data similarity retrieval therefore needs more effective solutions: the features of different modal data must be extracted effectively, and effective retrieval methods are needed to obtain accurate information from a large amount of data.

To solve the heterogeneity problem of cross-modal retrieval [16–20], subspace learning methods have been proposed. Although different modalities have different original feature spaces, such modalities can be projected into a common potential space [21]. Specifically, the most traditional feature learning method, canonical correlation analysis (CCA) [22], maximizes the correlation between the features of paired samples from two modalities and obtains highly correlated low-dimensional representations of the different modalities in a common potential space. CCA is a simple algorithm for associating feature spaces. Based on CCA, Hwang et al. proposed kernel canonical correlation analysis (KCCA) [19], which obtains the correlation between image and text through cross-view retrieval in a high-dimensional feature space. The partial least squares (PLS) [23] method measures the similarity between different modalities by mapping from the visual feature space to the text feature space. The potential correlation of cross-modal data obtained by the above methods through linear projection is limited and cannot effectively improve cross-modal retrieval performance. Moreover, unsupervised cross-media retrieval methods use only the pairwise information of different modalities during subspace learning, without exploiting accurate high-level semantic information. Another method, T-V CCA [20], obtains high-level semantics by considering the semantic class view as a third view, so the correlation between different modalities is enhanced by learning semantic information. Therefore, a linear regression term can be applied in a cross-modal retrieval framework to maintain the semantic structure and minimize the regression error of different modalities' data.

The deep learning method has a strong nonlinear learning ability. Deep canonical correlation analysis (DCCA) [24] combines DNNs and CCA to learn more complex nonlinear transformations between data of different modalities. Peng et al. proposed cross-media multiple deep networks (CMDNs) [25], which use hierarchical structures to combine independent representations of different modalities hierarchically. In addition, Wei et al. proposed deep semantic matching (deep-SM) [26], which uses CNN features for deep semantic matching to improve retrieval accuracy. The above methods use neural networks to measure the similarity of different modal data well but ignore the similarity within a single modality and the similarity between modalities. The complex latent correlations of different modalities' data can be learned well by using graph regularization. The application of graph regularization [27, 28] in cross-modal retrieval lies in the construction of a graph model, maintaining the similarity between the projected data through the edges of the graph. Graph regularization not only enhances semantic relevance but also learns intramodality and intermodality similarity. The cross-modal retrieval models mentioned above are learned through joint distributions in a common space; on the basis of subspace learning, the correlation between multimodal data can be further mined to improve cross-media retrieval performance.

In this paper, we propose a cross-modal retrieval framework (Figure 1) based on graph regularization and modality dependence (GRMD). The method measures the distances between different modalities’ projection matrices in the semantic subspace and obtains the similarity of different modalities. The projection matrices of different modalities belonging to the same label should be as similar as possible. In the process of feature mapping, two different projection matrices are mapped into their respective semantic spaces through two linear regressions. Correlation analysis can project original data into a potential subspace, and multimodal data of the same labels can be correlated.

The main advantages of our method can be summarized as follows:
(i) The construction of the label graph enhances the consistency between the internal structure of the heterogeneous data feature space and the semantic space. A graph model of the different modal data is constructed for each retrieval task, which not only maintains the similarity between different modal data after projection but also deepens the correlation between multimodal data and the corresponding semantic information.
(ii) Heterogeneous data are projected into the semantic space of a different modality in each retrieval task. In learning the different cross-modal tasks, different transformation matrices are obtained by combining semantic correlation and feature clustering. Media data of different modalities are thus mapped from low-level features to high-level semantics, and the accuracy of subspace learning is improved by using semantic information. This approach not only retains the similarity relationships of multimodal samples but also enables the semantic information to be understood more accurately during projection.
(iii) The results of experiments carried out on three datasets indicate that the proposed framework is superior to other advanced methods.

2. Related Work

We briefly introduce several related methods in this section. Most cross-modal retrieval methods focus on the joint modeling of different modalities, and image and text retrieval are the main subjects of cross-modal retrieval research. The representation features of different modalities are not only inconsistent but also located in different feature spaces. By learning a potential common subspace, data of different modalities are mapped from their traditionally heterogeneous spaces into a common isomorphic subspace for retrieval.

Subspace learning plays an important role in cross-modal problems, and the most traditional unsupervised method is canonical correlation analysis (CCA) [22], which maps heterogeneous data into an isomorphic subspace, maximizing the correlation between the two sets of paired features. It uses only the information of the multimodal pairs, ignoring the label information, so the retrieval result is not optimal. Heterogeneous data with the same semantics are interrelated in a common semantic space. After the data have been projected into the isomorphic feature space, the supervised method SCM [22], which combines CCA and SM, generates a common semantic space on top of the CCA representation by linear regression to improve retrieval performance. In addition to CCA, Sharma et al. proposed generalized multiview analysis (GMA) [29], which learns a common subspace through a supervised extension of CCA for cross-modal retrieval.

Learning only the potential relationships between different modalities' data provides limited improvement in retrieval performance. Retrieval methods based on deep learning [30] can better combine sample feature extraction with common-space learning to obtain better retrieval results. Andrew et al. proposed deep canonical correlation analysis (DCCA) [24], a nonlinear extension of CCA that learns complex nonlinear transformations of different modalities, using constraints on the corresponding subnetworks to make the data highly linearly correlated. Srivastava et al. proposed deep Boltzmann machines (DBMs) [31], an algorithm that learns generative models and thus enhances retrieval effectiveness. In addition, other deep models have been used for cross-modal retrieval by exploiting and enhancing the relevance of multimedia data. Peng et al. [32] proposed constructing a multipathway network that uses coarse-grained instances and fine-grained patches to improve cross-modal correlation, achieving the best performance. Cross-modal retrieval methods based on DNNs use the networks to learn the nonlinear relationships of different modalities, and the training data play a key role in the learning process. In [33], Huang et al. proposed the modal-adversarial hybrid transfer network (MHTN), an end-to-end architecture with a modal-sharing knowledge transfer subnetwork and a modal-adversarial semantic learning subnetwork, which enhances the semantic consistency of the data and aligns the different modalities with each other. Yu et al. proposed the graph in network (GIN) [34], which learns text representations through a graph convolutional network to capture more semantically related words. In the learning process, semantic information is promoted significantly, data information is extracted effectively, and retrieval accuracy is improved.

In addition, the different feature representations of different modalities' data make it difficult to establish effective correspondences between cross-modal data. Uniform sparse representations of different modalities' data can be obtained through dictionary learning, but accurate semantic relationships cannot be obtained through dictionary learning alone; semantic differences should therefore be reduced through semantic constraint methods. Semantic information is used to project the sparse representations of different modalities into the semantic space and perform cross-modal matching for more accurate understanding and retrieval. A dictionary learning algorithm [35, 36] proposed by Xu et al. learns a coupled dictionary, updating the dictionaries of the different modalities and obtaining the sparse representation corresponding to each modality's data. With the rapidly increasing availability of high-dimensional data, hash learning for cross-modal retrieval has emerged. Hash learning methods not only project high-dimensional data into a Hamming space but also preserve the original structure of the data features as much as possible. Multiscale correlation sequential cross-modal hashing learning (MCSCH) [37] is a multiscale feature-guided sequential hashing method that can mine correlations among the multiscale features of different modalities. In the process of cross-modal hash learning, the correlation of similar data is maximized and the correlation of dissimilar data is minimized.

The complex correlations between different modalities cannot be fully considered by the above methods, but cross-modal retrieval methods based on graph regularization [38] can learn the complex potential correlations of different modalities' data by building graph models. Graph regularization [39] is used to maintain intrapair and interpair correlations and to perform feature selection for different feature spaces. Zhai et al. proposed a joint graph regularized heterogeneous metric learning algorithm (JGRHML) [27] that considers heterogeneous relationships in a joint graph regularization framework; the algorithm optimizes the correlation and complementarity of different modalities' data and obtains related information between heterogeneous data through nearest neighbors. To improve the JGRHML algorithm, joint representation learning (JRL) [28], also proposed by Zhai et al., maintains the structural information of the original data through k-nearest neighbors and adds a semantic regularization term to integrate the semantic information of the original data. The cross-modal retrieval methods mentioned above, which use adjacency graphs to learn the potential space while maintaining multimodal feature correlations and local relationships, also significantly improve retrieval performance.

We propose a method based on modality dependence and graph regularization. In a common semantic subspace, data with the same semantics are drawn toward each other through their potential relationships. Wei et al. proposed a modality-dependent cross-media retrieval method [40], which focuses on the retrieval direction and uses the semantic information of the query modality to project the data into the query modality's semantic space. However, it considers only the direct correlation between different modalities and does not capture the nonlinear associations of low-level features well. Although this method cannot fully describe the complex correlations between different modalities' data, inspired by it, we use graph regularization to further analyze the potential correlations of the data. Compared with the abovementioned methods, we maintain the correlation between data structure information and semantic information by integrating modal data information into a semantic graph and by learning different projection matrices and semantic spaces for different retrieval tasks. The following sections explain how our method achieves good retrieval results.

The paper is organized as follows. Section 2 briefly introduces the relevant methods of cross-modal retrieval. In Section 3, the method we propose is described in detail. Section 4 presents our experimental results and the analysis of a comparison with other methods. Section 5 concludes this paper.

3. Modality-Dependent Cross-Modal Retrieval Based on Graph Regularization

In this section, we first introduce the notation and problem definitions associated with the objective function and then propose the overall cross-modal learning framework for GRMD. Finally, an effective iterative approach is proposed to complete this framework.

3.1. Notation and Problem Definition

Let X_I and X_T denote the feature matrices of the image data and the text data, respectively, and let Y denote the semantic matrix with C class labels. The i-th row of Y is the semantic vector corresponding to the i-th image-text pair: its entry for class j is 1 if the pair belongs to class j; otherwise, it is 0. The image projection matrix and the text projection matrix in I2T are denoted by U and V, respectively. The descriptions of important notations frequently used in this paper are listed in Table 1.

3.2. Objective Function

Our goal is to keep the semantic consistency of multimodal data while mapping data of different modalities into a common potential space. In different retrieval tasks, three important factors interact: semantic information, data correlation, and data structure distribution. Therefore, the semantic subspace is used as the common potential space in this paper. Through the association of the potential space and the semantic space, semantic information enables samples of the same category to be mapped to nearby locations. The objective function consists of four terms: a correlation analysis term that keeps samples of the same class close to each other; a linear regression term that maps data of different modalities into the semantic space; a graph regularization term that uses the modal graph to enhance the intramodality similarity; and a regularization term that preserves the stability of the projection matrices.
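As a reading aid, the four-term structure can be written schematically as follows; the symbols O_1, ..., O_4 and the projection matrices U, V are notational assumptions made here for illustration, not the paper's original symbols:

```latex
\min_{U,V}\;\mathcal{O}(U,V)
  = \underbrace{\mathcal{O}_{1}(U,V)}_{\text{correlation analysis}}
  + \underbrace{\mathcal{O}_{2}(U)}_{\text{linear regression}}
  + \underbrace{\mathcal{O}_{3}(U,V)}_{\text{graph regularization}}
  + \underbrace{\mathcal{O}_{4}(U,V)}_{\text{matrix regularization}}
```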

3.2.1. The First Term

The first term is a correlation analysis term that minimizes the difference between multimodal data in a potential subspace. Different modality data need to remain close to each other in potential subspaces. The representations of the paired heterogeneous data in the common subspace should be as similar as possible, and thus, the distance between the two should be as small as possible:
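A plausible form of this correlation term, written under assumed notation (X_I and X_T are the image and text feature matrices, and U and V are the corresponding projection matrices), is the Frobenius distance between the paired projections:

```latex
\mathcal{O}_{1}(U,V)=\left\|X_{I}U-X_{T}V\right\|_{F}^{2}
```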

This term reduces the distance between multimodal data of the same label, thus improving the correlation between them.

3.2.2. The Second Term

The second term is a linear regression, which transforms the feature space of query modality into semantic space. This term only considers the query modality semantic, which is more pertinent and effective than that of considering both the query modality semantics and the retrieval modality semantics. The improvement in the accuracy of the mapping of query modality data can ensure the accuracy of subsequent retrieval. Once the label of the query modality data has been incorrectly predicted, it is difficult to ensure that other related modalities data are retrieved in subsequent steps:
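For the I2T task, where the image is the query modality, this regression term plausibly takes the following form, with Y the semantic label matrix and λ a balance parameter (notation assumed here, not taken verbatim from the paper):

```latex
\mathcal{O}_{2}(U)=\lambda\left\|X_{I}U-Y\right\|_{F}^{2}
```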

This term focuses on the differences between different retrieval tasks and learns two different projection matrices for different retrieval tasks. It transforms the query modality data from the original feature space into the corresponding semantic space, and similar data are centrally distributed in the semantic subspace.

3.2.3. The Third Term

Here, we preserve the original distribution of different modalities' data in the common subspace as much as possible by adding a graph regularization term to the objective function, so that neighboring data points remain as close as possible to each other in the common subspace. We define an undirected symmetric graph G = (X, W), where X is the set of data points and W is the similarity matrix. The element W_ij is set to 1 if x_j belongs to the k nearest neighbors of x_i, obtained by computing the distances between data pairs in the original space and selecting the nearest k neighbors, and to 0 otherwise. The graph Laplacian is L = D − W, where L is a symmetric positive semidefinite matrix and D is a diagonal matrix whose diagonal elements are the row sums of W.
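The graph construction just described can be sketched in a few lines; `knn_graph_laplacian` is a hypothetical helper written for illustration (it builds a plain, unconstrained k-NN graph; the paper's label-constrained variant would add a label check when selecting neighbors):

```python
import numpy as np

def knn_graph_laplacian(X, k=5):
    """Build a symmetric k-NN similarity matrix W and the unnormalized
    graph Laplacian L = D - W, where D is the diagonal degree matrix."""
    n = X.shape[0]
    # pairwise squared Euclidean distances
    sq = np.sum(X ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2 * X @ X.T
    np.fill_diagonal(dist, np.inf)       # exclude each point from its own neighbors
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist[i])[:k]   # indices of the k nearest neighbors
        W[i, nbrs] = 1.0
    W = np.maximum(W, W.T)               # symmetrize the graph
    D = np.diag(W.sum(axis=1))
    L = D - W                            # unnormalized graph Laplacian
    return W, L
```

By construction, L is symmetric with zero row sums, which is what makes the trace-form regularizer penalize large gaps between projected neighbors.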

By constructing a local label graph for each modality through semantic information, the structure of the feature space can be made consistent with that of the label space. In the shift between different modalities, the internal structure of each modality is preserved so that data of different modalities with the same label remain as near as possible after mapping:

Similarly, we calculate the similarity matrix W, the diagonal matrix D, and the Laplacian matrix L of the text modality, and the graph regularization term of the text is defined as follows:
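With L_I and L_T denoting the image and text graph Laplacians, the two graph regularization terms can be sketched together in the standard trace form (symbols assumed; α is a balance parameter):

```latex
\mathcal{O}_{3}(U,V)=\alpha\left(\operatorname{tr}\!\left(U^{\top}X_{I}^{\top}L_{I}X_{I}U\right)
  +\operatorname{tr}\!\left(V^{\top}X_{T}^{\top}L_{T}X_{T}V\right)\right)
```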

3.2.4. The Fourth Term

The fourth term is the regularization term that controls the complexity of the projection matrices and prevents overfitting; its constraints therefore control the stability of the obtained values, and two parameters balance the regularization term.

For I2T: the algorithm learns a pair of projection matrices through the image query text (I2T) task, and the final objective function is expressed as follows:

For T2I: similarly, the objective function of T2I is expressed as follows:

As expressed by (11), this cross-modal retrieval problem retrieves related images based on the text modality. In contrast to (3), the linear regression term converts the text feature space into the text semantic space, rather than into the image semantic space as in I2T. The image projection matrix and the text projection matrix in T2I are denoted separately from those used in I2T.
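Collecting the four terms, the two task-specific objectives can be sketched as follows; β and γ stand in for the two regularization weights, and all symbols here are assumptions consistent with MDCR-style formulations rather than the paper's original notation:

```latex
\text{I2T:}\; \min_{U,V}\ \|X_I U-X_T V\|_F^2+\lambda\|X_I U-Y\|_F^2
  +\alpha\big(\operatorname{tr}(U^{\top}X_I^{\top}L_I X_I U)+\operatorname{tr}(V^{\top}X_T^{\top}L_T X_T V)\big)
  +\beta\|U\|_F^2+\gamma\|V\|_F^2
```

```latex
\text{T2I:}\; \min_{U,V}\ \|X_I U-X_T V\|_F^2+\lambda\|X_T V-Y\|_F^2
  +\alpha\big(\operatorname{tr}(U^{\top}X_I^{\top}L_I X_I U)+\operatorname{tr}(V^{\top}X_T^{\top}L_T X_T V)\big)
  +\beta\|U\|_F^2+\gamma\|V\|_F^2
```

The only difference between the two is which modality's projection is regressed onto the label matrix Y.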

3.3. Iterative Optimization for the Proposed Algorithm

Both (10) and (11) are nonconvex optimization problems, so we design an iterative algorithm to find fixed points. We observe that (10) is convex in either projection matrix when the other is fixed, and the same holds for (11). Therefore, we minimize the objective by alternately fixing one projection matrix and updating the other.

First, we compute the partial derivative of the objective with respect to the image projection matrix and set it to 0:

Similarly, we compute the partial derivative of the objective with respect to the text projection matrix and set it to 0:

According to the above formula, the resulting solutions are, respectively, as follows:

Similarly, for T2I, we take the partial derivatives of the objective with respect to the two projection matrices, which are updated iteratively until the results converge:

The main optimization procedure of the method we present for I2T is given in Algorithm 1, and the T2I task is similar to the I2T task.

Input: training image datasets ;
Training text datasets ;
Semantic sets
Balancing parameters λ, α, and two regularization parameters
Output: projection matrices and .
1: calculate the graph Laplacian matrix ;
2: initialize and to be identity matrices;
3: repeat
4: fix and update according to (14);
5: fix and update according to (15);
6: until convergence
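Under an assumed I2T objective of the MDCR style — a correlation term ||X_I U − X_T V||², a regression term λ||X_I U − Y||², graph terms weighted by α, and Frobenius regularizers weighted by β and γ — the alternating updates of Algorithm 1 can be implemented directly. This is an illustrative sketch: the function names and the closed-form update equations are derived from this assumed objective, not copied verbatim from the paper's equations (14) and (15).

```python
import numpy as np

def grmd_i2t(XI, XT, Y, LI, LT, lam=1.0, alpha=0.1, beta=0.1, gamma=0.1, iters=20):
    """Alternating minimization for an assumed I2T objective:
        ||XI U - XT V||^2 + lam ||XI U - Y||^2
        + alpha (tr(U' XI' LI XI U) + tr(V' XT' LT XT V))
        + beta ||U||^2 + gamma ||V||^2
    Each subproblem is a convex quadratic, so fixing one projection
    matrix yields a closed-form update for the other."""
    dI, dT, c = XI.shape[1], XT.shape[1], Y.shape[1]
    U, V = np.zeros((dI, c)), np.zeros((dT, c))
    # coefficient matrices of the two linear systems (constant across iterations)
    AI = (1 + lam) * XI.T @ XI + alpha * XI.T @ LI @ XI + beta * np.eye(dI)
    AT = XT.T @ XT + alpha * XT.T @ LT @ XT + gamma * np.eye(dT)
    for _ in range(iters):
        U = np.linalg.solve(AI, XI.T @ XT @ V + lam * XI.T @ Y)  # fix V, update U
        V = np.linalg.solve(AT, XT.T @ XI @ U)                   # fix U, update V
    return U, V

def i2t_objective(XI, XT, Y, LI, LT, U, V, lam, alpha, beta, gamma):
    """Value of the assumed objective, useful for monitoring convergence."""
    return (np.linalg.norm(XI @ U - XT @ V) ** 2
            + lam * np.linalg.norm(XI @ U - Y) ** 2
            + alpha * (np.trace(U.T @ XI.T @ LI @ XI @ U)
                       + np.trace(V.T @ XT.T @ LT @ XT @ V))
            + beta * np.linalg.norm(U) ** 2
            + gamma * np.linalg.norm(V) ** 2)
```

Because every update exactly minimizes a convex subproblem, the objective value is nonincreasing across sweeps, which matches the convergence behavior reported in Section 4.5.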

4. Experiments

In this section, the proposed method is tested experimentally on three datasets. We evaluate it by comparison with other advanced methods.

4.1. Datasets

Three datasets detailed below are chosen for the experiment.

4.1.1. Wikipedia

The Wikipedia dataset [22] consists of 2,866 image-text pairs belonging to 10 semantic categories, selected from 2,700 "featured articles." The dataset is randomly divided into a training set of 2,173 image-text pairs and a test set of 693 image-text pairs, both labeled with the 10 semantic classes. Image features are represented by 4096-dimensional CNN visual features, while text features are represented by 100-dimensional LDA features.

4.1.2. Pascal Sentence

The Pascal sentence dataset [26] consists of 1000 image-text pairs from 20 semantic categories. In each semantic category, there are 50 image-text pairs, 30 of which are selected as training pairs, and the rest are used as test pairs for each class. We represent image features by extracting 4096-dimensional CNN visual features and represent text features by 100-dimensional LDA text features.

4.1.3. INRIA-Websearch

The INRIA-Websearch dataset [41] contains 71,478 image-text pairs from 353 semantic categories, from which 14,698 image-text pairs are formed by selecting the 100 largest categories. This dataset is randomly divided into 70% of pairs used as a training set and 30% used as a test set. Each image and text are represented by a 4096-dimensional CNN visual feature and a 1000-dimensional LDA feature, respectively.

4.2. Experimental Settings

The Euclidean distance is used to compute the similarity of data features after the multimedia data are projected into the common subspace. To evaluate the results of cross-modal retrieval, we consider the widely used mean average precision (MAP) [22] scores and precision-recall (PR) curves. Specifically, the average precision (AP) of each query is obtained, and the mean of these values gives the MAP score. For a query, AP = (1/R) Σ P(k) · rel(k), summed over the ranked list, where n is the size of the test set and R is the number of relevant items; rel(k) = 1 if the item at rank k is relevant and rel(k) = 0 otherwise; and P(k) is the precision of the top k returned results, that is, the number of relevant items in the top k returns divided by k. To evaluate the performance of the proposed GRMD retrieval method, we compare GRMD with canonical correlation analysis (CCA) [22], kernel canonical correlation analysis (KCCA) [19], semantic matching (SM) [22], semantic correlation matching (SCM) [22], three-view canonical correlation analysis (T-V CCA) [42], generalized multiview linear discriminant analysis (GMLDA) [29], generalized multiview marginal Fisher analysis (GMMFA) [29], modality-dependent cross-media retrieval (MDCR) [40], joint feature selection and subspace learning (JFSSL) [43], joint latent subspace learning and regression (JLSLR) [44], generalized semisupervised structured subspace learning (GSSSL) [45], a cross-media retrieval algorithm based on the consistency of collaborative representation (CRCMR) [46], cross-media retrieval based on linear discriminant analysis (CRLDA) [47], and the cross-modal online low-rank similarity (CMOLRS) function learning method [48]. The descriptions and characteristics of the above comparison methods used in the whole experiment are summarized in Table 2.
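The MAP computation described above is straightforward to implement; `average_precision` below is a hypothetical helper that assumes its input is a list of relevance flags in ranked order (1 = the returned item shares the query's label):

```python
import numpy as np

def average_precision(relevance):
    """AP = (1/R) * sum_k P(k) * rel(k), where rel(k) is 1 if the item at
    rank k is relevant, P(k) is the precision of the top-k returns, and
    R is the total number of relevant items."""
    rel = np.asarray(relevance, dtype=float)
    R = rel.sum()
    if R == 0:
        return 0.0  # no relevant items: define AP as 0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / R)

def mean_average_precision(ranked_relevances):
    """MAP: mean of the per-query AP scores."""
    return float(np.mean([average_precision(r) for r in ranked_relevances]))
```

For example, a ranked list with relevance flags [1, 0, 1] yields AP = (1 + 2/3) / 2 = 5/6.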

4.3. Experimental Results

The experiments cover two cross-media retrieval subtasks: I2T and T2I. Traditional distance metrics are used to measure the similarity of objects from different modalities. The experiments were carried out on three datasets, and Tables 3–5 show the results for each. Below, we also study the effects of different parameter settings on the performance of GRMD.

In the experiment on the Wikipedia dataset, we set the four parameters to task-specific values for I2T and T2I. The MAP scores we obtained on the I2T and T2I tasks are shown in Table 3. Figures 2(a) and 2(b) show the MAP scores on the Wikipedia dataset for the different retrieval tasks, and Figure 2(c) shows the MAP scores for different labels as an indication of average performance. Figures 3(a) and 3(b) show the precision-recall curves for the two retrieval tasks, I2T and T2I. The results show that CCA and KCCA do not use semantic information, so their retrieval performance is poor, while SM considers only the semantic information and ignores the correlation of the data. Our approach combines data correlation and semantic information to learn from heterogeneous data, so good retrieval performance can be achieved.

In the experiment on the Pascal Sentence dataset, we again set the four parameters to task-specific values for I2T and T2I. The MAP scores that we obtained on the I2T and T2I tasks are shown in Table 4. Figures 2(d) and 2(e) show the MAP scores on the Pascal sentence dataset for the different retrieval tasks, and Figure 2(f) shows the MAP scores for different labels as an indication of average performance. Figures 3(c) and 3(d) show the precision-recall curves for the two retrieval tasks, I2T and T2I. From the experimental results of SCM, T-V CCA, GMLDA, GMMFA, CMOLRS, and MDCR, it can be concluded that although they all consider data correlation and semantic information, the MAP scores of MDCR are higher because it learns different semantic subspaces for different retrieval tasks. However, these methods do not fully capture the complex correlations of heterogeneous data. Our method not only projects data into different semantic subspaces for different tasks but also maintains the similarity between projected heterogeneous data well by constructing adjacency graphs. The results show that considering different retrieval tasks while maintaining the similarity of heterogeneous data is necessary.

In the experiment on the INRIA-Websearch dataset, we likewise set the four parameters to task-specific values for I2T and T2I. The MAP scores that we obtained on the I2T and T2I tasks are shown in Table 5. Even with the increased number of semantic categories, the retrieval performance of our method remains very good. CRLDA considers only the discriminability of text features. JFSSL, JLSLR, and GSSSL validate the effectiveness of adjacency graphs by considering the complex similarity of heterogeneous data. Our method not only considers semantic information when constructing the adjacency graphs but also constructs the corresponding semantic graph for each query modality. We observe that the MAP score of JFSSL on the T2I task is higher than that of our method; this result may be due to its feature selection for heterogeneous data. Figures 3(e) and 3(f) show the precision-recall curves for the two retrieval tasks, I2T and T2I. A comparison with other methods shows that our method has a certain stability and performs well on retrieval tasks.

All the tables and figures below show our experimental results. The effectiveness of our method stems from two aspects. On the one hand, the relationship between paired images and texts is taken into account, while only the semantics of the query object are considered. On the other hand, the semantic constraints make better use of the local correlations of the feature graph and thus improve the retrieval accuracy.

4.4. Parameter Sensitivity

In this subsection, we evaluate the robustness of our approach, which involves four parameters: λ and α are balance parameters, while the other two are regularization parameters. In the experiments, we observe that, as parameter λ varies, the retrieval performance of the different retrieval tasks is stable over a wide range. Taking the results on the Pascal sentence dataset as an example, we set the remaining three parameters to different values for the different retrieval tasks to test the sensitivity to their values, tuning each over a range of candidate values. In each experiment, one parameter is fixed to observe the performance variation with the other two. Figures 4(a), 4(c), and 4(e) show the performance variations for I2T, and Figures 4(b), 4(d), and 4(f) show the performance variations for T2I. The figures show that our method is insensitive to these three parameters, and its performance is relatively stable.

4.5. Convergence Experiment

In this subsection, we propose an iterative optimization approach for the objective function. It is important to test its convergence during iterations. Figures 5(a) and 5(b) show convergence curves for the Pascal sentence dataset for I2T and T2I, respectively. The corresponding MAP scores tend to be stable as the number of iterations increases. The proposed approach can achieve nearly stable values within approximately seven iterations. Therefore, our approach can converge effectively and offers a stable performance.

4.6. Ablation Experiment

In Table 6, method "A" removes the graph regularization term from our approach; that is, it uses only correlation analysis and linear regression on the features of the image and text data. The samples of different modalities are mapped into a common semantic subspace so that multimodal data with the same label can be aggregated. Method "B" removes the correlation analysis term from our approach; that is, it does not fully enforce that paired data with the same label be close in the potential space, although it still maintains the internal structure information of the heterogeneous features.

The experimental results show the effectiveness of our method. First, the correlation between modalities is used to determine the corresponding projections and to associate the data of the different modalities. Second, the construction of label graphs preserves the internal structural information of the original data very well. The heterogeneous features of multimodal data are projected into a common subspace, and the multimodal data of the same label are aggregated.

5. Conclusions

In this paper, we propose a cross-modal retrieval method based on graph regularization and modality dependence (GRMD). This method combines the internal structure of the feature space and the semantic space to construct label graphs of heterogeneous data, which brings the features of different modalities closer to their true labels and thus enriches the semantic information of similar data features. In addition, our method learns different projection matrices for different query tasks and takes into account the feature correlation and semantic correlation between isomorphic and heterogeneous data features. The experimental results show that GRMD performs better than other advanced methods on cross-modal retrieval tasks. In the future, we will focus on the local and global structure of the heterogeneous data feature distribution and continue to improve the retrieval framework.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (grant nos. 61772322, 61572298, 61702310, and 61873151).