| Reference | SFI: the original representations of the individual modalities are each mapped into a common feature space to generate single features per modality | JFI: joint features are generated for the modalities from their original representations |
| --- | --- | --- |
| Rasiwasia et al. [2] | (1) Text and images are mapped into a correlative space using canonical correlation analysis (see the CCA sketch below). (2) Text and images are mapped into a semantic space by constructing vectors of posterior probabilities between the text (or images) and the document class labels. | — |
| Ngiam et al. [6] | The original representations of audio and video are fed into a bimodal autoencoder simultaneously to generate single deep features for both modalities (see the autoencoder sketch below). | — |
| Srivastava and Salakhutdinov [7] | — | Joint features for text and images are learned with a deep Boltzmann machine. |
| Jia et al. [8] | — | Joint features for two image modalities are generated with double broad learning and canonical correlation analysis. |
| Xiong et al. [11] | The original representations of images and audio are mapped into a semantic topic space using probabilistic latent semantic analysis (PLSA). | — |
| Tang et al. [12] | — | The original representations of signals and images are concatenated directly to form joint features for a support vector machine (see the concatenation sketch below). |
| Lin et al. [13] | Attention-based networks are used to generate single features for text and images. | The single features of text and images are fused with a tensor fusion method to generate joint features (see the tensor fusion sketch below). |
| Our method | (1) The single features of images are the image class labels generated by a convolutional neural network. (2) The single features of text are keyword vectors generated by a keyword extraction method. | (1) The single features of images and text are fused into a fusion matrix whose elements are "image class label - keyword" pairs. (2) The fusion matrix is transformed into a word embedding matrix using Word2Vec (see the pipeline sketch below). |
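For the first row, a minimal sketch of mapping text and images into a shared correlative space with canonical correlation analysis, in the spirit of Rasiwasia et al. [2]. The feature dimensions and random data are placeholder assumptions, and scikit-learn's `CCA` stands in for the original implementation.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
text_feats = rng.standard_normal((100, 50))    # placeholder text features per document
image_feats = rng.standard_normal((100, 128))  # placeholder visual features per image

# Fit CCA so that projections of the two modalities are maximally correlated.
cca = CCA(n_components=10)
cca.fit(text_feats, image_feats)

# Both modalities now live in the same 10-dimensional correlative space.
text_proj, image_proj = cca.transform(text_feats, image_feats)
print(text_proj.shape, image_proj.shape)  # (100, 10) (100, 10)
```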
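For Ngiam et al. [6], a minimal PyTorch sketch of a bimodal autoencoder: audio and video inputs are encoded into one shared hidden representation, which serves as the single deep feature for both modalities. The layer sizes, activation, and loss here are illustrative assumptions, not the original architecture.

```python
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    """Encodes concatenated audio and video inputs into one shared
    representation and decodes it back into both modalities."""
    def __init__(self, audio_dim=100, video_dim=300, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(audio_dim + video_dim, hidden_dim),
            nn.ReLU(),
        )
        self.audio_decoder = nn.Linear(hidden_dim, audio_dim)
        self.video_decoder = nn.Linear(hidden_dim, video_dim)

    def forward(self, audio, video):
        shared = self.encoder(torch.cat([audio, video], dim=1))
        return self.audio_decoder(shared), self.video_decoder(shared), shared

model = BimodalAutoencoder()
audio, video = torch.randn(8, 100), torch.randn(8, 300)
audio_rec, video_rec, shared = model(audio, video)  # `shared` is the single deep feature

# Reconstruction loss over both modalities drives the shared representation.
loss = nn.functional.mse_loss(audio_rec, audio) + nn.functional.mse_loss(video_rec, video)
loss.backward()
```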
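For Tang et al. [12], the joint feature is a direct concatenation of the two original representations, which is then fed to a support vector machine. The dimensions, random data, and SVM settings below are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
signal_feats = rng.standard_normal((200, 40))  # placeholder signal features
image_feats = rng.standard_normal((200, 60))   # placeholder image features
labels = rng.integers(0, 2, size=200)          # placeholder binary labels

# The joint feature is simply the two feature vectors stacked side by side.
joint_feats = np.concatenate([signal_feats, image_feats], axis=1)  # shape (200, 100)

clf = SVC(kernel="rbf")
clf.fit(joint_feats, labels)
print(clf.score(joint_feats, labels))
```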
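For Lin et al. [13], the sketch below uses the common outer-product formulation of tensor fusion, in which each single-feature vector is padded with a constant 1 so the result retains unimodal features alongside bimodal interactions; whether this matches the paper's exact fusion operator is an assumption, and the dimensions are illustrative.

```python
import torch

text_feat = torch.randn(8, 32)   # placeholder single features for text
image_feat = torch.randn(8, 48)  # placeholder single features for images

# Append a constant 1 to each vector so the outer product also keeps
# the original unimodal features.
ones = torch.ones(8, 1)
text_aug = torch.cat([text_feat, ones], dim=1)    # (8, 33)
image_aug = torch.cat([image_feat, ones], dim=1)  # (8, 49)

# Batched outer product gives the joint interaction tensor, flattened
# into one joint feature vector per sample.
joint = torch.einsum("bi,bj->bij", text_aug, image_aug).reshape(8, -1)
print(joint.shape)  # torch.Size([8, 1617])
```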
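Finally, a hedged sketch of our method's pipeline: CNN-predicted image class labels and extracted text keywords are paired into a fusion matrix of "image class label - keyword" entries, which Word2Vec then turns into a word embedding matrix. The labels, keywords, and Word2Vec settings (`vector_size`, `min_count`) are hypothetical stand-ins, not the exact configuration used in this paper.

```python
import numpy as np
from gensim.models import Word2Vec

# Assumed upstream outputs: one CNN-predicted class label per image and
# an extracted keyword vector for the accompanying text (all values
# here are hypothetical placeholders).
image_labels = ["cat", "dog", "cat"]
text_keywords = [["pet", "furry"], ["bark", "pet"], ["whiskers", "furry"]]

# Fusion matrix: each row pairs an image class label with that sample's keywords.
fusion_matrix = [[label] + keywords for label, keywords in zip(image_labels, text_keywords)]

# Word2Vec treats each fused row as a sentence; its vectors form the
# word embedding matrix over all fused tokens.
w2v = Word2Vec(sentences=fusion_matrix, vector_size=16, min_count=1, seed=0)
embedding_matrix = np.stack([w2v.wv[tok] for row in fusion_matrix for tok in row])
print(embedding_matrix.shape)  # (9, 16): one 16-d embedding per fused token
```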