Based on the encoder-decoder framework, this method introduces reference information to guide the model to generate a more descriptive sentence for a given image. The generated sentences sound more natural, but their expression remains somewhat rigid and overall performance is weak.
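As a rough illustration of the idea above, the following sketch conditions a toy decoder on both image features and an embedding of the reference sentence by concatenation. All names, dimensions, and the concatenation-based fusion are illustrative assumptions, not the cited method's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_image(pixels, W_img):
    # Toy "encoder": project flattened pixels to a feature vector.
    return np.tanh(pixels @ W_img)

def decode_step(img_feat, ref_feat, h, W_in, W_h):
    # One decoder step conditioned on the image features AND a
    # reference-sentence embedding (the extra guidance signal),
    # fused here by simple concatenation (an assumption).
    x = np.concatenate([img_feat, ref_feat])
    return np.tanh(x @ W_in + h @ W_h)

d_img, d_ref, d_h = 8, 8, 16
W_img = rng.normal(size=(32, d_img))
W_in = rng.normal(size=(d_img + d_ref, d_h))
W_h = rng.normal(size=(d_h, d_h))

pixels = rng.normal(size=32)
ref_feat = rng.normal(size=d_ref)   # embedding of the reference sentence
img_feat = encode_image(pixels, W_img)

h = np.zeros(d_h)
for _ in range(3):                  # three decoding steps
    h = decode_step(img_feat, ref_feat, h, W_in, W_h)
print(h.shape)  # (16,)
```

In a real captioner each decoder state would also feed a softmax over the vocabulary; the sketch only shows where the reference signal enters.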
To extract keywords from the original documents, this method proposes a keyword extraction algorithm based on a probabilistic neural network and a visual attention mechanism. It is strong at extracting context-rich keyword information, but the algorithm is time-consuming and its complexity still needs to be optimized.
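A minimal sketch of attention-based keyword scoring: each candidate word is scored by softmax attention against a query vector. This is a generic stand-in for the cited attention mechanism; the word list, vectors, and query here are fabricated for illustration.

```python
import numpy as np

def attention_keyword_scores(word_vecs, query):
    # Softmax attention: words whose vectors align with the query
    # receive higher weights; these weights serve as keyword scores.
    logits = word_vecs @ query
    exp = np.exp(logits - logits.max())   # subtract max for stability
    return exp / exp.sum()

words = ["neural", "network", "the", "attention"]  # toy candidates
rng = np.random.default_rng(1)
word_vecs = rng.normal(size=(4, 6))   # toy word embeddings
query = rng.normal(size=6)            # toy query/context vector

scores = attention_keyword_scores(word_vecs, query)
top_keyword = words[int(scores.argmax())]  # highest-scoring candidate
print(scores.sum())                        # softmax weights sum to 1
```

The softmax pass is O(n) per query, which hints at why repeated attention over long documents becomes the time-consuming step noted above.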
This method generates perceptual text information from both visual and textual information, and implements a generic machine translation model by controlling the proportion of visual information within the overall multimodal information.
It makes full use of the multimodal text information and achieves higher accuracy in identifying the semantic meaning of new words. However, the dataset it uses is small, and the available data needs to be expanded to enhance the expressive power of the model.
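The idea of controlling the proportion of visual information can be sketched as a scalar gate that blends the two modality features. The gate `alpha` below is a hypothetical simplification; the actual method's gating mechanism may be learned and vector-valued.

```python
import numpy as np

def fuse(text_feat, visual_feat, alpha):
    # Blend the two modalities; alpha is the proportion of visual
    # information in the fused multimodal representation.
    # alpha = 0.0 reduces the model to text-only translation.
    return (1.0 - alpha) * text_feat + alpha * visual_feat

text_feat = np.ones(4)     # toy text feature vector
visual_feat = np.zeros(4)  # toy visual feature vector

fused_text_only = fuse(text_feat, visual_feat, alpha=0.0)  # pure text
fused_balanced = fuse(text_feat, visual_feat, alpha=0.5)   # equal mix
print(fused_balanced)  # [0.5 0.5 0.5 0.5]
```

Setting `alpha` to zero recovers a purely text-based translator, which is how such a gate yields a "generic" model as a special case.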