Research Article
RDMMFET: Representation of Dense Multimodality Fusion Encoder Based on Transformer
Figure 1
The overall structure of the RDMMFET model for learning multimodal visual and linguistic representations. The model consists of three parts: (a) problem and image representation, (b) the encoder, and (c) output representation.