Mobile Information Systems

Research Article

RDMMFET: Representation of Dense Multimodality Fusion Encoder Based on Transformer

Figure 1

The overall structure of the RDMMFET model for learning visual and linguistic multimodality representations. The RDMMFET model consists of three parts: problem and image representation (a), encoder (b), and output representation (c).
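
The three-part layout named in the caption can be illustrated with a toy forward pass. This is only a minimal sketch under stated assumptions, not the RDMMFET architecture itself: the caption does not specify layer types, dimensions, or how the two modalities are fused, so the example assumes a single-head self-attention layer with identity projections over the concatenated text and image sequences, and mean pooling for the output. All names (`forward`, `params`, `token_ids`, `region_feats`) and all sizes are hypothetical.

```python
import math
import random


def linear(x, w):
    """Multiply a sequence of vectors x (n x d_in) by a matrix w (d_in x d_out)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention (identity Q/K/V projections)."""
    d = len(x[0])
    out = []
    for q in x:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x])
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def forward(token_ids, region_feats, params):
    # (a) Input representation: embed text tokens and project image region
    #     features to the same width (assumed scheme, not from the source).
    text = [params["embed"][i] for i in token_ids]
    image = linear(region_feats, params["proj"])

    # (b) Encoder: joint self-attention over the concatenated sequence,
    #     followed by a residual connection (assumed fusion strategy).
    fused = text + image
    attended = self_attention(fused)
    hidden = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(fused, attended)]

    # (c) Output representation: mean-pool the hidden states into one vector.
    d = len(hidden[0])
    pooled = [sum(row[j] for row in hidden) / len(hidden) for j in range(d)]
    return hidden, pooled


rng = random.Random(0)
D = 8  # hypothetical shared hidden width
params = {"embed": random_matrix(20, D, rng), "proj": random_matrix(16, D, rng)}
token_ids = [3, 7, 1]                      # three hypothetical text tokens
region_feats = random_matrix(4, 16, rng)   # four hypothetical image regions
hidden, pooled = forward(token_ids, region_feats, params)
```

Here `hidden` holds one vector per input position (text tokens followed by image regions) and `pooled` is a single joint vector; a real model would use learned projections, multiple heads, and stacked layers.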