Research Article

Research on Video Captioning Based on Multifeature Fusion

Figure 2

Architecture of MM-V2T, the multimodal video content text generation model. MM-V2T is composed of four parts: video preprocessing, single-modal feature extraction, encoding (single-modal information embedding followed by multimodal information fusion), and decoding.
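The flow of data through the parts named in the caption can be illustrated with a minimal sketch. It is not the authors' implementation: every function name, the two modalities ("appearance" and "motion"), the zero-padding embedding, the concatenation-based fusion, and the toy decoder are assumptions chosen only to make the order of the stages concrete.

```python
from typing import Dict, List

Vector = List[float]


def preprocess(video: List[Vector], step: int = 2) -> List[Vector]:
    """Video preprocessing (assumed here: uniform frame sampling)."""
    return video[::step]


def extract_features(frames: List[Vector]) -> Dict[str, Vector]:
    """Single-modal feature extraction: one placeholder vector per modality."""
    n, dim = len(frames), len(frames[0])
    # Hypothetical "appearance" feature: mean over the sampled frames.
    appearance = [sum(f[i] for f in frames) / n for i in range(dim)]
    # Hypothetical "motion" feature: last sampled frame minus the first.
    motion = [frames[-1][i] - frames[0][i] for i in range(dim)]
    return {"appearance": appearance, "motion": motion}


def embed(feature: Vector, size: int) -> Vector:
    """Single-modal embedding (assumed: zero-pad or truncate to a common size)."""
    return (feature + [0.0] * size)[:size]


def fuse(embedded: Dict[str, Vector]) -> Vector:
    """Multimodal fusion (assumed: concatenation in sorted modality order)."""
    fused: Vector = []
    for name in sorted(embedded):
        fused.extend(embedded[name])
    return fused


def decode(fused: Vector, vocab: List[str], max_len: int = 3) -> List[str]:
    """Toy decoder: maps the positions of the largest fused values to words."""
    ranked = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return [vocab[i % len(vocab)] for i in ranked[:max_len]]


def caption(video: List[Vector], vocab: List[str], embed_size: int = 4) -> List[str]:
    """Runs the stages in the order given in the caption."""
    frames = preprocess(video)
    features = extract_features(frames)
    embedded = {name: embed(vec, embed_size) for name, vec in features.items()}
    return decode(fuse(embedded), vocab)


if __name__ == "__main__":
    # Four fake "frames", each a 2-dimensional vector.
    toy_video = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
    print(caption(toy_video, ["a", "man", "runs"]))
```

In a real system each placeholder would be a learned network, and the decoder would be a language model conditioned on the fused representation; the sketch only shows how the output of each stage feeds the next.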