Research Article
Two-Level Multimodal Fusion for Sentiment Analysis in Public Security
Figure 1
Overall architecture of TlMF. The first stage, data preparation, converts the raw data into unimodal sequences for the text, audio, and video modalities. The second stage learns unimodal features by extracting features from each of these sequences. A tensor fusion layer then fuses the text-based audio feature and the text-based video feature. Finally, a decision fusion layer is employed to improve the accuracy of classification and prediction in the sentiment analysis task.
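The two fusion steps named in the caption can be illustrated with a minimal sketch. It is not the TlMF implementation. It assumes that tensor fusion is an outer product of the two feature vectors, each extended with a constant 1, and that decision fusion is a weighted average of per-classifier class probabilities. The function names `tensor_fusion` and `decision_fusion`, the vector sizes, and the weights are all hypothetical.

```python
from typing import List, Sequence


def tensor_fusion(text_audio: Sequence[float],
                  text_video: Sequence[float]) -> List[float]:
    """Fuse two feature vectors via a flattened outer product (assumed form).

    A constant 1.0 is appended to each input so the result keeps the
    original features alongside the pairwise interaction terms.
    """
    a = list(text_audio) + [1.0]
    v = list(text_video) + [1.0]
    return [x * y for x in a for y in v]


def decision_fusion(probabilities: Sequence[Sequence[float]],
                    weights: Sequence[float]) -> List[float]:
    """Combine class probabilities by a normalized weighted average (assumed rule)."""
    total = sum(weights)
    n_classes = len(probabilities[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]


# Toy inputs: a 2-d text-based audio feature and a 3-d text-based video feature.
fused = tensor_fusion([0.5, -1.0], [2.0, 0.0, 1.0])
print(len(fused))  # (2 + 1) * (3 + 1) = 12 fused values

# Two hypothetical classifiers voting over two sentiment classes.
combined = decision_fusion([[0.75, 0.25], [0.25, 0.75]], weights=[1.0, 1.0])
print(combined)
```

In this sketch the fused dimensionality grows as the product of the input sizes, which is the usual cost of outer-product fusion. The averaging rule is only one of several possible decision-level combination schemes.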