Research Article

Hierarchical Attention-Based Multimodal Fusion Network for Video Emotion Recognition

Table 5

Emotion recognition accuracy of different modalities on the Ekman and VideoEmotion-8 datasets.

                    Ekman                                VideoEmotion-8
Convolution layers  Event (%)  Object (%)  Scene (%)     Event (%)  Object (%)  Scene (%)

No attention        42.45      36.43       40.95         48.10      46.45       46.33
L1                  44.14      41.42       44.41         51.34      49.88       49.14
L2                  45.78      41.14       44.69         53.18      49.63       49.39
L3                  45.23      40.33       43.60         52.81      48.90       49.02