Research Article
Hierarchical Attention-Based Multimodal Fusion Network for Video Emotion Recognition
Table 5
Emotion recognition accuracy of different modalities on the Ekman and VideoEmotion-8 datasets.
| Convolution layers | Ekman: Event (%) | Ekman: Object (%) | Ekman: Scene (%) | VideoEmotion-8: Event (%) | VideoEmotion-8: Object (%) | VideoEmotion-8: Scene (%) |
| --- | --- | --- | --- | --- | --- | --- |
| No attention | 42.45 | 36.43 | 40.95 | 48.10 | 46.45 | 46.33 |
| L1 | 44.14 | 41.42 | 44.41 | 51.34 | 49.88 | 49.14 |
| L2 | 45.78 | 41.14 | 44.69 | 53.18 | 49.63 | 49.39 |
| L3 | 45.23 | 40.33 | 43.60 | 52.81 | 48.90 | 49.02 |