Research Article
Integrating Temporal and Spatial Attention for Video Action Recognition
Table 2
Comparison with other methods on the UCF101 dataset.
| Model | Pretraining dataset | Accuracy (%) | GFLOPs |
| --- | --- | --- | --- |
| C3D [20] | Sports-1M | 82.3 | 38.57 |
| TRN [21] | — | 83.5 | 83.83 |
| Res3D [22] | Sports-1M | 85.8 | — |
| P3D [23] | ImageNet + Sports-1M | 88.6 | 18.51 |
| T3D [24] | Kinetics-400 | 90.3 | — |
| TSN [8] | ImageNet + Kinetics-400 | 91.1 | 80 |
| R(2+1)D [25] | Sports-1M | 93.6 | 41.69 |
| TSM [26] | Kinetics-400 | 95.5 | 32.88 |
| I3D RGB [27] | ImageNet + Kinetics-400 | 95.6 | 108 |
| T-CNN [12] | Kinetics-400 | 95.3 | 15.78 |
| T-CNN + spatial | Kinetics-400 | 96.7 | 52.3 |