Research Article

Integrating Temporal and Spatial Attention for Video Action Recognition

Table 2

Comparison with other methods on the UCF101 dataset.

| Model | Pretraining dataset | Accuracy (%) | GFLOPs |
|---|---|---|---|
| C3D [20] | Sports-1M | 82.3 | 38.57 |
| TRN [21] | - | 83.5 | 83.83 |
| Res3D [22] | Sports-1M | 85.8 | - |
| P3D [23] | ImageNet + Sports-1M | 88.6 | 18.51 |
| T3D [24] | Kinetics-400 | 90.3 | - |
| TSN [8] | ImageNet + Kinetics-400 | 91.1 | 80 |
| R(2+1)D [25] | Sports-1M | 93.6 | 41.69 |
| TSM [26] | Kinetics-400 | 95.5 | 32.88 |
| I3D RGB [27] | ImageNet + Kinetics-400 | 95.6 | 108 |
| T-CNN [12] | Kinetics-400 | 95.3 | 15.78 |
| T-CNN + spatial | Kinetics-400 | 96.7 | 52.3 |