Research Article

Integrating Temporal and Spatial Attention for Video Action Recognition

Table 1: Network structure of T-CNN.

Input: 3 × 16 × 224 × 224

Stage 1
Conv 3–32 + BN + ReLU
Conv 32–32 + BN + ReLU

Stage 2
Conv 32–64 + BN + ReLU
Conv 64–64 + BN + ReLU

Stage 3
Conv 64–96 + BN + ReLU
Conv 96–96 + BN + ReLU

Stage 4
Conv 96–160 + BN + ReLU
Conv 160–160 + BN + ReLU
Conv 160–160 + BN + ReLU
Dilation Conv(4) 160–160 + BN + ReLU
Conv 160–160 + BN + ReLU
Dilation Conv(4) 160–160 + BN + ReLU
Conv 160–160 + BN + ReLU
Dilation Conv(4) 160–160 + BN + ReLU
Conv 160–160 + BN + ReLU
Conv 160–160 + BN + ReLU
Conv 160–160 + BN + ReLU
Conv 160–160 + BN + ReLU
Conv 160–160 + BN + ReLU
Dilation Conv(4) 160–160 + BN + ReLU

Stage 5
Conv 160–224 + BN + ReLU
Conv 224–224 + BN + ReLU
Conv 224–224 + BN + ReLU
Dilation Conv(2) 224–224 + BN + ReLU
Conv 224–224 + BN + ReLU
Dilation Conv(4) 224–224 + BN + ReLU
Conv 224–224 + BN + ReLU
Dilation Conv(4) 224–224 + BN + ReLU

Stage 6
Conv 224–288 + BN + ReLU
Conv 288–288 + BN + ReLU
Conv 288–288 + BN + ReLU
Dilation Conv(2) 288–288 + BN + ReLU
Conv 288–288 + BN + ReLU
Dilation Conv(2) 288–288 + BN + ReLU
Conv 288–288 + BN + ReLU
Dilation Conv(2) 288–288 + BN + ReLU
Conv 288–288 + BN + ReLU
Conv 288–288 + BN + ReLU

Stage 7
Conv 288–512 + BN + ReLU
Conv 512–512 + BN + ReLU
Conv 512–512 + BN + ReLU
Dilation Conv(2) 512–512 + BN + ReLU
Global average pooling
Fully connected layer
Softmax
Classification result
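
The layer listing in Table 1 can be encoded as a small specification and checked for consistency. The following is a minimal sketch in plain Python, not the authors' implementation. The names `STAGES`, `ConvLayer`, `expand_layers`, `check_channel_chain`, and `conv_weight_count` are hypothetical. The table does not state kernel sizes, strides, or pooling, so the 3 × 3 × 3 kernel used in `conv_weight_count` is purely an assumption, and a dilation of 1 is used here to denote an ordinary convolution.

```python
from typing import List, NamedTuple, Sequence, Tuple


class ConvLayer(NamedTuple):
    stage: int
    in_channels: int
    out_channels: int
    dilation: int  # 1 denotes an ordinary (non-dilated) convolution


# One entry per stage of Table 1:
# (input channels, output channels, dilation of each Conv + BN + ReLU layer).
# The first layer of a stage changes the channel count; the rest keep it.
STAGES: Sequence[Tuple[int, int, List[int]]] = [
    (3, 32, [1, 1]),
    (32, 64, [1, 1]),
    (64, 96, [1, 1]),
    (96, 160, [1, 1, 1, 4, 1, 4, 1, 4, 1, 1, 1, 1, 1, 4]),
    (160, 224, [1, 1, 1, 2, 1, 4, 1, 4]),
    (224, 288, [1, 1, 1, 2, 1, 2, 1, 2, 1, 1]),
    (288, 512, [1, 1, 1, 2]),
]


def expand_layers(stages=STAGES) -> List[ConvLayer]:
    """Flatten the per-stage specification into one list of layers."""
    layers = []
    for stage_idx, (cin, cout, dilations) in enumerate(stages, start=1):
        for i, dilation in enumerate(dilations):
            layer_in = cin if i == 0 else cout
            layers.append(ConvLayer(stage_idx, layer_in, cout, dilation))
    return layers


def check_channel_chain(layers: List[ConvLayer]) -> bool:
    """True if every layer's output channels match the next layer's input."""
    return all(a.out_channels == b.in_channels
               for a, b in zip(layers, layers[1:]))


def conv_weight_count(cin: int, cout: int, kernel: int = 3) -> int:
    """Weights of one bias-free k x k x k 3D convolution.

    The kernel size is an assumption; Table 1 does not specify it.
    """
    return kernel ** 3 * cin * cout


LAYERS = expand_layers()
NUM_CONV = len(LAYERS)
NUM_DILATED = sum(1 for layer in LAYERS if layer.dilation > 1)
```

Under this encoding, `NUM_CONV` is 42 and `NUM_DILATED` is 11, and the final stage emits 512 channels, which is what global average pooling reduces before the fully connected layer and softmax.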