Research Article
Integrating Temporal and Spatial Attention for Video Action Recognition
Table 1
Network structure of T-CNN.
| Stage | Layers (in listed order) |
| --- | --- |
| Input | 3 × 16 × 224 × 224 |
| Stage 1 | Conv 3–32 + BN + ReLU; Conv 32–32 + BN + ReLU |
| Stage 2 | Conv 32–64 + BN + ReLU; Conv 64–64 + BN + ReLU |
| Stage 3 | Conv 64–96 + BN + ReLU; Conv 96–96 + BN + ReLU |
| Stage 4 | Conv 96–160 + BN + ReLU; Conv 160–160 + BN + ReLU; Conv 160–160 + BN + ReLU; Dilation Conv(4) 160–160 + BN + ReLU; Conv 160–160 + BN + ReLU; Dilation Conv(4) 160–160 + BN + ReLU; Conv 160–160 + BN + ReLU; Dilation Conv(4) 160–160 + BN + ReLU; Conv 160–160 + BN + ReLU; Conv 160–160 + BN + ReLU; Conv 160–160 + BN + ReLU; Conv 160–160 + BN + ReLU; Conv 160–160 + BN + ReLU; Dilation Conv(4) 160–160 + BN + ReLU |
| Stage 5 | Conv 160–224 + BN + ReLU; Conv 224–224 + BN + ReLU; Conv 224–224 + BN + ReLU; Dilation Conv(2) 224–224 + BN + ReLU; Conv 224–224 + BN + ReLU; Dilation Conv(4) 224–224 + BN + ReLU; Conv 224–224 + BN + ReLU; Dilation Conv(4) 224–224 + BN + ReLU |
| Stage 6 | Conv 224–288 + BN + ReLU; Conv 288–288 + BN + ReLU; Conv 288–288 + BN + ReLU; Dilation Conv(2) 288–288 + BN + ReLU; Conv 288–288 + BN + ReLU; Dilation Conv(2) 288–288 + BN + ReLU; Conv 288–288 + BN + ReLU; Dilation Conv(2) 288–288 + BN + ReLU; Conv 288–288 + BN + ReLU; Conv 288–288 + BN + ReLU |
| Stage 7 | Conv 288–512 + BN + ReLU; Conv 512–512 + BN + ReLU; Conv 512–512 + BN + ReLU; Dilation Conv(2) 512–512 + BN + ReLU; Global average pooling; Fully connected layer; Softmax; Classification result |
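
As a consistency check on Table 1, the convolutional rows can be transcribed into a small data structure and inspected programmatically. The following is a minimal sketch, not the authors' implementation. The names `Layer`, `conv`, `dil`, `build_stages`, `check_channels`, and `summarize` are hypothetical helpers introduced here. It assumes that "Conv a–b" denotes input and output channel counts and that the number in parentheses after "Dilation Conv" is the dilation rate. Kernel sizes, strides, and pooling between stages are not given in the table and are therefore not modeled; the classification head (global average pooling, fully connected layer, softmax) is likewise omitted.

```python
from typing import List, NamedTuple, Tuple


class Layer(NamedTuple):
    kind: str      # "conv" or "dilation_conv"
    in_ch: int     # input channels (assumed meaning of the first number)
    out_ch: int    # output channels (assumed meaning of the second number)
    dilation: int  # 1 for a plain Conv row, else the rate in parentheses


def conv(in_ch: int, out_ch: int) -> Layer:
    """A 'Conv a-b + BN + ReLU' row of Table 1."""
    return Layer("conv", in_ch, out_ch, 1)


def dil(rate: int, ch: int) -> Layer:
    """A 'Dilation Conv(rate) c-c + BN + ReLU' row of Table 1."""
    return Layer("dilation_conv", ch, ch, rate)


def build_stages() -> List[List[Layer]]:
    """Transcription of Stages 1-7, in the order the table lists them."""
    c = conv
    return [
        [c(3, 32), c(32, 32)],
        [c(32, 64), c(64, 64)],
        [c(64, 96), c(96, 96)],
        [c(96, 160), c(160, 160), c(160, 160), dil(4, 160),
         c(160, 160), dil(4, 160), c(160, 160), dil(4, 160)]
        + [c(160, 160)] * 5 + [dil(4, 160)],
        [c(160, 224), c(224, 224), c(224, 224), dil(2, 224),
         c(224, 224), dil(4, 224), c(224, 224), dil(4, 224)],
        [c(224, 288), c(288, 288), c(288, 288), dil(2, 288),
         c(288, 288), dil(2, 288), c(288, 288), dil(2, 288),
         c(288, 288), c(288, 288)],
        [c(288, 512), c(512, 512), c(512, 512), dil(2, 512)],
    ]


def check_channels(stages: List[List[Layer]], input_channels: int = 3) -> bool:
    """True if every layer's input width equals the previous layer's output width."""
    prev = input_channels
    for stage in stages:
        for layer in stage:
            if layer.in_ch != prev:
                return False
            prev = layer.out_ch
    return True


def summarize(stages: List[List[Layer]]) -> List[Tuple[int, int, int, int]]:
    """Per stage: (stage number, layer count, dilated layer count, output channels)."""
    return [
        (i, len(stage), sum(l.dilation > 1 for l in stage), stage[-1].out_ch)
        for i, stage in enumerate(stages, start=1)
    ]


if __name__ == "__main__":
    stages = build_stages()
    print("channels consistent:", check_channels(stages))
    for row in summarize(stages):
        print("stage %d: %d layers, %d dilated, %d output channels" % row)
```

Under this reading, the table lists 42 convolutional layers, 11 of them dilated, and the channel widths chain without gaps from the 3-channel input to the 512-channel output of Stage 7.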