Research Article
Semisupervised Deep Features of Time-Frequency Maps for Multimodal Emotion Recognition
Table 4
Structure of Inception-v3.
| Type | Patch size/stride (or remarks) | Input size |
| --- | --- | --- |
| Convolution | 3 × 3/2 | 299 × 299 × 3 |
| Convolution | 3 × 3/1 | 149 × 149 × 32 |
| Convolution padded | 3 × 3/1 | 147 × 147 × 32 |
| Maximum pooling | 3 × 3/2 | 147 × 147 × 64 |
| Convolution | 3 × 3/1 | 73 × 73 × 64 |
| Convolution | 3 × 3/2 | 71 × 71 × 80 |
| Convolution | 3 × 3/1 | 35 × 35 × 192 |
| 3 × inception | As in Figure 3(a) | 35 × 35 × 288 |
| 5 × inception | As in Figure 3(b) | 17 × 17 × 768 |
| 2 × inception | As in Figure 3(c) | 8 × 8 × 1280 |
| Maximum pooling | 8 × 8 | 8 × 8 × 2048 |
| Linear | Logits (unnormalized log-probabilities) | 8 × 8 × 2048 |
| Softmax | Classifier | 8 × 8 × |
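The spatial input sizes in the first rows of Table 4 can be checked with the standard output-size formula for convolution and pooling layers, floor((n + 2p − k)/s) + 1. The following is a minimal sketch, not the authors' code: the function names and the `stem` list are hypothetical, and it assumes zero padding for every layer except the one labeled "Convolution padded", which is assumed to use a padding of 1. Only the first six layers are traced, because the table does not state the padding of the layers that follow.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


# (layer type, kernel size, stride, assumed padding) for the first six rows of Table 4.
stem = [
    ("Convolution", 3, 2, 0),
    ("Convolution", 3, 1, 0),
    ("Convolution padded", 3, 1, 1),  # assumption: padding of 1 keeps the size
    ("Maximum pooling", 3, 2, 0),
    ("Convolution", 3, 1, 0),
    ("Convolution", 3, 2, 0),
]


def trace_sizes(input_size, layers):
    """Return the spatial size before the first layer and after each layer."""
    sizes = [input_size]
    for _name, kernel, stride, padding in layers:
        sizes.append(conv_output_size(sizes[-1], kernel, stride, padding))
    return sizes


sizes = trace_sizes(299, stem)
print(sizes)  # [299, 149, 147, 147, 73, 71, 35]
```

Under these assumptions, each computed value equals the spatial input size listed for the next row of the table, from 299 × 299 down to 35 × 35.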