Research Article
Gated Object-Attribute Matching Network for Detailed Image Caption
Table 1
Performance of state-of-the-art image captioning models on the Flickr30k and MSCOCO test splits.
| Model | Flickr30k B@1 | Flickr30k B@2 | Flickr30k B@3 | Flickr30k B@4 | Flickr30k METEOR | Flickr30k CIDEr | MSCOCO B@1 | MSCOCO B@2 | MSCOCO B@3 | MSCOCO B@4 | MSCOCO METEOR | MSCOCO CIDEr |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepV-SAlign [7] | 0.573 | 0.369 | 0.240 | 0.157 | 0.153 | 0.247 | 0.625 | 0.450 | 0.321 | 0.230 | 0.195 | 0.660 |
| Soft-Attention [26] | 0.667 | 0.434 | 0.288 | 0.191 | 0.185 | — | 0.707 | 0.492 | 0.344 | 0.243 | 0.239 | — |
| Hard-Attention [26] | 0.669 | 0.439 | 0.296 | 0.199 | 0.185 | — | 0.718 | 0.504 | 0.357 | 0.250 | 0.230 | — |
| Attribute-FCN [15] | 0.647 | 0.460 | 0.324 | 0.230 | 0.189 | — | 0.709 | 0.537 | 0.402 | 0.304 | 0.243 | — |
| Adaptive-Attention [11] | 0.677 | 0.494 | 0.354 | 0.251 | 0.204 | 0.531 | 0.742 | 0.580 | 0.439 | 0.332 | 0.266 | 1.085 |
| Attribute-CNN + LSTM [14] | — | — | — | — | — | — | — | 0.56 | 0.42 | 0.31 | 0.26 | 0.94 |
| NBT [13] | 0.720 | — | — | 0.285 | 0.231 | 0.575 | 0.759 | — | — | 0.349 | 0.274 | 1.089 |
| Up-Down [12] | — | — | — | — | — | — | 0.802 | 0.641 | 0.491 | 0.369 | 0.276 | 1.179 |
| Ours (Box proposed) | 0.711 | 0.507 | 0.393 | 0.266 | 0.211 | 0.630 | 0.753 | 0.592 | 0.462 | 0.341 | 0.266 | 0.954 |
| Ours (RL) | 0.735 | 0.522 | 0.401 | 0.297 | 0.219 | 0.674 | 0.772 | 0.620 | 0.476 | 0.352 | 0.270 | 1.098 |