Research Article

Gated Object-Attribute Matching Network for Detailed Image Caption

Table 1

Performance of state-of-the-art image captioning models on the Flickr30k and MSCOCO test splits ("--" denotes a score not reported for that model).

Model                     | Flickr30k                               | MSCOCO
                          | B@1    B@2    B@3    B@4    METEOR CIDEr | B@1    B@2    B@3    B@4    METEOR CIDEr
--------------------------+------------------------------------------+------------------------------------------
DeepVS-Align [7]          | 0.573  0.369  0.240  0.157  0.153  0.247 | 0.625  0.450  0.321  0.230  0.195  0.660
Soft-Attention [26]       | 0.667  0.434  0.288  0.191  0.185  --    | 0.707  0.492  0.344  0.243  0.239  --
Hard-Attention [26]       | 0.669  0.439  0.296  0.199  0.185  --    | 0.718  0.504  0.357  0.250  0.230  --
Attribute-FCN [15]        | 0.647  0.460  0.324  0.230  0.189  --    | 0.709  0.537  0.402  0.304  0.243  --
Adaptive-Attention [11]   | 0.677  0.494  0.354  0.251  0.204  0.531 | 0.742  0.580  0.439  0.332  0.266  1.085
Attribute-CNN + LSTM [14] | --     --     --     --     --     --    | --     0.56   0.42   0.31   0.26   0.94
NBT [13]                  | 0.720  --     --     0.285  0.231  0.575 | 0.759  --     --     0.349  0.274  1.089
Up-Down [12]              | --     --     --     --     --     --    | 0.802  0.641  0.491  0.369  0.276  1.179
Ours (Box proposed)       | 0.711  0.507  0.393  0.266  0.211  0.630 | 0.753  0.592  0.462  0.341  0.266  0.954
Ours (RL)                 | 0.735  0.522  0.401  0.297  0.219  0.674 | 0.772  0.620  0.476  0.352  0.270  1.098
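For reference, the B@1-B@4 and CIDEr columns in Table 1 are the scores produced by the standard COCO caption evaluation toolkit. The snippet below is a minimal sketch of how such scores are computed, assuming the `pycocoevalcap` package is installed; the captions are placeholders, not outputs of any model in the table, and the Java-based METEOR scorer is omitted.

```python
# Minimal sketch: scoring candidate captions against references with the
# COCO caption evaluation toolkit (assumes `pycocoevalcap` is installed).
# The captions below are placeholders, not outputs of any model in Table 1.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# Both scorers expect dicts mapping an image id to a list of caption strings
# (already lower-cased and whitespace-tokenized). Each candidate list must
# contain exactly one caption per image.
references = {
    "img_1": ["a man rides a brown horse on the beach",
              "a person riding a horse along the shore"],
    "img_2": ["two dogs play with a red ball in the park"],
}
candidates = {
    "img_1": ["a man riding a horse on the beach"],
    "img_2": ["a dog plays with a ball in the park"],
}

# BLEU returns the four n-gram scores B@1..B@4 (the first four metric
# columns of Table 1 for each dataset).
bleu_scores, _ = Bleu(4).compute_score(references, candidates)
for n, score in enumerate(bleu_scores, start=1):
    print(f"B@{n}: {score:.3f}")

# CIDEr corresponds to the last metric column of Table 1.
cider_score, _ = Cider().compute_score(references, candidates)
print(f"CIDEr: {cider_score:.3f}")
```

In practice the toolkit is run over the full test split (1,000 images for Flickr30k, 5,000 for the MSCOCO Karpathy split), with five reference captions per image, which is the setting the scores in Table 1 correspond to.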