Research Article

Visual-Text Reference Pretraining Model for Image Captioning

Table 2

Performance of published state-of-the-art approaches and our model on the Visual Genome test sets. In dense image captioning, the model receives a single image and generates a set of regions, each annotated with a confidence score and a caption.

Approach         B@4    M      R      C      S
ST [20]          11.1   17.0   34.5   139.9  31.1
UP-DOWN [45]     10.9   16.9   34.5   139.4  31.4
ASG [27]         17.6   22.1   44.7   202.4  40.6
VTR-PTM (ours)   20.5   27.8   45.3   185.9  50.9