Research Article
Visual-Text Reference Pretraining Model for Image Captioning
Table 2
Performance of published state-of-the-art methods and our model on the test sets of Visual Genome. In dense image captioning, the model receives a single image and generates a set of regions, each annotated with a confidence score and a caption.
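
To make the dense captioning output format concrete, the following is a minimal sketch of how one image's predictions could be represented and ranked. The `RegionCaption` container, the `top_regions` helper, the `(x, y, width, height)` box layout, the `[0, 1]` confidence range, and the 0.5 threshold are all illustrative assumptions and do not come from the paper; the example predictions are invented.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RegionCaption:
    """One dense-captioning prediction (hypothetical container, not from the paper)."""
    box: Tuple[float, float, float, float]  # assumed layout: (x, y, width, height) in pixels
    confidence: float                       # assumed to lie in [0, 1]
    caption: str


def top_regions(preds: List[RegionCaption], min_conf: float = 0.5) -> List[RegionCaption]:
    """Keep predictions at or above min_conf, most confident first."""
    kept = [p for p in preds if p.confidence >= min_conf]
    return sorted(kept, key=lambda p: p.confidence, reverse=True)


# Invented predictions for a single image, for illustration only.
example = [
    RegionCaption((10.0, 20.0, 50.0, 40.0), 0.91, "a dog lying on grass"),
    RegionCaption((80.0, 15.0, 30.0, 30.0), 0.34, "a red ball"),
    RegionCaption((0.0, 0.0, 200.0, 60.0), 0.67, "a cloudy sky"),
]
ranked = top_regions(example)
```

Here `ranked` holds the two regions that clear the assumed threshold, ordered by confidence. An actual evaluation on Visual Genome would compare such regions and captions against ground-truth annotations, which this sketch does not attempt.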