Research Article
Visual-Text Reference Pretraining Model for Image Captioning
Figure 4
Visualization of image captions and their corresponding visual regions on MS COCO. We use Faster R-CNN to detect the objects in the images and generate the corresponding keywords. In the predicted captions, the keywords are highlighted in color.