Research Article

Visual-Text Reference Pretraining Model for Image Captioning

Figure 4

Visualization of image captions and the corresponding visual regions on MS COCO. We use Faster R-CNN to detect the objects in each image and generate the corresponding keywords. In the predicted captions, these keywords are highlighted in colored font.
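The caption describes two steps: turning detector outputs into keywords, and marking those keywords in a predicted caption. The sketch below illustrates only that bookkeeping under stated assumptions. It does not run Faster R-CNN. Detections are assumed to arrive as hypothetical (label, score) pairs. The 0.5 confidence threshold, the `Detection` type, and the `**` marker that stands in for colored font are all illustrative choices, not the authors' implementation.

```python
import re
from dataclasses import dataclass


@dataclass
class Detection:
    label: str    # class name from the detector (hypothetical input format)
    score: float  # detector confidence in [0, 1]


def keywords_from_detections(detections, threshold=0.5):
    """Keep unique labels at or above an assumed confidence threshold, in first-seen order."""
    keywords = []
    for det in detections:
        if det.score >= threshold and det.label not in keywords:
            keywords.append(det.label)
    return keywords


def highlight_keywords(caption, keywords, marker="**"):
    """Wrap whole-word keyword matches in `marker` (a stand-in for colored font)."""
    if not keywords:
        return caption
    # Try longer keywords first so multi-word labels win over their sub-words.
    alternation = "|".join(re.escape(k) for k in sorted(keywords, key=len, reverse=True))
    pattern = re.compile(r"\b(?:" + alternation + r")\b", flags=re.IGNORECASE)
    return pattern.sub(lambda m: f"{marker}{m.group(0)}{marker}", caption)


if __name__ == "__main__":
    dets = [Detection("dog", 0.92), Detection("frisbee", 0.81),
            Detection("person", 0.30), Detection("dog", 0.88)]
    kws = keywords_from_detections(dets)
    print(kws)
    print(highlight_keywords("A dog jumps to catch a frisbee.", kws))
```

The regex matching is done in a single pass so that a keyword is never wrapped twice, and the word boundaries keep "dog" from matching inside words such as "dogma".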