Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

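Flickr30k Entities is a dataset for multimodal learning tasks. It augments the Flickr30k image captions with region-to-phrase correspondences, linking phrases in each caption to the image regions (bounding boxes) they describe, to support richer image-to-sentence models.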

BibTeX: