Image Captioning and Visual Question Answering
This dataset is used for image captioning and visual question answering.
Flickr 8k Dataset
The Flickr 8k dataset is a benchmark for image captioning. It contains 8,000 images, each annotated with five human-written captions.
Learning to Evaluate Image Captioning
Evaluation metrics for image captioning face two challenges. First, commonly used metrics such as CIDEr, METEOR, ROUGE, and BLEU often do not correlate well with human judgment.
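The weak correlation with human judgment is easy to see with a toy example. The sketch below implements a simplified BLEU (clipped n-gram precision with add-one smoothing and a brevity penalty); the captions and scores are illustrative assumptions, not from the dataset. A good paraphrase with little word overlap scores far below a near-copy that describes the wrong scene.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, hypothesis, max_n=2):
    """Simplified single-reference BLEU: geometric mean of clipped
    n-gram precisions (add-one smoothed) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        ref_counts = Counter(ngrams(reference, n))
        hyp_counts = Counter(ngrams(hypothesis, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append((overlap + 1) / (total + 1))
    bp = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "a dog catches a frisbee in the park".split()
paraphrase = "a puppy leaps to grab a flying disc".split()  # adequate caption, low n-gram overlap
near_copy = "a dog catches a frisbee in the rain".split()   # wrong scene, high n-gram overlap

print(bleu(ref, paraphrase), bleu(ref, near_copy))
```

A human rater would likely prefer the paraphrase, but the n-gram metric strongly prefers the near-copy, which is the mismatch the entry describes.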
VizioMetrics
VizioMetrics is a dataset of 5,000 bitmap images and 30,000 textual annotations.
Microsoft COCO 2014 and 2017
The Microsoft COCO 2014 and 2017 datasets support object detection, segmentation, and captioning.
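For the captioning task, COCO ships its labels as a JSON file with an `images` list and an `annotations` list joined on `image_id`. The sketch below parses a tiny in-memory fragment in that format using only the standard library; the file name and caption strings are made-up examples, not real COCO records.

```python
import json
from collections import defaultdict

# Minimal fragment mimicking the structure of a COCO captions file
# (e.g. captions_train2017.json); contents are illustrative only.
raw = json.dumps({
    "images": [{"id": 139, "file_name": "000000000139.jpg"}],
    "annotations": [
        {"id": 1, "image_id": 139, "caption": "A room with chairs and a table."},
        {"id": 2, "image_id": 139, "caption": "A small living area with furniture."},
    ],
})

data = json.loads(raw)

# Group the captions by image, as a captioning loader typically would.
captions_by_image = defaultdict(list)
for ann in data["annotations"]:
    captions_by_image[ann["image_id"]].append(ann["caption"])

for img in data["images"]:
    print(img["file_name"], captions_by_image[img["id"]])
```

In the full dataset each image carries around five such annotations, so grouping by `image_id` is the usual first step before training or evaluation.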
Visual Storytelling Dataset (VIST)
The Visual Storytelling Dataset (VIST) consists of 10,117 Flickr albums and 210,819 unique images. Each sample is a sequence of five photos selected from the same album, paired with a human-written story.
TextCaps: A dataset for image captioning with reading comprehension
TextCaps is an image captioning dataset that requires reading comprehension: models must recognize and reason about text appearing in the image to produce a correct caption.
Twitter Alt-Text Dataset
A dataset of 371k images paired with user-written alt-text and the accompanying tweets, scraped from Twitter and used for alt-text generation.
Crisscrossed Captions
Crisscrossed Captions (CxC) is a multimodal dataset used for training and evaluating the MURAL model.
UMIC: An unreferenced metric for image captioning via contrastive learning
UMIC is an unreferenced metric for image captioning: it scores candidate captions without ground-truth references, using a model trained via contrastive learning.
LIUM-CVC Submissions for WMT18 Multimodal Translation Task
Multimodal neural machine translation systems developed by LIUM and CVC for the WMT18 Shared Task on Multimodal Translation.
Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning
Conceptual Captions is a cleaned, hypernymed, image alt-text dataset for automatic image captioning.
Flickr30K-EE
Flickr30K-EE is a benchmark for Explicit Caption Editing (ECE): refining reference image captions through a sequence of explicit edit operations (e.g., KEEP, DELETE).
CrowdCaption Dataset
The CrowdCaption dataset contains 11,161 images with 21,794 group regions and 43,306 group captions; each group has an average of two captions.
Conceptual Captions 12M
The Conceptual Captions 12M (CC-12M) dataset consists of 12 million diverse and high-quality images paired with descriptive captions and titles.
Conceptual Caption 3M
The Conceptual Captions 3M (CC-3M) dataset is a large-scale image captioning dataset of roughly 3.3 million image-caption pairs harvested from web alt-text.