Remote Sensing Image Captioning
The Remote Sensing Image Captioning Dataset (RSICD) and the UCM-Captions dataset are benchmark datasets for remote sensing image captioning.
Conceptual Captions 3.3M
Conceptual Captions 3.3M is a large-scale dataset of roughly 3.3 million images, each paired with a single caption derived from web alt-text.
SBU Captioned Photos
The SBU Captioned Photos (SBU) dataset consists of 1 million images with associated visually relevant captions.
ClipCap: CLIP Prefix for Image Captioning
Image captioning is a fundamental task in vision-language understanding, in which a model produces an informative textual caption for a given input image.
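ClipCap's core idea is to map a CLIP image embedding to a short "prefix" of pseudo-token embeddings that a GPT-2-style decoder then conditions on. A minimal numpy sketch of that mapping step, assuming the common ViT-B/32 (512-d) and GPT-2 (768-d) dimensions; the weights here are random placeholders, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

CLIP_DIM, GPT2_DIM, PREFIX_LEN = 512, 768, 10

# Random placeholder weights for a small two-layer mapping network.
W1 = rng.standard_normal((CLIP_DIM, 1024)) * 0.02
b1 = np.zeros(1024)
W2 = rng.standard_normal((1024, PREFIX_LEN * GPT2_DIM)) * 0.02
b2 = np.zeros(PREFIX_LEN * GPT2_DIM)

def clip_to_prefix(clip_embedding: np.ndarray) -> np.ndarray:
    """Map one CLIP embedding (512,) to a prefix (PREFIX_LEN, GPT2_DIM)."""
    h = np.tanh(clip_embedding @ W1 + b1)          # hidden layer
    prefix = h @ W2 + b2                           # flat prefix vector
    return prefix.reshape(PREFIX_LEN, GPT2_DIM)    # one row per pseudo-token

prefix = clip_to_prefix(rng.standard_normal(CLIP_DIM))
print(prefix.shape)  # (10, 768)
```

In the actual method, these prefix rows are concatenated in front of the caption token embeddings and the language model is trained (or kept frozen, with only the mapper trained) to continue from them.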
Good News, everyone!
GoodNews is a news image captioning dataset, used in the paper to evaluate the effectiveness of context-driven, entity-aware captioning.
Winoground
The Winoground dataset consists of 400 items, each containing two image-caption pairs (I0, C0) and (I1, C1); the two captions contain the same words in a different order, and a model must match each caption to its correct image.
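Winoground scores a model per item with three criteria: a text score (given each image, the correct caption scores higher), an image score (given each caption, the correct image scores higher), and a group score (both hold). A small sketch, taking a 2x2 similarity table `sim[image][caption]` from any image-text model:

```python
# sim[i][c] is the model's similarity between image Ii and caption Cc.
def winoground_scores(sim):
    text = sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]   # per image, right caption wins
    image = sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]  # per caption, right image wins
    group = text and image
    return text, image, group

# A model that matches both pairs correctly:
print(winoground_scores([[0.9, 0.2], [0.1, 0.8]]))  # (True, True, True)
# A model fooled by the swapped word order:
print(winoground_scores([[0.5, 0.6], [0.7, 0.4]]))  # (False, False, False)
```

Dataset-level accuracy is simply the fraction of the 400 items passing each criterion.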
Twitter Alt-Text Dataset
A dataset of 371k images paired with alt-text and the accompanying tweets, scraped from Twitter and used for alt-text generation.
CLIP-Diffusion-LM: Apply Diffusion Model on Image Captioning
Image captioning has been extensively studied in prior work; however, few experiments focus on generating captions with a non-autoregressive text decoder....
Crisscrossed Captions
The Crisscrossed Captions (CxC) dataset extends MS-COCO with graded human similarity judgments between images and captions, and is used in the evaluation of the MURAL model.
UMIC: An unreferenced metric for image captioning via contrastive learning
An unreferenced image captioning metric, trained via contrastive learning, that scores candidate captions without requiring reference captions.
CC12M dataset
The CC12M dataset is used for training and testing the proposed method. It contains 12 million image-text pairs, with captions harvested from web alt-text.
Flickr8K-Expert dataset
The Flickr8K-Expert dataset is used for evaluating the proposed method. It provides expert human judgments of caption quality for image-caption pairs drawn from the 8,000-image Flickr8K dataset.
Composite Dataset
The Composite dataset contains 11,985 human judgments over Flickr 8K, Flickr 30K, and COCO captions.
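Judgment collections like Composite and Flickr8K-Expert are typically used to validate a caption metric by rank-correlating its scores with the human ratings. A minimal Kendall tau-a sketch over hypothetical data (the ratings below are illustrative, not from any dataset):

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        prod = (x1 - x2) * (y1 - y2)
        if prod > 0:
            concordant += 1   # pair ranked the same way by both
        elif prod < 0:
            discordant += 1   # pair ranked oppositely
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

human = [1, 2, 3, 4, 4]              # hypothetical human ratings
metric = [0.1, 0.4, 0.3, 0.8, 0.9]   # hypothetical metric scores
print(round(kendall_tau(human, metric), 3))  # 0.7
```

A higher tau means the metric orders captions more like the human annotators do; published metric papers report exactly this kind of correlation on these judgment sets.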
Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning
A dataset of images paired with captions derived from cleaned, hypernymed web alt-text, built for automatic image captioning.
Flickr30K Entities
The Flickr30K Entities dataset consists of 31,783 images, each matched with 5 captions. The dataset links distinct sentence entities to image bounding boxes, resulting in 70K...
Microsoft COCO: common objects in context
The COCO dataset is a large-scale dataset for object detection, segmentation, and image captioning; each image is paired with five human-written captions.
Flickr30K-EE
Explicit Caption Editing (ECE), refining reference image captions through a sequence of explicit edit operations (e.g., KEEP, DELETE), has attracted significant attention due to...
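The editing idea above can be sketched as applying one predicted operation per reference token. This toy example uses only the KEEP/DELETE operations the entry names; real ECE systems also predict insertion-style operations, which are omitted here, and the caption and operation sequence below are made up for illustration:

```python
def apply_edits(tokens, ops):
    """Apply one explicit edit operation per reference token."""
    assert len(tokens) == len(ops), "one operation per reference token"
    # KEEP copies the token into the output; DELETE drops it.
    return [tok for tok, op in zip(tokens, ops) if op == "KEEP"]

ref = ["a", "small", "dog", "on", "the", "red", "sofa"]
ops = ["KEEP", "DELETE", "KEEP", "KEEP", "KEEP", "DELETE", "KEEP"]
print(" ".join(apply_edits(ref, ops)))  # a dog on the sofa
```

Because the model outputs an edit trace rather than a caption from scratch, the result stays close to the reference and each change is directly interpretable.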
CrowdCaption Dataset
The CrowdCaption dataset contains 11,161 images with 21,794 group regions and 43,306 group captions; each group has an average of two captions.