Image Captioning and Visual Question Answering
This dataset is used for image captioning and visual question answering.
Flickr 8k Dataset
The Flickr 8k dataset is a benchmark for image captioning. It contains 8,000 images, each annotated with five human-written captions.
Learning to Evaluate Image Captioning
Evaluation metrics for image captioning face two challenges. First, commonly used metrics such as CIDEr, METEOR, ROUGE, and BLEU often do not correlate well with human judgment.
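The weak correlation with human judgment is easy to see with a toy example. The sketch below implements a simplified BLEU (clipped n-gram precision with add-one smoothing and a brevity penalty); the captions and scores are illustrative assumptions, not from the dataset. A good paraphrase with little word overlap scores far below a near-copy that describes the wrong scene.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, hypothesis, max_n=2):
    """Simplified single-reference BLEU: geometric mean of clipped
    n-gram precisions (add-one smoothed) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        ref_counts = Counter(ngrams(reference, n))
        hyp_counts = Counter(ngrams(hypothesis, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append((overlap + 1) / (total + 1))
    bp = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "a dog catches a frisbee in the park".split()
paraphrase = "a puppy leaps to grab a flying disc".split()  # adequate caption, low n-gram overlap
near_copy = "a dog catches a frisbee in the rain".split()   # wrong scene, high n-gram overlap

print(bleu(ref, paraphrase), bleu(ref, near_copy))
```

A human rater would likely prefer the paraphrase, but the n-gram metric strongly prefers the near-copy, which is the mismatch the entry describes.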
VizioMetrics
VizioMetrics is a dataset of 5,000 bitmap images and 30,000 textual annotations.
Microsoft COCO 2014 and 2017
The Microsoft COCO 2014 and 2017 datasets support object detection, segmentation, and captioning.
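For the captioning task, COCO ships its labels as a JSON file with an `images` list and an `annotations` list joined on `image_id`. The sketch below parses a tiny in-memory fragment in that format using only the standard library; the file name and caption strings are made-up examples, not real COCO records.

```python
import json
from collections import defaultdict

# Minimal fragment mimicking the structure of a COCO captions file
# (e.g. captions_train2017.json); contents are illustrative only.
raw = json.dumps({
    "images": [{"id": 139, "file_name": "000000000139.jpg"}],
    "annotations": [
        {"id": 1, "image_id": 139, "caption": "A room with chairs and a table."},
        {"id": 2, "image_id": 139, "caption": "A small living area with furniture."},
    ],
})

data = json.loads(raw)

# Group the captions by image, as a captioning loader typically would.
captions_by_image = defaultdict(list)
for ann in data["annotations"]:
    captions_by_image[ann["image_id"]].append(ann["caption"])

for img in data["images"]:
    print(img["file_name"], captions_by_image[img["id"]])
```

In the full dataset each image carries around five such annotations, so grouping by `image_id` is the usual first step before training or evaluation.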
Visual Storytelling Dataset (VIST)
The Visual Storytelling Dataset (VIST) consists of 10,117 Flickr albums and 210,819 unique images. Each sample is a sequence of five photos selected from the same album, paired with a human-written story.
TextCaps: A dataset for image captioning with reading comprehension
TextCaps is an image captioning dataset that requires reading comprehension: models must recognize and reason about text appearing in the image to produce a correct caption.
Twitter Alt-Text Dataset
A dataset of 371k images paired with user-written alt-text and the accompanying tweets, scraped from Twitter and used for alt-text generation.
Crisscrossed Captions
Crisscrossed Captions (CxC) is a multimodal dataset used for training and evaluating the MURAL model.
UMIC: An unreferenced metric for image captioning via contrastive learning
UMIC is an unreferenced metric for image captioning: it scores candidate captions without ground-truth references, using a model trained via contrastive learning.
LIUM-CVC Submissions for WMT18 Multimodal Translation Task
Multimodal neural machine translation systems developed by LIUM and CVC for the WMT18 Shared Task on Multimodal Translation.
Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning
Conceptual Captions is a cleaned, hypernymed, image alt-text dataset for automatic image captioning.
Flickr30K-EE
Flickr30K-EE is a benchmark for Explicit Caption Editing (ECE): refining reference image captions through a sequence of explicit edit operations (e.g., KEEP, DELETE).
CrowdCaption Dataset
The CrowdCaption dataset contains 11,161 images with 21,794 group regions and 43,306 group captions; each group has an average of two captions.
Conceptual Captions 12M
The Conceptual Captions 12M (CC-12M) dataset consists of 12 million diverse and high-quality images paired with descriptive captions and titles.
Conceptual Caption 3M
The Conceptual Captions 3M (CC-3M) dataset is a large-scale image captioning dataset of roughly 3.3 million image-caption pairs harvested from web alt-text.