Image Captioning Task
The dataset used in the paper targets the image captioning task.
High Quality Image-Text Pairs (HQITP)
The High Quality Image-Text Pairs (HQITP) dataset contains 134M high-quality image-caption pairs.
ClipCap: CLIP Prefix for Image Captioning
Image captioning is a fundamental task in vision-language understanding, in which a model predicts an informative textual caption for a given input image.
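The ClipCap approach itself maps a CLIP image embedding to a prefix of token embeddings that conditions a GPT-2 decoder. Below is a minimal sketch of that prefix idea using Hugging Face checkpoints; the single linear mapper is untrained here and stands in for the mapping network that ClipCap trains on caption data, and the checkpoint names and decoding settings are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer

# Sketch of the ClipCap prefix idea: CLIP image features -> prefix embeddings -> GPT-2.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

prefix_length = 10
# ClipCap trains this mapping network; a single Linear layer is a stand-in here.
mapper = nn.Linear(clip.config.projection_dim, prefix_length * gpt2.config.n_embd)

def caption(image):
    pixel = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**pixel)           # (1, proj_dim)
        prefix = mapper(feats).view(1, prefix_length, -1)  # (1, L, n_embd)
        # Greedy decoding conditioned only on the prefix embeddings
        # (generate() accepts inputs_embeds in recent transformers versions).
        out = gpt2.generate(inputs_embeds=prefix, max_new_tokens=20)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```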
Good News, everyone!
The dataset is used in the paper to evaluate the effectiveness of context-driven entity-aware captioning.
Winoground
The Winoground dataset consists of 400 items, each containing two image-caption pairs (I0, C0) and (I1, C1); both captions use the same words in a different order, so matching each caption to its image requires compositional visio-linguistic reasoning.
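The standard Winoground metrics score a model by whether it matches each caption to its own image within an item. A minimal sketch, assuming a model-supplied similarity function sim(image, caption):

```python
def winoground_scores(items, sim):
    """items: list of (I0, C0, I1, C1) tuples; sim(image, caption) -> float."""
    text = image = group = 0
    for i0, c0, i1, c1 in items:
        # Text score: each image prefers its own caption.
        t = sim(i0, c0) > sim(i0, c1) and sim(i1, c1) > sim(i1, c0)
        # Image score: each caption prefers its own image.
        v = sim(i0, c0) > sim(i1, c0) and sim(i1, c1) > sim(i0, c1)
        text += t
        image += v
        group += t and v  # Group score: both directions must be correct.
    n = len(items)
    return text / n, image / n, group / n
```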
CC12M dataset
The CC12M dataset is used for training and testing the proposed method. It contains 12 million image-caption pairs.
Flickr8K-Expert dataset
The Flickr8K-Expert dataset is used for evaluating the proposed method. It provides expert human judgments of caption quality for the 8,000 images of Flickr8K.
Concept Conjunction 500 (CC-500)
The Concept Conjunction 500 (CC-500) dataset is a benchmark for text-to-image synthesis, consisting of 500 text prompts that each conjoin two concepts, used to test whether generated images bind each attribute to the correct object.
Attribute Binding Contrast (ABC-6K)
The Attribute Binding Contrast (ABC-6K) dataset is a benchmark for text-to-image synthesis, consisting of 6,000 natural text prompts that each mention multiple attribute-object pairs, arranged in contrastive pairs with swapped attribute modifiers.
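Both CC-500 and ABC-6K are prompt-only benchmarks: quality is judged on images synthesized from the prompts. A minimal sketch of running CC-500-style conjunction prompts through a public text-to-image pipeline; the prompt template and model checkpoint are illustrative assumptions, not the benchmark's actual prompt list.

```python
import torch
from diffusers import StableDiffusionPipeline

# Illustrative conjunction prompts in the CC-500 style
# ("a <color> <object> and a <color> <object>"); the real prompt list differs.
colors = ["red", "green", "blue"]
objects = ["car", "sheep", "backpack"]
combos = [(c, o) for c in colors for o in objects]
prompts = [f"a {c1} {o1} and a {c2} {o2}"
           for (c1, o1) in combos for (c2, o2) in combos if o1 != o2]

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

for prompt in prompts[:3]:
    image = pipe(prompt).images[0]  # one sample per prompt
    image.save(prompt.replace(" ", "_") + ".png")
```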
Composite Dataset
The Composite dataset contains 11,985 human judgments over Flickr8K, Flickr30K, and COCO captions.
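Judgment sets like this are typically used to validate captioning metrics by correlating metric scores with the human ratings. A minimal sketch with scipy; the score lists are hypothetical placeholders for aligned (metric, human) scores over the same image-caption pairs:

```python
from scipy.stats import kendalltau, spearmanr

# Hypothetical aligned lists: one metric score and one human judgment
# per (image, candidate caption) pair.
metric_scores = [0.71, 0.42, 0.88, 0.15]
human_scores = [4.0, 2.5, 5.0, 1.0]

tau, tau_p = kendalltau(metric_scores, human_scores)
rho, rho_p = spearmanr(metric_scores, human_scores)
print(f"Kendall tau: {tau:.3f} (p={tau_p:.3g}), Spearman rho: {rho:.3f}")
```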
Flickr30K Entities
The Flickr30K Entities dataset consists of 31,783 images, each paired with 5 captions. The dataset links distinct sentence entities to image bounding boxes, resulting in 70K...
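The entity-to-box links can be pictured as a simple record structure. The layout below is hypothetical and for illustration only; the released dataset actually ships annotated sentence files plus separate per-image box annotations.

```python
from dataclasses import dataclass

# Hypothetical in-memory layout for Flickr30K Entities records.
@dataclass
class EntityMention:
    phrase: str       # e.g. "a man in a red shirt"
    entity_id: int    # coreference chain id shared across captions
    boxes: list       # [(x_min, y_min, x_max, y_max), ...] in pixels

@dataclass
class CaptionedImage:
    image_file: str
    captions: list    # 5 caption strings
    mentions: list    # EntityMention objects grounded in the image
```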
Microsoft COCO: common objects in context
The COCO dataset is a large-scale dataset for object detection, segmentation, and image captioning.
LAION-Improved-Aesthetics (v1.2)
The LAION-Improved-Aesthetics (v1.2) dataset was used to train the Stable Diffusion model; it consists of image-caption pairs filtered for predicted aesthetic quality.
Conceptual Captions 12M
The Conceptual Captions 12M (CC-12M) dataset consists of 12 million diverse and high-quality images paired with descriptive captions and titles.
COCO-captions dataset
The COCO-captions dataset contains ∼120k RGB images with text captions.
MS COCO dataset
The MS COCO dataset is a large benchmark for image captioning, containing 328K images, each with five reference captions.
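The reference captions can be read with the standard pycocotools API. A minimal sketch, assuming the COCO 2017 captions annotation file has been downloaded to annotations/ (the path is an assumption; adjust locally):

```python
from pycocotools.coco import COCO

# Load the captions annotation file.
coco = COCO("annotations/captions_val2017.json")

img_id = coco.getImgIds()[0]             # pick the first image
ann_ids = coco.getAnnIds(imgIds=img_id)  # its caption annotation ids
for ann in coco.loadAnns(ann_ids):       # typically five captions
    print(ann["caption"])
```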