Dataset - LDM

COCOQA

A set of sequential vision-and-language tasks, where each task consists of an image and a text input.
- Dataset
- JSON
National Diet Library Dataset

A dataset containing 10,000 digitally archived images from various genres.
- Dataset
- JSON
Image–Text Pair Dataset from Books

A dataset constructed from book images using an optical character reader (OCR), an object detector, and a layout analyzer for the autonomous extraction of image–text pairs.
- Dataset
- JSON
High Quality Image-Text Pairs (HQITP)

The High Quality Image-Text Pairs (HQITP) dataset contains 134M high-quality image-caption pairs.
- Dataset
- JSON
ZeroVL dataset

The dataset used for training the ZeroVL model, consisting of 14.23M image-text pairs from various domains.
- Dataset
- JSON
MARIO-OpenLibrary

The MARIO-OpenLibrary dataset is a subset of the LAION-400M dataset, containing 523,684 book covers with corresponding titles.
- Dataset
- JSON
MARIO-TMDB

The MARIO-TMDB dataset is a subset of the LAION-400M dataset, containing 343,423 English posters with corresponding titles.
- Dataset
- JSON
MARIO-LAION

The MARIO-LAION dataset is a subset of the LAION-400M dataset, containing 9,194,613 high-quality text images with corresponding captions.
- Dataset
- JSON
MARIO-10M

The MARIO-10M dataset is a collection of about 10 million high-quality and diverse image-text pairs from various data sources such as natural images, posters, and book covers.
- Dataset
- JSON
CAD

The CAD dataset is a dataset for photorealistic 3D generation conditioned on a single image and a text prompt.
- Dataset
- JSON
CLIP-S

CLIP-S is a dataset for bimodal contrastive learning.
- Dataset
- JSON
RS5M

A large-scale dataset containing 5 million remote sensing (RS) images with English descriptions, built by filtering image-text pair datasets and generating captions for RS images.
- Dataset
- JSON
R2R

Room-to-Room (R2R) is a dataset for vision-and-language navigation tasks.
- Dataset
- JSON
Multimodal Learning (MLM) dataset

The MLM dataset is a collection of images and captions that represent different cultures from around the world.
- Dataset
- JSON
RAMM: Retrieval-augmented Biomedical Visual Question Answering

A retrieval-augmented pretrain-and-finetune paradigm for biomedical VQA, which includes a high-quality image-text pair dataset (PMCPM), a pre-trained multi-modal model, and a novel...
- Dataset
- JSON
General-context dataset

General-context dataset containing diverse image-text pairs (top three rows), and DVP presented images with targeted translation of the RoI (bottom two rows).
- Dataset
- JSON
Laion-20M

The dataset used for pre-training the MS-CLIP model, consisting of 20 million image-text pairs filtered from LAION-400M.
- Dataset
- JSON
LAION-Face

The LAION-Face dataset consists of 50 million image-text pairs to ensure diversity.
- Dataset
- JSON
Visual Spatial Reasoning

Visual Spatial Reasoning (VSR) is a controlled probing dataset for testing vision-language models' capabilities of recognizing and reasoning about spatial relations in natural...
- Dataset
- JSON
W200M

A large-scale, web-sourced image-text pair dataset.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

50 datasets found