MSCOCO caption challenge dataset
The MSCOCO caption challenge dataset is a subset of the MSCOCO caption dataset, containing 113,287 training images, 5,000 validation images, and 5,000 test images.
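These numbers correspond to the commonly used "Karpathy" split, in which the leftover validation images are folded into training. Below is a minimal sketch of recovering the split sizes; the `dataset_coco.json` file name and the `images`/`split`/`restval` field names are assumptions, not something stated in this entry.

```python
import json
from collections import Counter

# Minimal sketch: count images per split in a Karpathy-style split file.
# File name and field names ("images", "split", "restval") are assumptions.
with open("dataset_coco.json") as f:
    split_file = json.load(f)

counts = Counter(img["split"] for img in split_file["images"])

# "restval" images are conventionally merged into training, which is how the
# 113,287 / 5,000 / 5,000 partition arises.
print("train:", counts["train"] + counts.get("restval", 0))
print("val:  ", counts["val"])
print("test: ", counts["test"])
```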
MSCOCO caption dataset
The MSCOCO caption dataset is a large-scale image captioning dataset. It consists of 123,000 images with 5 captions each.
Show-and-Tell
Visual language grounding is widely studied with modern neural networks, which typically adopt an encoder-decoder framework consisting of a convolutional neural network (CNN) for...
Augmented Flickr-8K Dataset
A dataset of images annotated with captions and semantic tuples, created by training a model to predict semantic tuples from image captions.
Flickr30K and MSCOCO
The datasets used in the paper are Flickr30K and MSCOCO, which are used for image-text matching and image captioning tasks.
Flickr 30k Dataset
The Flickr 30k dataset is a large-scale image captioning dataset containing roughly 31,000 images, each annotated with five captions.
MSCOCO 2014 Captions Dataset
The MSCOCO 2014 captions dataset contains 123,287 images, split into an 82,783-image training set and a 40,504-image validation set. Each image is labeled with five...
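As a concrete illustration, the 2014 caption annotations can be read with the `pycocotools` COCO API; a minimal sketch, assuming the official `captions_train2014.json` annotation file has been downloaded to a local `annotations/` directory:

```python
from pycocotools.coco import COCO  # pip install pycocotools

# Assumed local path to the official 2014 training caption annotations.
coco = COCO("annotations/captions_train2014.json")

img_ids = coco.getImgIds()
print(len(img_ids))  # 82,783 images in the 2014 training split

# Each image carries (typically five) caption annotations.
ann_ids = coco.getAnnIds(imgIds=img_ids[0])
for ann in coco.loadAnns(ann_ids):
    print(ann["caption"])
```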
High Quality Image Text Pairs
The High Quality Image Text Pairs (HQITP-134M) dataset consists of 134 million diverse and high-quality images paired with descriptive captions and titles.
COCO dataset (Brazilian Portuguese)
A Brazilian Portuguese translation of the COCO dataset, used to train the Brazilian Portuguese version of the GRIT model.
Visual Storytelling Dataset (VIST)
The Visual Storytelling Dataset (VIST) consists of 10,117 Flickr albums and 210,819 unique images. Each sample is one sequence of 5 photos selected from the same album paired...
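To make the sample structure concrete, here is a minimal sketch of one way such a sample could be represented; the class and field names below are illustrative assumptions, not the official SIS annotation schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VISTStory:
    """One visual-storytelling sample: five photos from one album plus a story."""
    album_id: str
    photo_ids: List[str]   # 5 photo identifiers, in narrative order
    sentences: List[str]   # one story sentence per photo

    def __post_init__(self):
        # A sample pairs exactly five photos with five sentences.
        assert len(self.photo_ids) == 5 and len(self.sentences) == 5
```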
Semantic Communication Dataset
The dataset used in this paper for semantic communication, consisting of images and their corresponding captions.
Pascal Flickr dataset
The Pascal Flickr dataset is a collection of captions for images from Flickr.
Image Captioning Task
The dataset used in the paper is for an image captioning task.
High Quality Image-Text Pairs (HQITP)
The High Quality Image-Text Pairs (HQITP) dataset contains 134M high-quality image-caption pairs.
ClipCap: CLIP Prefix for Image Captioning
Image captioning is a fundamental task in vision-language understanding, where the model predicts an informative textual caption for a given input image.
Good News, everyone!
The dataset is used in the paper to evaluate the effectiveness of context-driven, entity-aware captioning.
Winoground
The Winoground dataset consists of 400 items, each containing two image-caption pairs, (I₀, C₀) and (I₁, C₁).
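A minimal sketch of that item structure and the swapped-pair check it enables; the class, the field names, and the `score` similarity function are illustrative assumptions rather than part of the dataset release.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class WinogroundItem:
    """One Winoground item: two images and two captions built from the same words."""
    image_0: str    # path or identifier for I0
    caption_0: str  # C0
    image_1: str    # path or identifier for I1
    caption_1: str  # C1

def correctly_paired(item: WinogroundItem,
                     score: Callable[[str, str], float]) -> bool:
    """True if a user-supplied image-text similarity prefers each caption's
    own image over the swapped pairing."""
    return (score(item.image_0, item.caption_0) > score(item.image_0, item.caption_1)
            and score(item.image_1, item.caption_1) > score(item.image_1, item.caption_0))
```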