MSCOCO caption challenge dataset
The MSCOCO caption challenge dataset is a subset of the MSCOCO caption dataset, containing 113,287 training images, 5,000 validation images, and 5,000 test images.
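These numbers correspond to the commonly used "Karpathy" split, in which the leftover validation images are folded into training. Below is a minimal sketch of recovering the split sizes; the `dataset_coco.json` file name and the `images`/`split`/`restval` field names are assumptions, not something stated in this entry.

```python
import json
from collections import Counter

# Minimal sketch: count images per split in a Karpathy-style split file.
# File name and field names ("images", "split", "restval") are assumptions.
with open("dataset_coco.json") as f:
    split_file = json.load(f)

counts = Counter(img["split"] for img in split_file["images"])

# "restval" images are conventionally merged into training, which is how the
# 113,287 / 5,000 / 5,000 partition arises.
print("train:", counts["train"] + counts.get("restval", 0))
print("val:  ", counts["val"])
print("test: ", counts["test"])
```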
MSCOCO caption dataset
The MSCOCO caption dataset is a large-scale image captioning dataset. It consists of 123,000 images with 5 captions each.
Show-and-Tell
Visual language grounding is widely studied with modern neural networks, which typically adopt an encoder-decoder framework consisting of a convolutional neural network (CNN) for...
Augmented Flickr-8K Dataset
A dataset of images annotated with captions and semantic tuples, created by training a model to predict semantic tuples from image captions.
Flickr30K and MSCOCO
The datasets used in the paper are Flickr30K and MSCOCO, which are used for image-text matching and image captioning tasks.
Flickr 30k Dataset
The Flickr 30k dataset is a large-scale image captioning dataset containing roughly 31,000 images, each annotated with five captions.
MSCOCO 2014 Captions Dataset
The MSCOCO 2014 captions dataset contains 123,287 images, split into an 82,783-image training set and a 40,504-image validation set. Each image is labeled with five...
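As a concrete illustration, the 2014 caption annotations can be read with the `pycocotools` COCO API; a minimal sketch, assuming the official `captions_train2014.json` annotation file has been downloaded to a local `annotations/` directory:

```python
from pycocotools.coco import COCO  # pip install pycocotools

# Assumed local path to the official 2014 training caption annotations.
coco = COCO("annotations/captions_train2014.json")

img_ids = coco.getImgIds()
print(len(img_ids))  # 82,783 images in the 2014 training split

# Each image carries (typically five) caption annotations.
ann_ids = coco.getAnnIds(imgIds=img_ids[0])
for ann in coco.loadAnns(ann_ids):
    print(ann["caption"])
```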
High Quality Image Text Pairs
The High Quality Image Text Pairs (HQITP-134M) dataset consists of 134 million diverse and high-quality images paired with descriptive captions and titles.
COCO dataset (Brazilian Portuguese)
A Brazilian Portuguese translation of the COCO dataset, used to train the Brazilian Portuguese version of the GRIT model.
Visual Storytelling Dataset (VIST)
The Visual Storytelling Dataset (VIST) consists of 10,117 Flickr albums and 210,819 unique images. Each sample is one sequence of 5 photos selected from the same album paired...
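To make the sample structure concrete, here is a minimal sketch of one way such a sample could be represented; the class and field names below are illustrative assumptions, not the official SIS annotation schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VISTStory:
    """One visual-storytelling sample: five photos from one album plus a story."""
    album_id: str
    photo_ids: List[str]   # 5 photo identifiers, in narrative order
    sentences: List[str]   # one story sentence per photo

    def __post_init__(self):
        # A sample pairs exactly five photos with five sentences.
        assert len(self.photo_ids) == 5 and len(self.sentences) == 5
```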
Semantic Communication Dataset
The dataset used in this paper for semantic communication, consisting of images and their corresponding captions.
Pascal Flickr dataset
The Pascal Flickr dataset is a collection of captions for images from Flickr.
Image Captioning Task
The dataset used in the paper is for an image captioning task.
High Quality Image-Text Pairs (HQITP)
The High Quality Image-Text Pairs (HQITP) dataset contains 134M high-quality image-caption pairs.
ClipCap: CLIP Prefix for Image Captioning
Image captioning is a fundamental task in vision-language understanding, where the model predicts an informative textual caption for a given input image.
Good News, everyone!
The dataset is used in the paper to evaluate the effectiveness of context-driven, entity-aware captioning.
Winoground
The Winoground dataset consists of 400 items, each containing two image-caption pairs, (I₀, C₀) and (I₁, C₁).
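A minimal sketch of that item structure and the swapped-pair check it enables; the class, the field names, and the `score` similarity function are illustrative assumptions rather than part of the dataset release.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class WinogroundItem:
    """One Winoground item: two images and two captions built from the same words."""
    image_0: str    # path or identifier for I0
    caption_0: str  # C0
    image_1: str    # path or identifier for I1
    caption_1: str  # C1

def correctly_paired(item: WinogroundItem,
                     score: Callable[[str, str], float]) -> bool:
    """True if a user-supplied image-text similarity prefers each caption's
    own image over the swapped pairing."""
    return (score(item.image_0, item.caption_0) > score(item.image_0, item.caption_1)
            and score(item.image_1, item.caption_1) > score(item.image_1, item.caption_0))
```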