50 datasets found

Tags: image-text pairs

  • COCOQA

    A set of sequential vision-and-language tasks, where each task consists of an image and a text input.
  • National Diet Library Dataset

    A dataset containing 10,000 digitally archived images from various genres.
  • Image–Text Pair Dataset from Books

    A dataset constructed from book images using an optical character reader (OCR), an object detector, and a layout analyzer to automatically extract image–text pairs.
  • High Quality Image-Text Pairs (HQITP)

    The HQITP dataset contains 134M high-quality image-caption pairs.
  • ZeroVL dataset

    The dataset used for training the ZeroVL model, consisting of 14.23M image-text pairs from various domains.
  • MARIO-OpenLibrary

    The MARIO-OpenLibrary dataset is a subset of MARIO-10M collected from Open Library, containing 523,684 book covers with corresponding titles.
  • MARIO-TMDB

    The MARIO-TMDB dataset is a subset of MARIO-10M collected from The Movie Database (TMDB), containing 343,423 English posters with corresponding titles.
  • MARIO-LAION

    The MARIO-LAION dataset is a subset of the LAION-400M dataset, containing 9,194,613 high-quality text images with corresponding captions.
  • MARIO-10M

    The MARIO-10M dataset is a collection of about 10 million high-quality and diverse image-text pairs from various data sources such as natural images, posters, and book covers.
  • CAD

    A dataset for photorealistic 3D generation conditioned on a single image and a text prompt.
  • CLIP-S

    CLIP-S is a dataset for bimodal contrastive learning.
  • RS5M

    A large-scale dataset of 5 million remote sensing (RS) images with English descriptions, built by filtering existing image-text pair datasets and by generating captions for RS images.
  • R2R

    Room-to-Room (R2R) is a dataset for vision-and-language navigation tasks.
  • Multimodal Learning (MLM) dataset

    The MLM dataset is a collection of images and captions that represent different cultures from around the world.
  • RAMM: Retrieval-augmented Biomedical Visual Question Answering

    A retrieval-augmented pretrain-and-finetune paradigm for biomedical VQA, which includes a high-quality image-text pair dataset (PMCPM), a pre-trained multi-modal model, and a novel...
  • General-context dataset

    A general-context dataset containing diverse image-text pairs, alongside DVP-presented images with targeted translation of the region of interest (RoI).
  • Laion-20M

    The dataset used for pre-training the MS-CLIP model, consisting of 20 million image-text pairs filtered from LAION-400M.
  • LAION-Face

    The LAION-Face dataset consists of 50 million image-text pairs, selected to ensure diversity.
  • Visual Spatial Reasoning

    Visual Spatial Reasoning (VSR) is a controlled probing dataset for testing vision-language models' capabilities of recognizing and reasoning about spatial relations in natural...
  • W200M

    A large-scale, web-sourced image-text pair dataset.
You can also access this registry using the API (see API Docs).
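As a rough illustration, a tag-filtered query against the registry might look like the Python sketch below. The base URL, query parameters (`tags`, `page`), and response fields (`datasets`, `name`, `description`) are hypothetical placeholders, not the registry's documented interface; consult the API Docs for the actual endpoints and authentication scheme.

```python
import requests

# Hypothetical base URL -- replace with the endpoint given in the API Docs.
BASE_URL = "https://example.org/api/datasets"

def search_datasets(tag: str, page: int = 1) -> dict:
    """Fetch one page of datasets carrying the given tag (assumed schema)."""
    response = requests.get(BASE_URL, params={"tags": tag, "page": page})
    response.raise_for_status()  # surface HTTP errors early
    return response.json()

if __name__ == "__main__":
    results = search_datasets("image-text pairs")
    # "datasets", "name", and "description" are assumed response fields.
    for entry in results.get("datasets", []):
        print(f"{entry.get('name')}: {entry.get('description', '')[:80]}")
```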