10 datasets found

Tags: vision-language

  • Chinese CLIP

    Chinese CLIP is a vision-language pre-training dataset consisting of 100 million image-text pairs.
  • BLIP2

    BLIP2 is a vision-language pre-training dataset consisting of 100 million image-text pairs.
  • ALADIN

    ALADIN is a custom dataset created for the ALADIN paper.
  • WebLI Dataset

    The WebLI dataset is used for training and evaluating the CoBIT model.
  • JFT-4B Dataset

    The JFT-4B dataset is used for training and evaluating the CoBIT model.
  • ALIGN Dataset

    The ALIGN dataset is used for training and evaluating the CoBIT model.
  • CoBIT Dataset

    The dataset used for training and evaluating the CoBIT model; it consists of image-text pairs drawn from large-scale noisy web-crawled data and from image-annotation data.
  • VQAv2

    Visual Question Answering (VQA) has achieved great success thanks to the rapid development of deep neural networks (DNNs). On the other hand, data augmentation, as one of the...
  • MixGen: A New Multi-Modal Data Augmentation

    MixGen is a joint data augmentation method for vision-language representation learning that further improves data efficiency; a minimal sketch follows the list below.
  • MS-COCO

    Large-scale datasets [18, 17, 27, 6] have boosted the quality of text-conditional image generation. However, in some domains it can be difficult to build such datasets, and usually it can...
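
The MixGen entry above names a concrete augmentation technique. Below is a minimal sketch of the commonly described MixGen formulation, which builds a new training pair by linearly interpolating two images and concatenating their captions. The function name, the lambda value, and the toy inputs are illustrative assumptions, not taken from this registry.

    import numpy as np

    def mixgen(image_a, image_b, text_a, text_b, lam=0.5):
        # Blend the two images pixel-wise and join the two captions;
        # the result is one new synthetic image-text training pair.
        mixed_image = lam * image_a + (1.0 - lam) * image_b
        mixed_text = text_a + " " + text_b
        return mixed_image, mixed_text

    # Toy usage: random arrays stand in for real pixel data.
    img_a = np.random.rand(224, 224, 3)
    img_b = np.random.rand(224, 224, 3)
    img, txt = mixgen(img_a, img_b, "a dog on grass", "a red bicycle")
    print(img.shape, "|", txt)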
You can also access this registry using the API (see API Docs).
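
A minimal sketch of such an API call, assuming a conventional REST interface; the base URL, route, query parameter, and response fields below are placeholders, and the real values are defined in the API Docs.

    import requests

    # Placeholder endpoint; substitute the base URL from the API Docs.
    BASE_URL = "https://example-registry.org/api/v1"

    # Query the dataset registry, filtering by the same tag as this page.
    resp = requests.get(f"{BASE_URL}/datasets", params={"tags": "vision-language"})
    resp.raise_for_status()

    for ds in resp.json().get("results", []):
        print(ds.get("name"), "-", (ds.get("description") or "")[:80])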