7 datasets found

Groups: Vision-Language

  • Chinese CLIP

    Chinese CLIP is a vision-language pre-training dataset consisting of 100 million image-text pairs.
  • BLIP2

    BLIP2 is a vision-language pre-training dataset consisting of 100 million image-text pairs.
  • WebLI Dataset

    The WebLI dataset, used for training and evaluating the CoBIT model.
  • JFT-4B Dataset

    The JFT-4B dataset, used for training and evaluating the CoBIT model.
  • ALIGN Dataset

    The ALIGN dataset, used for training and evaluating the CoBIT model.
  • CoBIT Dataset

    The dataset used for training and evaluating the CoBIT model, consisting of image-text pairs drawn from large-scale noisy web-crawled data and from image annotation data.
  • R2R-CE and RxR-CE

    The R2R-CE and RxR-CE datasets are used for vision-and-language navigation tasks in continuous environments.