7 datasets found

Groups: Vision-Language

  • Chinese CLIP

    Chinese CLIP is a vision-language pre-training dataset consisting of 100 million image-text pairs.
  • BLIP2

    BLIP2 is a vision-language pre-training dataset consisting of 100 million image-text pairs.
  • WebLI Dataset

    The WebLI dataset, used for training and evaluating the CoBIT model.
  • JFT-4B Dataset

    The JFT-4B dataset, used for training and evaluating the CoBIT model.
  • ALIGN Dataset

    The ALIGN dataset, used for training and evaluating the CoBIT model.
  • CoBIT Dataset

    The dataset used for training and evaluating the CoBIT model, consisting of image-text pairs drawn from large-scale noisy web-crawled data and from image annotation data.
  • R2R-CE and RxR-CE

    The R2R-CE and RxR-CE datasets are used for vision-and-language navigation tasks in continuous environments.