- Chinese CLIP
  A vision-language pre-training dataset consisting of 100 million image-text pairs.
- WebLI Dataset
  The WebLI dataset used for training and evaluation of the CoBIT model.
- JFT-4B Dataset
  The JFT-4B dataset used for training and evaluation of the CoBIT model.
- ALIGN Dataset
  The ALIGN dataset used for training and evaluation of the CoBIT model.
- CoBIT Dataset
  The dataset used for training and evaluation of the CoBIT model, consisting of image-text pairs from large-scale noisy web-crawled data and image annotation data.
- MixGen: A New Multi-Modal Data Augmentation
  A joint data augmentation for vision-language representation learning, intended to further improve data efficiency.
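
To illustrate what a joint image-text augmentation of this kind can look like, here is a minimal sketch. It assumes the scheme commonly associated with MixGen: two image-text pairs are combined by linearly interpolating the images and concatenating the captions. The function name `mixgen`, the `lam` and `seed` parameters, and the representation of images as flat lists of floats are illustrative assumptions, not the authors' code.

```python
import random


def mixgen(images, texts, lam=0.5, seed=None):
    """Hypothetical helper: mix every image-text pair with a random partner.

    images: list of equally sized flat lists of pixel values (an assumption
            made for brevity; real pipelines would use tensors).
    texts:  list of captions, one per image.
    lam:    interpolation weight for the original image (assumed default 0.5).
    """
    if len(images) != len(texts):
        raise ValueError("images and texts must have the same length")

    # Pick a partner for each sample by shuffling the indices.
    rng = random.Random(seed)
    partners = list(range(len(texts)))
    rng.shuffle(partners)

    # Image side: pixel-wise linear interpolation between sample and partner.
    mixed_images = [
        [lam * a + (1.0 - lam) * b for a, b in zip(images[i], images[j])]
        for i, j in enumerate(partners)
    ]
    # Text side: concatenate the two captions so both descriptions are kept.
    mixed_texts = [f"{texts[i]} {texts[j]}" for i, j in enumerate(partners)]
    return mixed_images, mixed_texts


if __name__ == "__main__":
    imgs = [[0.0, 0.0], [1.0, 1.0]]
    caps = ["a cat", "a dog"]
    new_imgs, new_caps = mixgen(imgs, caps, lam=0.5, seed=0)
    print(new_imgs, new_caps)
```

Concatenating the captions, rather than choosing one of them, keeps the text consistent with an image that now contains content from both sources. That is the reason for treating the two modalities jointly in this sketch.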