3 datasets found

Tags: image-text pairs

  • ZeroVL dataset

    The dataset used for training the ZeroVL model, consisting of 14.23M image-text pairs from various domains.
  • BLIP

    The dataset used in a paper that combines a pre-trained diffusion backbone with a pre-trained vision-language guidance model.
  • CLIP

    The CLIP model and its variants are becoming the de facto backbone for many applications. However, training a CLIP model from hundreds of millions of image-text pairs can be...