3 datasets found

Tags: image-text pairs

  • ZeroVL dataset

    The dataset used for training the ZeroVL model, consisting of 14.23M image-text pairs from various domains.
  • BLIP

    The dataset used in a paper that combines a pre-trained diffusion backbone with a pre-trained vision-language guidance model.
  • CLIP

    The CLIP model and its variants are becoming the de facto backbone for many applications. However, training a CLIP model from hundreds of millions of image-text pairs can be...