- MiniGPT-v2
  MiniGPT-v2 is a vision-language model that uses a unified interface for multi-task learning.
- Perceptual grouping in contrastive vision-language models
  A study of whether contrastive vision-language models learn perceptual grouping, i.e., whether they can localize objects and group visually related regions within an image.
- DataComp-10M
  DataComp-10M is used as a pretraining dataset.
- CC3M and CC12M
  CC3M and CC12M are used as datasets for training and evaluation.
- RANKCLIP: Ranking-Consistent Language-Image Pretraining
  Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid... (see the sketch of the standard CLIP contrastive objective after this list).
- Conceptual Captions 12M and RedCaps
  The datasets used in the paper are Conceptual Captions 12M (CC12M) and RedCaps.
- Conceptual Captions 3M, Conceptual Captions 12M, RedCaps, and LAION-400M
  The datasets used in the paper are Conceptual Captions 3M (CC3M), Conceptual Captions 12M (CC12M), RedCaps, and LAION-400M.
- Learning to prompt for vision-language models
  A method that replaces hand-crafted text prompts with learnable prompt vectors for adapting vision-language models such as CLIP to downstream tasks.
- Conceptual Captions (CC-3M)
  Conceptual Captions (CC-3M) is a large-scale dataset of roughly 3.3 million image-caption pairs.
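The RANKCLIP entry above refers to the rigid one-to-one image-text pairing of the standard CLIP objective. For reference, below is a minimal sketch of that baseline contrastive loss, not RANKCLIP's ranking-consistent objective; the function name, embedding dimension, and temperature value are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch of the standard CLIP-style contrastive objective.
# Names, dimensions, and the temperature value are illustrative assumptions.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Each image is a positive only for its own caption (a rigid one-to-one
    mapping); every other pair in the batch is treated as a negative.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [batch, batch] similarity matrix; entry (i, j) = sim(image_i, text_j).
    logits = image_emb @ text_emb.t() / temperature

    # The matching pair for each row/column sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    # Random embeddings stand in for image/text encoder outputs.
    imgs = torch.randn(8, 512)
    txts = torch.randn(8, 512)
    print(clip_contrastive_loss(imgs, txts).item())
```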