MiniGPT-v2
MiniGPT-v2 is a vision-language model that uses a unified interface for multi-task learning.
Perceptual grouping in contrastive vision-language models
A study of whether contrastive vision-language models learn perceptual grouping, i.e. grouping and localizing semantically related image regions.
InternLM-XC
InternLM-XC (InternLM-XComposer) is a vision-language large model for interleaved text-image comprehension and composition.
InternLM-XComposer2
InternLM-XComposer2 is a vision-language large model that excels in free-form text-image comprehension and composition.
InternLM-XComposer2-4KHD
InternLM-XComposer2-4KHD is a vision-language large model that supports images with any aspect ratio from 336 pixels up to 4K HD, facilitating its deployment in real-world contexts.
DataComp-10M
DataComp-10M is used as a pretraining dataset.
CC3M and CC12M
CC3M and CC12M are used as datasets for training and evaluation.
RANKCLIP: Ranking-Consistent Language-Image Pretraining
Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid...
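For context on the CLIP-style objective this entry refers to, the following is a minimal sketch of the standard symmetric image-text contrastive (InfoNCE) loss. It illustrates plain CLIP-style pretraining, not RANKCLIP's ranking-consistent variant; the encoder outputs are stand-ins (random tensors in the usage example), not any particular model's features.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    image_features, text_features: (batch, dim) embeddings from the two
    encoders; row i of each tensor is assumed to be a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (batch, batch) similarity matrix scaled by the temperature.
    logits = image_features @ text_features.t() / temperature

    # Matching pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


# Toy usage with random embeddings standing in for encoder outputs.
if __name__ == "__main__":
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(clip_contrastive_loss(img, txt).item())
```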
When and why Vision-Language Models behave like Bags-of-Words, and what to do about it?
A study of when and why vision-language models behave like bags-of-words, and how to mitigate it.
Conceptual Captions 12M and RedCaps
The datasets used in the paper are Conceptual Captions 12M (CC12M) and RedCaps.
Conceptual Captions 3M, Conceptual Captions 12M, RedCaps, and LAION-400M
The datasets used in the paper are Conceptual Captions 3M (CC3M), Conceptual Captions 12M (CC12M), RedCaps, and LAION-400M.
Learning to prompt for vision-language models
A method (CoOp) that learns a prompt's context words as continuous vectors for vision-language models, while keeping the pretrained model's parameters fixed.
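As a rough illustration of learnable prompting in this style, the sketch below prepends trainable context vectors to precomputed class-name token embeddings, which would then feed a frozen text encoder. The module, shapes, and initialization are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class PromptLearner(nn.Module):
    """Learnable prompt context in the spirit of CoOp (illustrative sketch).

    A shared set of n_ctx context vectors is prepended to each class's
    token embeddings; only these vectors are trained, while the text
    encoder (assumed frozen, not shown here) is left untouched.
    """

    def __init__(self, class_embeddings: torch.Tensor, n_ctx: int = 16):
        super().__init__()
        # class_embeddings: (num_classes, name_len, dim) token embeddings
        # of the class names, precomputed with the frozen embedding layer.
        num_classes, _, dim = class_embeddings.shape
        self.register_buffer("class_embeddings", class_embeddings)
        # Trainable context vectors shared across all classes.
        self.ctx = nn.Parameter(torch.empty(n_ctx, dim))
        nn.init.normal_(self.ctx, std=0.02)
        self.num_classes = num_classes

    def forward(self) -> torch.Tensor:
        # Broadcast the shared context to every class and prepend it.
        ctx = self.ctx.unsqueeze(0).expand(self.num_classes, -1, -1)
        # (num_classes, n_ctx + name_len, dim) prompts for the text encoder.
        return torch.cat([ctx, self.class_embeddings], dim=1)


# Toy usage: 10 classes, class names padded to 8 tokens, 512-dim embeddings.
if __name__ == "__main__":
    name_embs = torch.randn(10, 8, 512)
    learner = PromptLearner(name_embs, n_ctx=4)
    prompts = learner()
    print(prompts.shape)  # torch.Size([10, 12, 512])
```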
Conceptual Captions (CC-3M)
Conceptual Captions (CC-3M) is a large-scale dataset of roughly 3.3 million image-caption pairs.
Playhouse and AndroidEnv
The environments used in this paper are Playhouse and AndroidEnv.