CC4M

Large-scale image-text datasets used to pre-train a collaborative two-stream vision-language model for cross-modal retrieval.

BibTeX: