CC14M

A large-scale image-text dataset used to pre-train a collaborative two-stream vision-language model for cross-modal retrieval.

BibTeX: