VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix

Unpaired vision-language pre-training via cross-modal CutMix.

BibTeX: