Single-Stream Multi-Level Alignment for Vision-Language Pretraining

Self-supervised vision-language pretraining from pure images and text with a contrastive loss is effective, but ignores fine-grained alignment due to a dual-stream architecture that aligns image and text representations only on a global level.
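
For readers unfamiliar with the setup, the global-level alignment referred to above is typically implemented as a symmetric InfoNCE contrastive loss over one pooled embedding per image and per caption, as in CLIP-style dual-stream models. The following is a minimal PyTorch sketch for illustration only; it is not this paper's method, and the names image_emb, text_emb, and the temperature value are assumed placeholders:

    import torch
    import torch.nn.functional as F

    def global_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
        """Symmetric InfoNCE over one pooled embedding per image/caption.

        Supervision acts only at this global level; token- or region-level
        (fine-grained) correspondence is never directly aligned.
        """
        # L2-normalize so dot products are cosine similarities.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)

        # (B, B) similarity matrix between all image-text pairs in the batch.
        logits = image_emb @ text_emb.t() / temperature

        # Matched pairs sit on the diagonal.
        targets = torch.arange(logits.size(0), device=logits.device)

        # Contrast images against texts and texts against images.
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return (loss_i2t + loss_t2i) / 2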

Cite this as

Zaid Khan, Vijay Kumar B G, Xiang Yu, Samuel Schulter, Manmohan Chandraker, Yun Fu (2024). Dataset: Single-Stream Multi-Level Alignment for Vision-Language Pretraining. https://doi.org/10.57702/n9fbhu1z

DOI retrieved: December 16, 2024

Additional Info

Field         Value
Created       December 16, 2024
Last update   December 16, 2024
Defined In    https://doi.org/10.48550/arXiv.2203.14395
Author        Zaid Khan
More Authors  Vijay Kumar B G, Xiang Yu, Samuel Schulter, Manmohan Chandraker, Yun Fu
Homepage      https://arxiv.org/abs/2203.14395