Single-Stream Multi-Level Alignment for Vision-Language Pretraining

Self-supervised vision-language pretraining from pure images and text with a contrastive loss is effective, but ignores fine-grained alignment due to a dual-stream architecture that aligns image and text representations only on a global level.
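
For readers unfamiliar with the setup, the global-level alignment referred to above is typically implemented as a symmetric InfoNCE contrastive loss over one pooled embedding per image and per caption, as in CLIP-style dual-stream models. The following is a minimal PyTorch sketch for illustration only; it is not this paper's method, and the names image_emb, text_emb, and the temperature value are assumed placeholders:

    import torch
    import torch.nn.functional as F

    def global_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
        """Symmetric InfoNCE over one pooled embedding per image/caption.

        Supervision acts only at this global level; token- or region-level
        (fine-grained) correspondence is never directly aligned.
        """
        # L2-normalize so dot products are cosine similarities.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)

        # (B, B) similarity matrix between all image-text pairs in the batch.
        logits = image_emb @ text_emb.t() / temperature

        # Matched pairs sit on the diagonal.
        targets = torch.arange(logits.size(0), device=logits.device)

        # Contrast images against texts and texts against images.
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return (loss_i2t + loss_t2i) / 2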

Cite this as

Zaid Khan, Vijay Kumar B G, Xiang Yu, Samuel Schulter, Manmohan Chandraker, Yun Fu (2024). Dataset: Single-Stream Multi-Level Alignment for Vision-Language Pretraining. https://doi.org/10.57702/n9fbhu1z

DOI retrieved: December 16, 2024

Additional Info

Field         Value
Created       December 16, 2024
Last update   December 16, 2024
Defined In    https://doi.org/10.48550/arXiv.2203.14395
Author        Zaid Khan
More Authors  Vijay Kumar B G, Xiang Yu, Samuel Schulter, Manmohan Chandraker, Yun Fu
Homepage      https://arxiv.org/abs/2203.14395