RANKCLIP: Ranking-Consistent Language-Image Pretraining

Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RANKCLIP, a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP and its variants.
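The record itself ships no code, but the abstract's idea of moving past one-to-one image-text matching toward ranking consistency can be illustrated with a short sketch. The PyTorch snippet below pairs the standard symmetric CLIP contrastive loss with an assumed ListMLE-style consistency term that asks cross-modal similarity rankings to agree with in-modal ones. The helper names (`listmle`, `rank_consistent_loss`), the choice of listwise loss, and the weighting are assumptions for demonstration only; the authors' actual objective is defined in the linked paper and repository.

```python
# Illustrative sketch only: a ranking-consistency auxiliary term layered on top of a
# standard CLIP-style contrastive objective. The specific listwise loss below is an
# assumption made for demonstration, not the RANKCLIP authors' implementation.
import torch
import torch.nn.functional as F


def listmle(scores: torch.Tensor, target_order: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of `target_order` under a Plackett-Luce model
    parameterized by `scores`. Both tensors have shape (batch, n)."""
    ordered = torch.gather(scores, 1, target_order)  # scores arranged in target rank order
    # log P(order) = sum_k [ s_k - logsumexp(s_k, ..., s_n) ]
    rev_cumsum = torch.flip(torch.logcumsumexp(torch.flip(ordered, [1]), dim=1), [1])
    return (rev_cumsum - ordered).sum(dim=1).mean()


def rank_consistent_loss(img_emb, txt_emb, temperature=0.07, lam=0.5):
    """CLIP-style contrastive loss plus an assumed ranking-consistency term."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature  # cross-modal similarity matrix
    labels = torch.arange(img.size(0), device=img.device)

    # Standard symmetric InfoNCE term, as in CLIP.
    contrastive = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

    # Assumed consistency term: the ranking of texts for each image (by cross-modal
    # similarity) should follow the ranking induced by in-modal similarity, and
    # symmetrically for images ranked per text.
    order_from_img = (img @ img.t()).argsort(dim=1, descending=True)
    order_from_txt = (txt @ txt.t()).argsort(dim=1, descending=True)
    consistency = listmle(logits, order_from_img) + listmle(logits.t(), order_from_txt)

    return contrastive + lam * consistency


if __name__ == "__main__":
    torch.manual_seed(0)
    img_emb, txt_emb = torch.randn(8, 512), torch.randn(8, 512)
    print(rank_consistent_loss(img_emb, txt_emb).item())
```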


Cite this as

Yiming Zhang, Zhuokai Zhao, Zhaorun Chen, Zhili Feng, Zenghui Ding, Yining Sun (2024). Dataset: RANKCLIP: Ranking-Consistent Language-Image Pretraining. https://doi.org/10.57702/tk8iqkz1

DOI retrieved: December 3, 2024

Additional Info

Created: December 3, 2024
Last update: December 3, 2024
Defined in: https://doi.org/10.48550/arXiv.2404.09387
Authors: Yiming Zhang, Zhuokai Zhao, Zhaorun Chen, Zhili Feng, Zenghui Ding, Yining Sun
Homepage: https://github.com/Jam1ezhang/RankCLIP