RANKCLIP: Ranking-Consistent Language-Image Pretraining

doi:10.57702/tk8iqkz1

Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RANKCLIP, a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP and its variants.

BibTeX:

@dataset{Yiming_Zhang_and_Zhuokai_Zhao_and_Zhaorun_Chen_and_Zhili_Feng_and_Zenghui_Ding_and_Yining_Sun_2024,
    abstract = {Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RANKCLIP, a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP and its variants.},
    author = {Yiming Zhang and Zhuokai Zhao and Zhaorun Chen and Zhili Feng and Zenghui Ding and Yining Sun},
    doi = {10.57702/tk8iqkz1},
    institution = {No Organization},
    keyword = {contrastive learning, multimodal learning, vision-language models},
    month = {dec},
    publisher = {TIB},
    title = {RANKCLIP: Ranking-Consistent Language-Image Pretraining},
    url = {https://service.tib.eu/ldmservice/dataset/rankclip--ranking-consistent-language-image-pretraining},
    year = {2024}
}