Vision-by-Language for Training-Free Compositional Image Retrieval

Compositional Image Retrieval through Vision-by-Language (CIReVL) is a training-free approach to zero-shot Compositional Image Retrieval (CIR). It relies only on off-the-shelf pre-trained models and achieves strong performance across multiple CIR benchmarks.

Cite this as

Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, Zeynep Akata (2024). Dataset: Vision-by-Language for Training-Free Compositional Image Retrieval. https://doi.org/10.57702/oz7xwa3m

DOI retrieved: December 2, 2024

Additional Info

Field          Value
Created        December 2, 2024
Last update    December 2, 2024
Defined In     https://doi.org/10.48550/arXiv.2310.09291
Author         Shyamgopal Karthik
More Authors   Karsten Roth, Massimiliano Mancini, Zeynep Akata
Homepage       https://github.com/ExplainableML/Vision_by_Language