Vision-by-Language for Training-Free Compositional Image Retrieval

doi:10.57702/oz7xwa3m

Compositional Image Retrieval through Vision-by-Language (CIReVL) is a training-free approach to Zero-Shot Compositional Image Retrieval (CIR). Using only off-the-shelf pre-trained models, CIReVL achieves strong performance across multiple CIR benchmarks.
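
The training-free idea can be illustrated with a minimal sketch, assuming the pipeline described in the linked paper: caption the reference image, have a language model rewrite that caption according to the modification text, then retrieve by text-to-image similarity. Every name below (compositional_retrieve, the toy_* helpers, VOCAB, the gallery file names) is hypothetical and is not taken from the CIReVL code base. The captioner, the rewriting step, and the embeddings are hard-coded stand-ins for pre-trained models.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def compositional_retrieve(
    reference_image: str,
    modification: str,
    caption_image: Callable[[str], str],
    recompose: Callable[[str, str], str],
    embed_text: Callable[[str], Vector],
    gallery: Dict[str, Vector],
    top_k: int = 1,
) -> List[str]:
    """Rank gallery images against a (reference image, modification) query."""
    caption = caption_image(reference_image)          # image -> text
    target_caption = recompose(caption, modification)  # text + edit -> text
    query = embed_text(target_caption)                 # text -> vector
    ranked = sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)
    return ranked[:top_k]


# Toy stand-ins (assumptions): a real system would call pre-trained models here.
VOCAB = ["red", "blue", "car", "dog"]


def toy_caption_image(image: str) -> str:
    return "a red car"  # hard-coded caption for the reference image


def toy_recompose(caption: str, modification: str) -> str:
    # Hard-coded rewrite; a language model would produce the target caption.
    return "a blue car" if "blue" in modification else caption


def toy_embed_text(text: str) -> Vector:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]  # bag-of-words vector


# Gallery vectors are written by hand in the same space as the text vectors.
gallery = {
    "red_car.jpg": [1.0, 0.0, 1.0, 0.0],
    "blue_car.jpg": [0.0, 1.0, 1.0, 0.0],
    "blue_dog.jpg": [0.0, 1.0, 0.0, 1.0],
}

result = compositional_retrieve(
    "reference.jpg", "make the car blue",
    toy_caption_image, toy_recompose, toy_embed_text, gallery,
)
print(result)  # ['blue_car.jpg']
```

Because each stage is passed in as a plain callable, any stage can be swapped for a different pre-trained model without retraining, which is the property the description above emphasises.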

Data and Resources

Original Metadata (JSON)
The JSON representation of the dataset and its distributions, based on DCAT.

Cite this as

Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, Zeynep Akata (2024). Dataset: Vision-by-Language for Training-Free Compositional Image Retrieval. https://doi.org/10.57702/oz7xwa3m

DOI retrieved: December 2, 2024

Additional Info

Field	Value
Created	December 2, 2024
Last update	December 2, 2024
Defined In	https://doi.org/10.48550/arXiv.2310.09291
Author	Shyamgopal Karthik
More Authors	Karsten Roth, Massimiliano Mancini, Zeynep Akata
Homepage	https://github.com/ExplainableML/Vision_by_Language