Vision-by-Language for Training-Free Compositional Image Retrieval
Compositional Image Retrieval through Vision-by-Language (CIReVL) is a training-free approach to Zero-Shot Compositional Image Retrieval (CIR). It relies only on off-the-shelf pre-trained models and achieves strong performance across multiple CIR benchmarks.
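As a rough illustration of what a training-free, language-mediated retrieval pipeline can look like, here is a minimal sketch. It assumes three stages that the description above does not spell out: caption the reference image, rewrite that caption according to the requested modification, then rank gallery images by similarity to the rewritten caption. The names `caption_image`, `rewrite_caption`, `embed_text`, and `retrieve` are hypothetical placeholders for whichever pre-trained captioner, language model, and vision-language encoder you plug in; the toy stand-ins below only keep the example runnable and are not real models.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(
    reference_image: str,
    modification: str,
    gallery: Dict[str, Vector],
    caption_image: Callable[[str], str],
    rewrite_caption: Callable[[str, str], str],
    embed_text: Callable[[str], Vector],
    top_k: int = 1,
) -> List[str]:
    """Assumed three-stage pipeline; every model is passed in frozen, nothing is trained."""
    caption = caption_image(reference_image)                 # 1. image -> text
    target_caption = rewrite_caption(caption, modification)  # 2. text + edit -> target text
    query = embed_text(target_caption)                       # 3. target text -> embedding
    ranked = sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)
    return ranked[:top_k]


# --- Toy stand-ins (hypothetical; replace with real pre-trained models) ---
VOCAB = ["dog", "cat", "grass", "beach"]


def toy_caption_image(image_path: str) -> str:
    # Pretends a captioning model described the reference image.
    return "a dog on grass"


def toy_rewrite_caption(caption: str, modification: str) -> str:
    # Pretends a language model applied the edit "put it on a beach".
    return caption.replace("grass", "beach")


def toy_embed_text(text: str) -> Vector:
    # Bag-of-words vector over a tiny vocabulary, standing in for a text encoder.
    words = text.split()
    return [float(word in words) for word in VOCAB]


# Gallery embeddings would normally come from the matching image encoder.
toy_gallery = {
    "dog_grass.jpg": [1.0, 0.0, 1.0, 0.0],
    "dog_beach.jpg": [1.0, 0.0, 0.0, 1.0],
    "cat_beach.jpg": [0.0, 1.0, 0.0, 1.0],
}

result = retrieve(
    "reference.jpg",
    "put it on a beach",
    toy_gallery,
    toy_caption_image,
    toy_rewrite_caption,
    toy_embed_text,
)
```

Because each stage is an injected callable, any component can be swapped for a different pre-trained model without touching the retrieval logic, which is the practical appeal of a training-free design.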
BibTeX: