VGDiffZero: Text-to-Image Diffusion Models Can Be Zero-Shot Visual Grounders

VGDiffZero is a zero-shot visual grounding framework that leverages the vision-language alignment abilities of pre-trained text-to-image diffusion models.

BibTeX: