When and why Vision-Language Models behave like Bags-of-Words, and what to do about it?

BibTeX: