BURCHAK corpus
A freely available human-human dialogue dataset for the interactive learning of visually grounded word meanings, in which a tutor teaches a learner through ostensive definitions.
3DVG-Transformer
A relation-aware transformer model for visual grounding on point clouds, locating objects in 3D scenes described by natural language.
SpeechCLIP
SpeechCLIP is a framework that integrates self-supervised speech models with a pre-trained vision-and-language model (CLIP), bridging speech and text through images without transcriptions.
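The core idea, roughly, is to align utterance-level speech embeddings with a frozen CLIP embedding space through a contrastive objective. The sketch below is a minimal illustration of such speech-image contrastive alignment; the encoder stand-ins, dimensions, and variable names are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch: contrastive alignment of speech embeddings with frozen
# image embeddings (CLIP-style). Encoders are random stand-ins here;
# dimensions and names are hypothetical.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, speech_dim, clip_dim = 8, 768, 512

speech_features = torch.randn(batch, speech_dim)      # pooled self-supervised speech features
speech_proj = torch.nn.Linear(speech_dim, clip_dim)   # learned projection into CLIP space
image_embeds = torch.randn(batch, clip_dim)           # frozen CLIP image embeddings

def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE loss over paired embeddings a[i] <-> b[i]."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature          # cosine-similarity logits
    targets = torch.arange(a.size(0))         # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(speech_proj(speech_features), image_embeds)
loss.backward()   # gradients flow only into the speech-side projection here
print(float(loss))
```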
VGDiffZero: Text-to-Image Diffusion Models Can Be Zero-Shot Visual Grounders
VGDiffZero is a zero-shot visual grounding framework that leverages the vision-language alignment learned by pre-trained text-to-image diffusion models.
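In broad strokes, a diffusion-based zero-shot grounder scores each candidate region by how well a text-conditioned diffusion model explains it given the expression, then returns the best-scoring region. The snippet below only sketches that selection loop; `diffusion_alignment_score` is a hypothetical placeholder, not the paper's scoring function.

```python
# Minimal sketch of diffusion-based zero-shot grounding: score each region
# proposal against the expression and return the best one. The scoring
# function is a random placeholder for a real text-conditioned diffusion
# score (e.g., a denoising-error-based alignment measure).
import random
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)

def diffusion_alignment_score(image, box: Box, expression: str) -> float:
    """Hypothetical: higher means the region matches the expression better."""
    return random.random()

def ground_expression(image, proposals: List[Box], expression: str) -> Box:
    """Pick the proposal whose region best aligns with the expression."""
    return max(proposals, key=lambda b: diffusion_alignment_score(image, b, expression))

proposals = [(10, 20, 110, 220), (300, 40, 420, 200), (50, 60, 90, 150)]
print(ground_expression(None, proposals, "the man holding a red umbrella"))
```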
RefCOCO, RefCOCO+, and RefCOCOg
Visual grounding aims to locate a target object in an image according to a natural language expression. RefCOCO, RefCOCO+, and RefCOCOg are the standard referring-expression benchmarks for this task, all built on MS-COCO images.
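Grounding on these benchmarks is typically scored as accuracy at an intersection-over-union (IoU) threshold of 0.5 between the predicted and ground-truth boxes. Below is a small self-contained IoU and accuracy computation; the box format and example values are illustrative.

```python
# IoU-based grounding accuracy: a prediction counts as correct if its IoU
# with the ground-truth box is at least 0.5 (the usual threshold for
# RefCOCO-style evaluation). Boxes are (x1, y1, x2, y2).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def grounding_accuracy(predictions, ground_truths, threshold=0.5):
    correct = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

preds = [(10, 10, 100, 100), (0, 0, 50, 50)]
gts   = [(12, 8, 105, 98), (100, 100, 200, 200)]
print(grounding_accuracy(preds, gts))  # 0.5: first box matches, second does not
```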
Visual Genome
The Visual Genome dataset is a large-scale dataset that connects images to structured annotations, containing over 108K images, each annotated with objects, attributes, and pairwise relationships, along with region descriptions and question-answer pairs.
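These annotations are usually consumed as per-image scene graphs: objects with bounding boxes, attributes attached to objects, and subject-predicate-object relationships. The snippet below shows one plausible in-memory representation and a simple lookup; the field names are simplified illustrations, not the official release schema.

```python
# Illustrative per-image scene-graph record in the spirit of Visual Genome
# annotations (objects, attributes, relationships). Field names are
# simplified and do not follow the official JSON schema.
scene_graph = {
    "image_id": 1,
    "objects": [
        {"object_id": 10, "name": "man",   "bbox": [30, 40, 120, 300],
         "attributes": ["tall", "smiling"]},
        {"object_id": 11, "name": "horse", "bbox": [150, 80, 400, 310],
         "attributes": ["brown"]},
    ],
    "relationships": [
        {"subject_id": 10, "predicate": "riding", "object_id": 11},
    ],
}

def describe(graph):
    """Turn subject-predicate-object triples into readable strings."""
    names = {o["object_id"]: o["name"] for o in graph["objects"]}
    return [f'{names[r["subject_id"]]} {r["predicate"]} {names[r["object_id"]]}'
            for r in graph["relationships"]]

print(describe(scene_graph))  # ['man riding horse']
```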