10 datasets found

  • BURCHAK corpus

    A new freely available human-human dialogue data set for interactive learning of visually grounded word meanings through ostensive definition by a tutor to a learner.
  • 3DVG-Transformer

    A dataset for visual grounding on point clouds, focusing on relation modeling.
  • SWiG

    The SWiG dataset is a large-scale visual grounding dataset, where the task is to predict the object in an image.
  • ReferIt

    Referring image segmentation aims at localizing all pixels of the visual objects described by a natural language sentence.
  • SpeechCLIP

    SpeechCLIP is a novel framework to integrate speech SSL models with a pre-trained vision and language model.
  • RefCOCOg

    The RefCOCOg dataset is built on the MS-COCO dataset and contains 85,474 referring expressions for 54,822 objects in 26,711 images.
  • RefCOCO

    RefCOCO is a benchmark for referring expression grounding, containing 142,210 referring expressions for 50,000 referents in 19,994 images.
  • VGDiffZero: Text-to-Image Diffusion Models Can Be Zero-Shot Visual Grounders

    VGDiffZero is a zero-shot visual grounding framework that leverages pre-trained text-to-image diffusion models' vision-language alignment abilities.
  • RefCOCO, RefCOCO+, and RefCOCOg

    Visual grounding is the task of locating a target object in an image according to a natural language expression. The datasets used in this paper are RefCOCO, RefCOCO+, and RefCOCOg.
  • Visual Genome

    The Visual Genome dataset is a large-scale visual question answering dataset, containing 1.5 million images, each with 15-30 annotated entities, attributes, and relationships.