68 datasets found

  • OK-VQA

    The OK-VQA dataset is a visual question answering benchmark requiring external knowledge.
  • FVQA

    FVQA is a fact-based visual question answering dataset, containing 2,190 images and 5,826 (question, answer) pairs, with supporting facts selected from knowledge bases.
  • Visual7W

    Visual7W is a dataset for Visual Question Answering (VQA). It contains 7,000 images with 7,000 queries.
  • Mutan: Multimodal Tucker Fusion for Visual Question Answering

    The dataset used in the paper is a collection of images and corresponding referring expressions.
  • Visual ChatGPT

    Visual ChatGPT is a system that integrates different Visual Foundation Models to understand visual information and generate corresponding answers.
  • Super-CLEVR

    The Super-CLEVR dataset contains synthetic scenes of randomly placed vehicles from 5 categories (car, plane, bicycle, motorbike, bus) with various attributes (color, material,...
  • Super-CLEVR-3D

    The Super-CLEVR-3D dataset contains questions explicitly querying 3D understanding including object parts, 3D poses, and occlusions.
  • VQAvs

    VQAvs is a dataset for visual question answering, containing questions answerable from the image content.
  • VQA-CPv1 and VQA-CPv2

    VQA-CPv1 and VQA-CPv2 (VQA under Changing Priors) are re-splits of the VQA v1 and v2 datasets in which the answer distribution per question type differs between the train and test splits, designed to penalize models that rely on language priors.
  • Object Attribute Matters in Visual Question Answering

    Visual question answering is a multimodal task that requires the joint comprehension of visual and textual information. The proposed approach utilizes object attributes to...
  • SpatialSense

    A dataset for visual spatial relationship classification (VSRC) with nine well-defined spatial relations.
  • Winoground

    The Winoground dataset consists of 400 items, each containing two image-caption pairs (I₀, C₀) and (I₁, C₁).
  • VQA 1.0

    The VQA 1.0 dataset is a large-scale dataset for visual question answering, containing 15,000 images with 50,000 questions.
  • VQA

    The VQA dataset is a large-scale visual question answering dataset consisting of open-ended questions about images that require natural language answers.
  • Semantic Equivalent Adversarial Data Augmentation for Visual Question Answering

    Visual Question Answering (VQA) has achieved great success thanks to the fast development of deep neural networks (DNN). On the other hand, the data augmentation, as one of the...
  • MovieQA, TVQA, AVSD, EQA, Embodied QA

    A collection of datasets for visual question answering, including MovieQA, TVQA, AVSD, EQA, and Embodied QA.
  • Visual Spatial Reasoning

    Visual Spatial Reasoning (VSR) is a controlled probing dataset for testing vision-language models' capabilities of recognizing and reasoning about spatial relations in natural...
  • VQA v2.0

    VQA v2.0 is a balanced version of the VQA dataset in which the answers are balanced in order to minimize the effectiveness of learning dataset priors.
  • GQA

    The GQA dataset is a visual question answering dataset focused on compositional question answering and visual reasoning about real-world images.
  • TGIF-QA

    The TGIF-QA dataset consists of 165,165 QA pairs chosen from 71,741 animated GIFs. To evaluate spatiotemporal reasoning ability at the video level, the TGIF-QA dataset designs...