32 datasets found

Formats: JSON; Tags: visual question answering

  • SMART-101 dataset

    The dataset for the SMART-101 challenge consists of 101 unique puzzles that require a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning,...
  • High Quality Image Text Pairs

    The High Quality Image Text Pairs (HQITP-134M) dataset consists of 134 million diverse and high-quality images paired with descriptive captions and titles.
  • OK-VQA

    The OK-VQA dataset is a visual question answering benchmark requiring external knowledge.
  • Visual7W

    Visual7W is a dataset for Visual Question Answering (VQA) containing 7,000 images paired with 7,000 queries.
  • Visual Question Answering as Reading Comprehension

    Visual question answering (VQA) demands simultaneous comprehension of both the image visual content and natural language questions. In some cases, the reasoning needs the help...
  • Mutan: Multimodal Tucker Fusion for Visual Question Answering

    The dataset used in the paper is the VQA benchmark, a collection of images paired with natural language questions and answers.
  • VQAvs

    VQAvs is a dataset for visual question answering whose questions can be answered from the image content alone.
  • VQA-CPv1 and VQA-CPv2

    VQA-CPv1 and VQA-CPv2 (VQA under Changing Priors) are re-splits of the VQA datasets in which the answer distributions differ between the train and test sets, penalizing models that rely on language priors rather than the image.
  • Object Attribute Matters in Visual Question Answering

    Visual question answering is a multimodal task that requires the joint comprehension of visual and textual information. The proposed approach utilizes object attributes to...
  • SpatialSense

    A dataset for visual spatial relationship classification (VSRC) with nine well-defined spatial relations.
  • Winoground

    The Winoground dataset consists of 400 items, each containing two image-caption pairs (I0, C0) and (I1, C1) whose captions use the same words in a different order; see the scoring sketch after this list.
  • Multimodal Visual Patterns (MMVP) Benchmark

    The Multimodal Visual Patterns (MMVP) benchmark is a dataset used to evaluate the visual question answering capabilities of multimodal large language models (MLLMs).
  • VQA

    The VQA dataset is a large-scale visual question answering dataset consisting of images paired with open-ended questions that require natural language answers.
  • Composite Dataset

    The Composite dataset contains 11,985 human judgments over Flickr 8K, Flickr 30K, and COCO captions.
  • MovieQA, TVQA, AVSD, EQA, Embodied QA

    A collection of datasets for visual question answering, including MovieQA, TVQA, AVSD, EQA, and Embodied QA.
  • VQA v2.0

    The VQA v2.0 dataset balances the answers to each question so that models cannot score well by learning dataset priors alone; it is a standard evaluation benchmark, and its annotations are distributed as JSON (see the loading sketch after this list).
  • GQA

    The GQA dataset is a visual question answering dataset that focuses on compositional question answering and visual reasoning over real-world images.
  • Conceptual Captions 12M

    The Conceptual Captions 12M (CC-12M) dataset consists of 12 million diverse and high-quality images paired with descriptive captions and titles.
  • Sort-of-CLEVR

    Sort-of-CLEVR is a simplified visual question answering dataset of 2D scenes with colored shapes, designed to test relational reasoning.
  • CLEVR dataset

    The CLEVR dataset is a diagnostic dataset for visual question answering in which each rendered 3D scene is annotated with compositional questions.
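
The Winoground entry above references a scoring sketch. The standard Winoground evaluation reduces to three binary checks per item; the sketch below follows the metric definitions from the Winoground paper, with the similarity function `s` standing in for any image-text matching model (an assumption, since the registry entry does not fix one):

```python
from typing import Callable

def winoground_scores(s: Callable[[str, str], float],
                      c0: str, i0: str, c1: str, i1: str) -> dict:
    """Score one Winoground item given a caption-image similarity s(c, i)."""
    # Text score: each image must prefer its own caption.
    text = s(c0, i0) > s(c1, i0) and s(c1, i1) > s(c0, i1)
    # Image score: each caption must prefer its own image.
    image = s(c0, i0) > s(c0, i1) and s(c1, i1) > s(c1, i0)
    # Group score: the item counts only if both directions are right.
    return {"text": text, "image": image, "group": text and image}
```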
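
Similarly, the VQA v2.0 entry mentions a loading sketch. Since this listing is filtered to JSON-format datasets, a minimal loader, assuming the official VQA v2.0 file names and JSON field names (verify both against the files you download), might look like this:

```python
import json

# Paths follow the official VQA v2.0 file naming; adjust to wherever the
# downloaded files live. The field names below match the published VQA v2.0
# JSON schema, but this is a sketch, not the official VQA API.
with open("v2_OpenEnded_mscoco_train2014_questions.json") as f:
    questions = {q["question_id"]: q for q in json.load(f)["questions"]}

with open("v2_mscoco_train2014_annotations.json") as f:
    annotations = json.load(f)["annotations"]

# Pair each question with its consensus ("multiple choice") answer.
for ann in annotations[:5]:
    q = questions[ann["question_id"]]
    print(q["image_id"], q["question"], "->", ann["multiple_choice_answer"])
```

Keying the questions by question_id makes the join with the annotations file a constant-time lookup per record.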
You can also access this registry using the API (see API Docs).
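
For programmatic access along the lines of the note above, a request such as the following should reproduce this filtered listing. The base URL, query parameters, and response fields here are illustrative assumptions; consult the API Docs for the actual interface:

```python
import json
import urllib.request

# Hypothetical endpoint and query parameters, for illustration only;
# the real routes and parameter names are documented in the API Docs.
BASE_URL = "https://registry.example.org/api/datasets"
QUERY = "?format=JSON&tags=visual%20question%20answering"

with urllib.request.urlopen(BASE_URL + QUERY) as resp:
    results = json.load(resp)

# Assumes the API returns a JSON array of dataset records with
# "name" and "description" fields.
for dataset in results:
    print(dataset["name"], "-", dataset["description"])
```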