SMART-101 dataset
The dataset for the SMART-101 challenge consists of 101 unique puzzles that require a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning, ...
High Quality Image Text Pairs
The High Quality Image Text Pairs (HQITP-134M) dataset consists of 134 million diverse and high-quality images paired with descriptive captions and titles.
Visual Question Answering as Reading Comprehension
Visual question answering (VQA) demands simultaneous comprehension of both the image's visual content and natural language questions. In some cases, the reasoning needs the help ...
Mutan: Multimodal Tucker Fusion for Visual Question Answering
MUTAN is a multimodal fusion scheme based on a Tucker decomposition of the bilinear interaction between visual and textual features; its experiments are carried out on the VQA dataset of images paired with natural-language questions and answers.
VQA-CPv1 and VQA-CPv2
VQA-CPv1 and VQA-CPv2 are re-splits of the VQA v1 and VQA v2 datasets in which the answer distribution for each question type differs between the train and test sets, so that models relying on language priors rather than visual evidence are penalized.
Object Attribute Matters in Visual Question Answering
Visual question answering is a multimodal task that requires the joint comprehension of visual and textual information. The proposed approach utilizes object attributes to ...
SpatialSense
A dataset for visual spatial relationship classification (VSRC) with nine well-defined spatial relations.
Winoground
The Winoground dataset consists of 400 items, each containing two image-caption pairs (I0, C0) and (I1, C1); the two captions use the same words in a different order, and the task is to match each caption to its correct image.
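Winoground is typically reported with three per-item metrics (text, image, and group score) derived from a model's caption-image similarity function. A minimal sketch, assuming similarities are already computed and stored in a dict keyed by (caption index, image index):

```python
def winoground_scores(s):
    """Per-item Winoground metrics from similarities s[(c, i)].

    c, i are each 0 or 1, indexing the two captions and two images.
    text score:  each image's correct caption outscores the swapped one.
    image score: each caption's correct image outscores the swapped one.
    group score: both hold simultaneously.
    """
    text = s[(0, 0)] > s[(1, 0)] and s[(1, 1)] > s[(0, 1)]
    image = s[(0, 0)] > s[(0, 1)] and s[(1, 1)] > s[(1, 0)]
    return text, image, text and image


# A model that matches correctly on this item passes all three checks:
perfect = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}
print(winoground_scores(perfect))

# A model that ignores the image entirely (same score for both images)
# fails the strict inequalities, hence all three metrics:
blind = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.8, (1, 1): 0.2}
print(winoground_scores(blind))
```

Dataset-level scores are then the fraction of the 400 items passing each check, which is why chance-level performance on the group score is so low.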
Multimodal Visual Patterns (MMVP) Benchmark
The Multimodal Visual Patterns (MMVP) benchmark is a dataset used to evaluate the visual question answering capabilities of multimodal large language models (MLLMs).
Composite Dataset
The Composite dataset contains 11,985 human judgments of caption quality over Flickr 8K, Flickr 30K, and COCO captions.
MovieQA, TVQA, AVSD, EQA, Embodied QA
A collection of visual question answering benchmarks spanning movie and TV video QA (MovieQA, TVQA), audio-visual scene-aware dialog (AVSD), and embodied question answering (EQA, Embodied QA).
Conceptual Captions 12M
The Conceptual Captions 12M (CC-12M) dataset consists of about 12 million image-text pairs harvested from web alt-text, collected with relaxed filtering to favor scale and diversity.
Sort-of-CLEVR
Sort-of-CLEVR is a simplified variant of CLEVR for visual question answering: each image contains 2D colored shapes, paired with both relational and non-relational questions about them.
CLEVR dataset
The CLEVR dataset is a diagnostic benchmark for visual question answering built from rendered 3D scenes; each image is paired with compositional questions testing skills such as counting, comparison, attribute identification, and spatial reasoning.