SMART-101 dataset
The dataset for the SMART-101 challenge consists of 101 unique puzzles that require a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning, ...
High Quality Image Text Pairs
The High Quality Image Text Pairs (HQITP-134M) dataset consists of 134 million diverse and high-quality images paired with descriptive captions and titles.
Visual Question Answering as Reading Comprehension
Visual question answering (VQA) demands simultaneous comprehension of both the image's visual content and natural language questions. In some cases, the reasoning needs the help ...
Mutan: Multimodal Tucker Fusion for Visual Question Answering
MUTAN is a multimodal fusion scheme based on a Tucker decomposition of the bilinear interaction between visual and textual features; its experiments are carried out on the VQA dataset of images paired with natural-language questions and answers.
VQA-CPv1 and VQA-CPv2
VQA-CPv1 and VQA-CPv2 are re-splits of the VQA v1 and VQA v2 datasets in which the answer distribution for each question type differs between the train and test sets, so that models relying on language priors rather than visual evidence are penalized.
Object Attribute Matters in Visual Question Answering
Visual question answering is a multimodal task that requires the joint comprehension of visual and textual information. The proposed approach utilizes object attributes to ...
SpatialSense
A dataset for visual spatial relationship classification (VSRC) with nine well-defined spatial relations.
Winoground
The Winoground dataset consists of 400 items, each containing two image-caption pairs (I0, C0) and (I1, C1); the two captions use the same words in a different order, and the task is to match each caption to its correct image.
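Winoground is typically reported with three per-item metrics (text, image, and group score) derived from a model's caption-image similarity function. A minimal sketch, assuming similarities are already computed and stored in a dict keyed by (caption index, image index):

```python
def winoground_scores(s):
    """Per-item Winoground metrics from similarities s[(c, i)].

    c, i are each 0 or 1, indexing the two captions and two images.
    text score:  each image's correct caption outscores the swapped one.
    image score: each caption's correct image outscores the swapped one.
    group score: both hold simultaneously.
    """
    text = s[(0, 0)] > s[(1, 0)] and s[(1, 1)] > s[(0, 1)]
    image = s[(0, 0)] > s[(0, 1)] and s[(1, 1)] > s[(1, 0)]
    return text, image, text and image


# A model that matches correctly on this item passes all three checks:
perfect = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}
print(winoground_scores(perfect))

# A model that ignores the image entirely (same score for both images)
# fails the strict inequalities, hence all three metrics:
blind = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.8, (1, 1): 0.2}
print(winoground_scores(blind))
```

Dataset-level scores are then the fraction of the 400 items passing each check, which is why chance-level performance on the group score is so low.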
Multimodal Visual Patterns (MMVP) Benchmark
The Multimodal Visual Patterns (MMVP) benchmark is a dataset used to evaluate the visual question answering capabilities of multimodal large language models (MLLMs).
Composite Dataset
The Composite dataset contains 11,985 human judgments of caption quality over Flickr 8K, Flickr 30K, and COCO captions.
MovieQA, TVQA, AVSD, EQA, Embodied QA
A collection of visual question answering benchmarks spanning movie and TV video QA (MovieQA, TVQA), audio-visual scene-aware dialog (AVSD), and embodied question answering (EQA, Embodied QA).
Conceptual Captions 12M
The Conceptual Captions 12M (CC-12M) dataset consists of about 12 million image-text pairs harvested from web alt-text, collected with relaxed filtering to favor scale and diversity.
Sort-of-CLEVR
Sort-of-CLEVR is a simplified variant of CLEVR for visual question answering: each image contains 2D colored shapes, paired with both relational and non-relational questions about them.
CLEVR dataset
The CLEVR dataset is a diagnostic benchmark for visual question answering built from rendered 3D scenes; each image is paired with compositional questions testing skills such as counting, comparison, attribute identification, and spatial reasoning.