CLEVR-Humans
The CLEVR-Humans dataset consists of 32,164 questions asked by humans, containing words and reasoning steps that were unseen in CLEVR.
Image Captioning and Visual Question Answering
The dataset is used for image captioning and visual question answering.
LLaVA-Instruct-150k
LLaVA-Instruct-150k is a visual instruction-tuning dataset of roughly 150K GPT-4-generated multimodal instruction-following samples, used for visual question answering and conversation.
SMART-101 dataset
The dataset for the SMART-101 challenge consists of 101 unique puzzles that require a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning,...
CLEVR-CoGenT
The CLEVR-CoGenT dataset is a variant of CLEVR for visual question answering that tests compositional generalization: objects appear with different combinations of shapes and colors in the training and test conditions.
GQA-OOD: Out-of-Domain VQA Benchmark
GQA-OOD is a benchmark dedicated to out-of-domain VQA evaluation.
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question...
GQA is a new dataset for real-world visual reasoning and compositional question answering.
High Quality Image Text Pairs
The High Quality Image Text Pairs (HQITP-134M) dataset consists of 134 million diverse and high-quality images paired with descriptive captions and titles.
Mutan: Multimodal Tucker Fusion for Visual Question Answering
MUTAN is a multimodal fusion scheme that parameterizes the bilinear interaction between visual and textual features with a Tucker decomposition, and is evaluated on visual question answering benchmarks.
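The core idea of Tucker-style fusion can be sketched as projecting both modalities into low-rank spaces and contracting them through a shared core tensor. The sketch below is a minimal NumPy illustration of that bilinear contraction; all dimensions, ranks, and names (`tucker_fuse`, `W_q`, `W_v`, `core`, `W_o`) are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: question dim, visual dim, output dim, and core ranks.
d_q, d_v, d_o = 310, 620, 360
t_q, t_v, t_o = 160, 160, 160

W_q = rng.standard_normal((d_q, t_q))        # question projection
W_v = rng.standard_normal((d_v, t_v))        # visual projection
core = rng.standard_normal((t_q, t_v, t_o))  # shared Tucker core tensor
W_o = rng.standard_normal((t_o, d_o))        # output projection

def tucker_fuse(q, v):
    """Bilinear fusion of question and visual features through a core tensor."""
    q_p = q @ W_q                                 # (batch, t_q)
    v_p = v @ W_v                                 # (batch, t_v)
    # Contract the core with the projected question, then the projected visual.
    z = np.einsum('bi,ijk->bjk', q_p, core)       # (batch, t_v, t_o)
    z = np.einsum('bj,bjk->bk', v_p, z)           # (batch, t_o)
    return z @ W_o                                # (batch, d_o)

q = rng.standard_normal((4, d_q))
v = rng.standard_normal((4, d_v))
print(tucker_fuse(q, v).shape)  # (4, 360)
```

The low-rank projections keep the fused representation tractable: a full bilinear interaction would require a `d_q x d_v x d_o` tensor, while the core here is only `t_q x t_v x t_o`.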
Visual ChatGPT
Visual ChatGPT is a system that integrates different Visual Foundation Models to understand visual information and generate corresponding answers.
Super-CLEVR
The Super-CLEVR dataset contains synthetic scenes of randomly placed vehicles from 5 categories (car, plane, bicycle, motorbike, bus) with various attributes (color, material,...
Super-CLEVR-3D
The Super-CLEVR-3D dataset contains questions explicitly querying 3D understanding, including object parts, 3D poses, and occlusions.
VQA-CPv1 and VQA-CPv2
VQA-CPv1 and VQA-CPv2 are re-splits of the VQA datasets in which the answer distribution for each question type differs between the training and test splits, designed to penalize models that rely on language priors rather than visual evidence.
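The "changing priors" idea can be made concrete with a toy example: if the per-question-type answer distribution flips between train and test, a model that memorizes the training prior (e.g. "what sport → tennis") fails at test time. The data below is entirely hypothetical and only illustrates the split design.

```python
from collections import Counter

# Toy (question type, answer) pairs with deliberately inverted priors
# between the two splits -- hypothetical data, not real VQA-CP annotations.
train = [("what sport", "tennis")] * 8 + [("what sport", "baseball")] * 2
test  = [("what sport", "tennis")] * 2 + [("what sport", "baseball")] * 8

def priors(split):
    """Answer distribution over a split."""
    counts = Counter(ans for _, ans in split)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

print(priors(train))  # {'tennis': 0.8, 'baseball': 0.2}
print(priors(test))   # {'tennis': 0.2, 'baseball': 0.8}
```

A model that always predicts the majority training answer ("tennis") would score 80% on this toy train split but only 20% on the test split, which is exactly the failure mode VQA-CP is built to expose.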
Object Attribute Matters in Visual Question Answering
Visual question answering is a multimodal task that requires the joint comprehension of visual and textual information. The proposed approach utilizes object attributes to...