MUTAN: Multimodal Tucker Fusion for Visual Question Answering
MUTAN is a multimodal fusion scheme that combines visual and textual representations through a Tucker tensor decomposition for visual question answering; the paper evaluates on the VQA dataset of images paired with free-form natural-language questions.
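The core idea of Tucker fusion is to project both modalities into low-rank spaces and contract them with a small core tensor instead of forming a full bilinear map. A minimal NumPy sketch, with all dimensions and weight names illustrative rather than the paper's actual configuration:

```python
import numpy as np

# Illustrative dimensions, not MUTAN's actual ones.
dq, dv, do = 8, 8, 4      # question, image, and output dims
tq, tv, to = 3, 3, 4      # Tucker ranks of the core tensor

rng = np.random.default_rng(0)
Wq = rng.standard_normal((dq, tq))      # question projection
Wv = rng.standard_normal((dv, tv))      # image projection
Tc = rng.standard_normal((tq, tv, to))  # small core tensor
Wo = rng.standard_normal((to, do))      # output projection

def tucker_fusion(q, v):
    """Project each modality, contract with the core tensor, project out."""
    q_t = q @ Wq                                # (tq,)
    v_t = v @ Wv                                # (tv,)
    z = np.einsum('i,j,ijk->k', q_t, v_t, Tc)   # bilinear interaction, (to,)
    return z @ Wo                               # fused representation, (do,)

q = rng.standard_normal(dq)
v = rng.standard_normal(dv)
y = tucker_fusion(q, v)
print(y.shape)  # (4,)
```

The low-rank factors keep the parameter count at roughly `dq*tq + dv*tv + tq*tv*to + to*do` instead of the `dq*dv*do` of a full bilinear tensor.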
Visual ChatGPT
Visual ChatGPT is a system that integrates different Visual Foundation Models to understand visual information and generate corresponding answers.
Super-CLEVR
The Super-CLEVR dataset contains synthetic scenes of randomly placed vehicles from 5 categories (car, plane, bicycle, motorbike, bus) with various attributes (color, material,...
Super-CLEVR-3D
The Super-CLEVR-3D dataset contains questions explicitly querying 3D understanding including object parts, 3D poses, and occlusions.
VQA-CPv1 and VQA-CPv2
VQA-CPv1 and VQA-CPv2 are re-splits of the VQA v1 and v2 datasets with changing priors: the answer distribution per question type differs between the train and test splits, so models cannot succeed by exploiting language biases alone.
Object Attribute Matters in Visual Question Answering
Visual question answering is a multimodal task that requires the joint comprehension of visual and textual information. The proposed approach utilizes object attributes to...
SpatialSense
A dataset for visual spatial relationship classification (VSRC) with nine well-defined spatial relations.
Winoground
The Winoground dataset consists of 400 items, each containing two image-caption pairs (I0, C0), (I1, C1).
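A model is scored per item on whether it ranks each caption above the other for its paired image (text score), each image above the other for its paired caption (image score), and both at once (group score). A sketch of that scoring, assuming `s[(c, i)]` holds the model's similarity between caption Cc and image Ii:

```python
def winoground_scores(s):
    """Per-item Winoground correctness.

    s[(c, i)]: similarity between caption Cc and image Ii.
    Returns (text, image, group) booleans.
    """
    # Text score: each image prefers its own caption.
    text = s[(0, 0)] > s[(1, 0)] and s[(1, 1)] > s[(0, 1)]
    # Image score: each caption prefers its own image.
    image = s[(0, 0)] > s[(0, 1)] and s[(1, 1)] > s[(1, 0)]
    return text, image, text and image

# Example: a model that correctly matches each caption to its paired image.
sims = {(0, 0): 0.9, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.8}
print(winoground_scores(sims))  # (True, True, True)
```

The group score is the strictest of the three, which is why reported group accuracies on Winoground are typically far below the individual text and image scores.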
Semantic Equivalent Adversarial Data Augmentation for Visual Question Answering
Visual Question Answering (VQA) has achieved great success thanks to the rapid development of deep neural networks (DNNs). Meanwhile, data augmentation, one of the...
MovieQA, TVQA, AVSD, EQA, Embodied QA
A collection of datasets for visual question answering, including MovieQA, TVQA, AVSD, EQA, and Embodied QA.
Visual Spatial Reasoning
Visual Spatial Reasoning (VSR) is a controlled probing dataset for testing vision-language models' capabilities of recognizing and reasoning about spatial relations in natural...