-
SpatialSense
A dataset for visual spatial relationship classification (VSRC) with nine well-defined spatial relations. -
Winoground
The Winoground dataset consists of 400 items, each containing two image-caption pairs (I0, C0), (I1, C1). -
Semantic Equivalent Adversarial Data Augmentation for Visual Question Answering
Visual Question Answering (VQA) has achieved great success thanks to the fast development of deep neural networks (DNN). On the other hand, the data augmentation, as one of the... -
MovieQA, TVQA, AVSD, EQA, Embodied QA
A collection of datasets for visual question answering, including MovieQA, TVQA, AVSD, EQA, and Embodied QA. -
Visual Spatial Reasoning
Visual Spatial Reasoning (VSR) is a controlled probing dataset for testing vision-language models' capabilities of recognizing and reasoning about spatial relations in natural... -
Compressing and Debiasing Vision-Language Pre-Trained Models for Visual Quest...
This paper investigates whether a VLP can be compressed and debiased simultaneously by searching sparse and robust subnetworks. -
Conceptual Captions 12M
The Conceptual Captions 12M (CC-12M) dataset consists of 12 million diverse and high-quality images paired with descriptive captions and titles. -
Sort-of-CLEVR
Sort-of-CLEVR is a visual question answering dataset. -
VQA-CP v2 and VQA 2.0
VQA-CP v2 and VQA 2.0 are two standard datasets for visual question answering. -
CLEVR dataset
CLEVR is a visual question answering dataset in which each image is annotated with a question. -
Visual7W dataset
Visual7W is a visual question answering dataset consisting of images and corresponding questions. -
Extended RSVQAxBEN
The extended RSVQAxBEN dataset extends RSVQAxBEN to include all the spectral bands of Sentinel-2 images with 10 m and 20 m spatial resolution.