Compressing and Debiasing Vision-Language Pre-Trained Models for Visual Quest...
This paper investigates whether a vision-language pre-trained (VLP) model can be compressed and debiased simultaneously by searching for sparse and robust subnetworks.
Conceptual Captions 12M
The Conceptual Captions 12M (CC-12M) dataset consists of 12 million diverse, high-quality images paired with descriptive captions.
Sort-of-CLEVR
Sort-of-CLEVR is a synthetic visual question answering dataset designed to test relational reasoning.
VQA-CP v2 and VQA 2.0
VQA-CP v2 and VQA 2.0 are two standard benchmarks for visual question answering; VQA-CP v2 re-splits VQA 2.0 so that the answer distribution differs between the train and test sets.
CLEVR dataset
The CLEVR dataset is a synthetic diagnostic benchmark for visual question answering in which each image is paired with compositional questions about object attributes and relationships.
Visual7W dataset
The Visual7W dataset is a visual question answering dataset in which images are paired with seven types of "W" questions (what, where, when, who, why, how, and which).
Extended RSVQAxBEN
The extended RSVQAxBEN dataset is an extension of the RSVQAxBEN dataset that includes all spectral bands of Sentinel-2 images at 10 m and 20 m spatial resolution.
Conceptual Captions
The Conceptual Captions dataset, as used in the paper "Scaling Laws of Synthetic Images for Model Training", supports supervised image classification and zero-shot classification tasks.
Measuring Machine Intelligence through Visual Question Answering
This paper proposes visual question answering as a task for measuring machine intelligence.
VQA: Visual Question Answering
Visual Question Answering (VQA) has emerged as a prominent multi-disciplinary research problem in both academia and industry.
Hierarchical Question-Image Co-Attention for Visual Question Answering
A number of recent works have proposed attention models for Visual Question Answering (VQA) that generate spatial maps highlighting image regions relevant to answering the...