VizWiz-VQA
The VizWiz-VQA dataset is a large-scale visual question answering dataset consisting of 4,000 images, each paired with 10 crowd-worker answers.
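Because each question carries 10 reference answers, predictions on VizWiz-VQA (and on VQAv2 below) are typically scored with the standard VQA accuracy metric, where an answer counts as fully correct once at least three annotators gave it. A minimal sketch, omitting the official evaluator's answer normalization and its averaging over 10-choose-9 annotator subsets:

```python
def vqa_accuracy(prediction: str, reference_answers: list[str]) -> float:
    """Simplified VQA accuracy: min(#matching annotators / 3, 1).

    The official evaluator additionally normalizes answers (case,
    punctuation, articles) and averages over annotator subsets.
    """
    pred = prediction.strip().lower()
    matches = sum(a.strip().lower() == pred for a in reference_answers)
    return min(matches / 3.0, 1.0)

# Toy references: "blue" was given by enough annotators for full credit.
refs = ["blue", "blue", "light blue", "blue", "navy",
        "blue", "dark blue", "teal", "blue", "blue"]
print(vqa_accuracy("blue", refs))   # 1.0
print(vqa_accuracy("navy", refs))   # ~0.33 (only one annotator)
```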
VQAv2 dataset
The VQAv2 dataset contains open-ended questions about 265k images, with 5.4 questions per image on average.
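A minimal sketch of reading the official VQAv2 release, where questions and answer annotations live in separate JSON files; the filenames below follow the v2_*_mscoco_* naming from visualqa.org and depend on which split was downloaded:

```python
import json

# Questions file: {"questions": [{"image_id", "question", "question_id"}, ...]}
with open("v2_OpenEnded_mscoco_val2014_questions.json") as f:
    questions = json.load(f)["questions"]
# Annotations file holds the 10 crowd answers per question.
with open("v2_mscoco_val2014_annotations.json") as f:
    annotations = json.load(f)["annotations"]

q, a = questions[0], annotations[0]
print(q["image_id"], q["question"])
print(a["multiple_choice_answer"])               # most common answer
print([ans["answer"] for ans in a["answers"]])   # all 10 crowd answers
```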
CARETS: A Consistency And Robustness Evaluative Test Suite for VQA
CARETS is a systematic test suite to measure consistency and robustness of modern VQA models through a series of six fine-grained capability tests.
CLEVR-Humans
The CLEVR-Humans dataset consists of 32,164 questions asked by humans, containing words and reasoning steps that were unseen in CLEVR.
Image Captioning and Visual Question Answering
The dataset is used for image captioning and visual question answering.
LLaVA-Instruct-150k
LLaVA-Instruct-150k is a visual instruction-tuning dataset of 150k multimodal instruction-following samples generated with GPT-4, spanning conversation, detailed-description, and complex-reasoning prompts.
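The released data is a single JSON file of multi-turn conversation records grounded in COCO images. A sketch of iterating over it, assuming the llava_instruct_150k.json file from the official release:

```python
import json

# The file is a list of records, each with an "image" field and a
# "conversations" list alternating between human and model turns.
with open("llava_instruct_150k.json") as f:
    records = json.load(f)

rec = records[0]
print(rec["image"])  # COCO image filename the conversation refers to
for turn in rec["conversations"]:
    # "from" is "human" or "gpt"; human turns may embed an "<image>" token
    print(turn["from"], ":", turn["value"][:80])
```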
SMART-101 dataset
The dataset for the SMART-101 challenge consists of 101 unique puzzles that require a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning.
CLEVR-CoGenT
The CLEVR-CoGenT dataset is a visual question answering dataset that tests compositional generalization: objects appear with different combinations of attributes (such as shape and color) in the training and test conditions.
GQA-OOD: Out-of-Domain VQA Benchmark
GQA-OOD is a benchmark dedicated to out-of-domain VQA evaluation.
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
GQA is a dataset for real-world visual reasoning and compositional question answering. Its questions are generated from image scene graphs and are paired with functional programs that specify the reasoning steps needed to answer them.
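A sketch of inspecting a GQA question file and its attached functional program; the filename and field names below are assumptions based on the official balanced-split release, where each entry stores its program under "semantic":

```python
import json

# The question file is a dict mapping question id -> question record.
with open("val_balanced_questions.json") as f:
    questions = json.load(f)

qid, rec = next(iter(questions.items()))
print(rec["imageId"], rec["question"], "->", rec["answer"])
for step in rec["semantic"]:
    # Each program step names an operation and its argument,
    # e.g. "select" / "chair", "relate" / "left of", "query" / "color".
    print(step["operation"], step["argument"])
```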
High Quality Image Text Pairs
The High Quality Image Text Pairs (HQITP-134M) dataset consists of 134 million diverse and high-quality images paired with descriptive captions and titles.