CLEVR-Humans
The CLEVR-Humans dataset consists of 32,164 questions asked by humans, containing words and reasoning steps that were unseen in CLEVR.
LLaVA 158k
The LLaVA 158k dataset is a large-scale multimodal instruction-following dataset used for training and evaluating multimodal large language models.
Multimodal Robustness Benchmark
The MMR benchmark is designed to evaluate MLLMs' comprehension of visual content and their robustness against misleading questions, testing whether models truly leverage multimodal inputs.
Modality-Aware Integration with Large Language Models for Knowledge-based Visual Question Answering
Knowledge-based visual question answering (KVQA) has been extensively studied to answer visual questions with external knowledge, e.g., knowledge graphs (KGs).
GQA-OOD: Out-of-Domain VQA Benchmark
GQA-OOD is a benchmark dedicated to out-of-domain VQA evaluation.
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question...
GQA is a new dataset for real-world visual reasoning and compositional question answering.
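As a hedged illustration of how the released annotations can be inspected, the sketch below loads the balanced validation questions; the file name and the field names ("imageId", "question", "answer") are assumptions based on the public GQA release and should be checked against the version you download.

import json

# Load the balanced validation split (file name assumed from the official GQA download).
with open("val_balanced_questions.json") as f:
    questions = json.load(f)  # mapping: question id -> annotation dict

# Print one example; "imageId", "question", and "answer" are assumed field names.
qid, ann = next(iter(questions.items()))
print(qid, ann["imageId"], ann["question"], ann["answer"])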
VisualBERT
VisualBERT is a pre-trained model for vision-and-language tasks, built on top of PyTorch.
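Pre-trained VisualBERT checkpoints are also distributed through the HuggingFace transformers library; a minimal sketch of running one on a single question follows. The random tensor stands in for detector region features, and the number of regions (36) is an illustrative assumption.

import torch
from transformers import BertTokenizer, VisualBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

inputs = tokenizer("What color is the cube?", return_tensors="pt")

# Dummy region features with the embedding size the checkpoint expects;
# in practice these would come from an object detector such as Faster R-CNN.
num_regions = 36
visual_embeds = torch.randn(1, num_regions, model.config.visual_embedding_dim)
visual_attention_mask = torch.ones(1, num_regions, dtype=torch.long)
visual_token_type_ids = torch.ones(1, num_regions, dtype=torch.long)

outputs = model(
    **inputs,
    visual_embeds=visual_embeds,
    visual_attention_mask=visual_attention_mask,
    visual_token_type_ids=visual_token_type_ids,
)
print(outputs.last_hidden_state.shape)  # (1, text length + num_regions, hidden size)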
Task Driven Image Understanding Challenge (TDIUC)
The Task Driven Image Understanding Challenge (TDIUC) is a large VQA dataset whose questions are divided into 12 fine-grained categories, proposed to compensate for the biased distribution of question types in earlier VQA datasets.
Visual Text Question Answering (VTQA)
Visual Text Question Answering (VTQA) is a challenge with a corresponding dataset that includes 23,781 questions based on 10,124 image-text pairs.
Measuring Machine Intelligence through Visual Question Answering
This work proposes visual question answering as a task for measuring progress toward machine intelligence.