CLEVR-Humans
The CLEVR-Humans dataset consists of 32,164 questions asked by humans, containing words and reasoning steps that were unseen in CLEVR.
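A quick way to inspect the questions, assuming the released JSON splits follow CLEVR's layout (a top-level "questions" list with "question" and "answer" fields); the file name here is illustrative:

```python
import json
from collections import Counter

# File name is illustrative; CLEVR-Humans ships train/val/test JSON splits.
with open("CLEVR-Humans-train.json") as f:
    questions = json.load(f)["questions"]  # assumed CLEVR-style layout

print(len(questions))            # total human-written questions
print(questions[0]["question"])  # free-form question text

# Human questions introduce vocabulary absent from synthetic CLEVR questions.
vocab = Counter(w for q in questions for w in q["question"].lower().split())
print(len(vocab), "distinct tokens")
```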
Image Captioning and Visual Question Answering
The dataset is used for image captioning and visual question answering.
LLaVA 158k
The LLaVA 158k dataset is a large-scale multimodal instruction-following dataset used for training and evaluating multimodal large language models.
Multimodal Robustness Benchmark
The MMR benchmark is designed to evaluate MLLMs' comprehension of visual content and their robustness to misleading questions, ensuring that models truly leverage multimodal inputs.
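One way to make the benchmark's idea concrete: score a model on each straightforward question and on a misleading variant of it, and report both accuracies. This is a generic sketch, not the benchmark's official protocol; the record fields and the `model_fn` interface are assumptions.

```python
def evaluate_robustness(model_fn, records):
    """model_fn(image, question) -> answer string; stands in for an MLLM call.

    Each record pairs a plain question with a misleading variant about the
    same image (field names are assumptions, not the MMR release format).
    """
    n = len(records)
    plain = sum(model_fn(r["image"], r["question"]) == r["answer"]
                for r in records)
    misled = sum(model_fn(r["image"], r["misleading_question"]) == r["answer"]
                 for r in records)
    return {"accuracy": plain / n, "robust_accuracy": misled / n}

records = [{
    "image": "img_001.jpg",
    "question": "What color is the cup?",
    "misleading_question": "What color is the blue cup?",  # cup is not blue
    "answer": "red",
}]
print(evaluate_robustness(lambda img, q: "red", records))
```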
Modality-Aware Integration with Large Language Models for Knowledge-based Visual Question Answering
Knowledge-based visual question answering (KVQA) has been extensively studied to answer visual questions with external knowledge, e.g., knowledge graphs (KGs).
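To ground what "external knowledge" means here, a toy triple-store lookup is sketched below; the entities and facts are illustrative, and this is a generic KVQA pattern, not the paper's method.

```python
# Toy knowledge graph as (subject, relation, object) triples.
KG = [
    ("Eiffel Tower", "located_in", "Paris"),
    ("Eiffel Tower", "designed_by", "Gustave Eiffel"),
]

def retrieve_facts(entity, kg=KG):
    """Collect triples mentioning an entity detected in the image."""
    return [t for t in kg if entity in (t[0], t[2])]

# A KVQA pipeline detects an entity in the image, retrieves its facts, and
# passes them to the answer module alongside the question.
facts = retrieve_facts("Eiffel Tower")
context = " ".join(f"{s} {r.replace('_', ' ')} {o}." for s, r, o in facts)
print(context)  # "Eiffel Tower located in Paris. Eiffel Tower designed by ..."
```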
LLaVA-Instruct-150k
A visual instruction-following dataset for visual question answering and multimodal conversation.
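A minimal loading sketch, assuming the data has been downloaded as a single JSON file (e.g. from the liuhaotian/LLaVA-Instruct-150K repository on the Hugging Face Hub); the local file name and field names are assumptions based on the released format.

```python
from datasets import load_dataset

# Local file name is an assumption; the dataset is distributed as JSON.
ds = load_dataset("json", data_files="llava_instruct_150k.json", split="train")

sample = ds[0]
print(sample["image"])  # COCO image file name the instructions refer to
for turn in sample["conversations"]:  # alternating human/gpt turns
    print(turn["from"], ":", turn["value"][:80])
```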
GQA-OOD: Out-of-Domain VQA Benchmark
GQA-OOD is a benchmark dedicated to out-of-domain VQA evaluation.
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
GQA is a new dataset for real-world visual reasoning and compositional question answering.
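A loading sketch, assuming the published question files (JSON dictionaries keyed by question id, with "question", "answer", and "imageId" fields); the file name is illustrative.

```python
import json

with open("val_balanced_questions.json") as f:  # illustrative file name
    questions = json.load(f)

qid, q = next(iter(questions.items()))
print(q["question"])  # compositional question text
print(q["answer"])    # ground-truth answer
print(q["imageId"])   # id of the associated scene image
```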
VisualBERT
VisualBERT is a pre-trained model for vision-and-language tasks, built on top of PyTorch.
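A minimal usage sketch with the Hugging Face Transformers port of VisualBERT; random tensors stand in for the region features an object detector would normally supply.

```python
import torch
from transformers import BertTokenizer, VisualBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

inputs = tokenizer("What is the man eating?", return_tensors="pt")

# VisualBERT consumes pre-extracted region features; random tensors stand in
# for detector output here (36 regions, 2048-d features).
visual_embeds = torch.randn(1, 36, 2048)
inputs.update({
    "visual_embeds": visual_embeds,
    "visual_token_type_ids": torch.ones(visual_embeds.shape[:-1], dtype=torch.long),
    "visual_attention_mask": torch.ones(visual_embeds.shape[:-1], dtype=torch.float),
})

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, text_len + 36, 768)
```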
Task Driven Image Understanding Challenge (TDIUC)
The Task Driven Image Understanding Challenge (TDIUC) dataset is a large VQA dataset whose questions are divided into 12 fine-grained categories, proposed to compensate for the bias in the distribution of question types in earlier VQA datasets.
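Since TDIUC reports per-category metrics, a sketch of arithmetic mean-per-type accuracy is shown below; the official evaluation script differs, and the input format here is an assumption.

```python
from collections import defaultdict

def mean_per_type_accuracy(results):
    """Arithmetic mean of per-category accuracies.

    `results` is a list of (question_type, is_correct) pairs -- an assumed
    format, not TDIUC's official evaluation interface.
    """
    per_type = defaultdict(list)
    for qtype, correct in results:
        per_type[qtype].append(correct)
    type_accs = {t: sum(v) / len(v) for t, v in per_type.items()}
    return sum(type_accs.values()) / len(type_accs), type_accs

overall, per_type = mean_per_type_accuracy(
    [("color", True), ("color", False), ("counting", True)]
)
print(f"{overall:.3f}", per_type)  # 0.750 {'color': 0.5, 'counting': 1.0}
```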
Semantic Equivalent Adversarial Data Augmentation for Visual Question Answering
Visual Question Answering (VQA) has achieved great success thanks to the rapid development of deep neural networks (DNNs). Meanwhile, data augmentation, as one of the major techniques for training DNNs, ...