ActivityNet-QA
Video question answering (VideoQA) is an essential task in vision-language understanding that has recently attracted considerable research attention. Nevertheless, existing works...
LLaVA 158k
The LLaVA 158k dataset is a large-scale multimodal dataset used for training and evaluating multimodal large language models.
Multimodal Robustness Benchmark
The MMR benchmark is designed to evaluate MLLMs' comprehension of visual content and their robustness against misleading questions, ensuring that models truly leverage multimodal inputs.
YouTube2Text-QA
A dataset for the video question answering task, which requires machines to answer questions about videos in natural language.
Conceptual Captions
An image-caption dataset used in the paper "Scaling Laws of Synthetic Images for Model Training" for supervised image classification and zero-shot classification tasks.