Question Answering - Groups

LLaVA 158k

LLaVA 158k is a large-scale multimodal dataset used for training and testing multimodal large language models.

Dataset
JSON

Multimodal Robustness Benchmark

The MMR benchmark is designed to evaluate how well multimodal large language models (MLLMs) comprehend visual content and how robust they are against misleading questions, ensuring models truly leverage multimodal inputs...

Dataset
JSON

Multimodal Visual Patterns (MMVP) Benchmark

The Multimodal Visual Patterns (MMVP) benchmark is a dataset used to evaluate the visual question answering capabilities of multimodal large language models (MLLMs).

Dataset
JSON

LLaMA-7B

A benchmark for evaluating the perception ability of Large Vision-Language Models (LVLMs) via various subtasks and scenarios.

Dataset
JSON

Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of ...

Dysca is a dynamic and scalable benchmark for evaluating the perception ability of Large Vision-Language Models (LVLMs) via various subtasks and scenarios.

Dataset
JSON

MSRVTT-QA

Video question answering (VideoQA) requires systems to understand the visual information and infer an answer for a natural language question from it.

Dataset
JSON

Youtube2Text-QA

Video question answering is a task that requires machines to answer questions about videos in natural language.

Dataset
JSON

MSVD-QA

The MSVD-QA dataset is a benchmark for video question answering, containing 1,970 videos with multiple-choice questions.

Dataset
JSON

TGIF-QA

The TGIF-QA dataset consists of 165,165 QA pairs chosen from 71,741 animated GIFs. To evaluate the spatiotemporal reasoning ability at the video level, the TGIF-QA dataset designs...

Dataset
JSON

AVQA

The AVQA dataset contains 57,015 videos and 57,335 question-and-answer pairs.

Dataset
JSON

Music-AVQA

The Music-AVQA dataset contains 9,288 videos and 45,867 question-and-answer pairs.

Dataset
JSON

Audio-Visual Question Answering

Audio-visual question answering (AVQA) requires reference to video content and auditory information, followed by correlating the question to predict the most precise answer.

Dataset
JSON

Conceptual Captions

This dataset is used in the paper "Scaling Laws of Synthetic Images for Model Training" for supervised image classification and zero-shot classification tasks.

Dataset
JSON

13 datasets found