Video Question Answering - Groups

FIBER

A dataset for fill-in-the-blank tasks, including video question answering.

Dataset
JSON

NExT-GQA

NExT-GQA is an extension of NExT-QA with 10.5K temporal grounding (or location) labels tied to the original QA pairs.

Dataset
JSON

ActivityNet-QA

Video question answering (VideoQA) is an essential task in vision-language understanding, which has attracted numerous research attention recently. Nevertheless, existing works...

Dataset
JSON

HealthVidQA-Prompt

The HealthVidQA-Prompt dataset is a large-scale medical instructional video question-answering dataset. It contains 52,771 video-question-answer triplets from 13,990 medical...

Dataset
JSON

HealthVidQA-CRF

The HealthVidQA-CRF dataset is a large-scale medical instructional video question-answering dataset. It contains 23,434 video-question-answer triplets from 11,708 medical videos.

Dataset
JSON

KnowIT VQA

A video story question answering dataset containing 24,282 questions about 207 episodes of The Big Bang Theory.

Dataset
JSON

LifeQA

A dataset for video question answering, including video evidence selection.

Dataset
JSON

Progressive Graph Attention Network for Video Question Answering

Progressive Graph Attention Network for Video Question Answering.

Dataset
JSON

TVQA

TVQA is a video question answering dataset collected from 6 long-running TV shows from 3 genres. There are 21,793 video clips in total for QA collection, accompanied with...

Dataset
JSON

Rethinking Multi-Modal Alignment in Multi-Choice VideoQA from Feature and Sam...

Reasoning about causal and temporal event relations in videos is a new destination of Video Question Answering (VideoQA). The major stumbling block to achieve this purpose is...

Dataset
JSON

MSRVTT-QA

Video question answering (VideoQA) requires systems to understand the visual information and infer an answer for a natural language question from it.

Dataset
JSON

Slot-VLM: SlowFast Slots for Video-Language Modeling

Video-Language Models (VLMs), powered by the advancements in Large Language Models (LLMs), are charting new frontiers in video understanding. A pivotal challenge is the...

Dataset
JSON

Youtube2Text-QA

Video question answering task, which requires machines to answer questions about videos in a natural language form.

Dataset
JSON

MSVD-QA

The MSVD-QA dataset is a benchmark for video question answering, containing 1,970 videos with multiple-choice questions.

Dataset
JSON

TGIF-QA

The TGIF-QA dataset consists of 165165 QA pairs chosen from 71741 animated GIFs. To evaluate the spatiotemporal reasoning ability at the video level, TGIF-QA dataset designs...

Dataset
JSON

Zero-shot video question answering via frozen bidirectional language models

Zero-shot video question answering via frozen bidirectional language models.

Dataset
JSON

ORBIT

The ORBIT dataset is a collection of videos recorded on cell phones by people who are blind or low-vision. The dataset consists of 3,822 videos with 486 object categories...

Dataset
JSON

EgoSchema

EgoSchema is a diagnostic benchmark for assessing very long-form video-language understanding capabilities of modern multimodal systems.

Dataset
JSON

Next-QA

A video question answering dataset that focuses on visually grounded video question answering.

Dataset
JSON

MSVD

Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries. To date, most state-of-the-art TVR methods learn image-to-video transfer learning...

Dataset
JSON

21 datasets found