-
ActivityNet-QA
Video question answering (VideoQA) is an essential task in vision-language understanding that has recently attracted considerable research attention. Nevertheless, existing works... -
STAR: A Benchmark for Situated Reasoning in Real-World Videos
The STAR dataset provides 60K situated reasoning questions based on 22K trimmed situation video clips. -
AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning
The AGQA benchmark is a video question-answering dataset comprising 192M question-answer pairs, generated by hand-crafted programs, about 9.6K videos from the Charades dataset. -
Learning to Predict Situation Hyper-Graphs for Video Question Answering
The SHG-VQA model predicts a situation hyper-graph structure composed of the actions and relations present in the input video. -
HealthVidQA-Prompt
The HealthVidQA-Prompt dataset is a large-scale medical instructional video question-answering dataset. It contains 52,771 video-question-answer triplets from 13,990 medical... -
HealthVidQA-CRF
The HealthVidQA-CRF dataset is a large-scale medical instructional video question-answering dataset. It contains 23,434 video-question-answer triplets from 11,708 medical videos. -
KnowIT VQA
A video story question answering dataset containing 24,282 questions about 207 episodes of The Big Bang Theory. -
Progressive Graph Attention Network for Video Question Answering
Progressive Graph Attention Network for Video Question Answering. -
Dense-Caption Matching and Frame-Selection Gating for Temporal Localization i...
This paper proposes a video question answering model that effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions. -
Slot-VLM: SlowFast Slots for Video-Language Modeling
Video-Language Models (VLMs), powered by the advancements in Large Language Models (LLMs), are charting new frontiers in video understanding. A pivotal challenge is the... -
Youtube2Text-QA
The video question answering task requires machines to answer questions about videos in natural language. -
Zero-shot video question answering via frozen bidirectional language models
Zero-shot video question answering via frozen bidirectional language models.