13 datasets found

Groups: Video Question Answering

  • TVQA

    TVQA is a video question answering dataset collected from 6 long-running TV shows from 3 genres. There are 21,793 video clips in total for QA collection, accompanied with...
  • Rethinking Multi-Modal Alignment in Multi-Choice VideoQA from Feature and Sam...

    Reasoning about causal and temporal event relations in videos is a new destination of Video Question Answering (VideoQA). The major stumbling block to achieve this purpose is...
  • MSRVTT-QA

    Video question answering (VideoQA) requires systems to understand the visual information and infer an answer for a natural language question from it.
  • Slot-VLM: SlowFast Slots for Video-Language Modeling

    Video-Language Models (VLMs), powered by the advancements in Large Language Models (LLMs), are charting new frontiers in video understanding. A pivotal challenge is the...
  • Youtube2Text-QA

    Video question answering is a task that requires machines to answer questions about videos in natural language.
  • MSVD-QA

    The MSVD-QA dataset is a benchmark for video question answering, containing 1,970 videos with open-ended questions.
  • TGIF-QA

    The TGIF-QA dataset consists of 165,165 QA pairs chosen from 71,741 animated GIFs. To evaluate the spatiotemporal reasoning ability at the video level, the TGIF-QA dataset designs...
  • Zero-shot video question answering via frozen bidirectional language models

    Zero-shot video question answering via frozen bidirectional language models.
  • ORBIT

    The ORBIT dataset is a collection of videos recorded on cell phones by people who are blind or low-vision. The dataset consists of 3,822 videos with 486 object categories...
  • EgoSchema

    EgoSchema is a diagnostic benchmark for assessing very long-form video-language understanding capabilities of modern multimodal systems.
  • Next-QA

    A video question answering dataset that focuses on visually grounded video question answering.
  • MSVD

    Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries. To date, most state-of-the-art TVR methods learn image-to-video transfer learning...
  • MSR-VTT

    The dataset used in the paper is MSR-VTT, a large video description dataset for bridging video and language. The dataset contains 10k video clips with length varying from 10 to...
You can also access this registry using the API (see API Docs).