37 datasets found

Tags: video understanding

  • TimeIT: A Video-Centric Instruction-Tuning Dataset

    TimeIT is a video-centric instruction-tuning dataset covering 6 diverse tasks drawn from 12 widely-used academic benchmarks, for a total of 125K...
  • WildQA

    A video understanding dataset of videos recorded in outdoor settings, covering video question answering and video evidence selection.
  • MVBench

    A comprehensive multi-modal video understanding benchmark.
  • InterVid-14M-aesthetics

    InterVid-14M-aesthetics is a high-aesthetics subset of InterVid-14M, used to reduce watermark artifacts in generated videos.
  • VideoVista

    VideoVista is a comprehensive video evaluation benchmark for Video-LLMs that covers both video understanding and reasoning across 27 tasks.
  • TVQA

    TVQA is a video question answering dataset collected from 6 long-running TV shows spanning 3 genres. There are 21,793 video clips in total for QA collection, accompanied with...
  • Ask-Anything

    A video-centric multimodal instruction dataset, composed of thousands of videos associated with detailed descriptions and conversations.
  • VidOR

    The VidOR (Video Object Relation) dataset contains natural videos of daily life annotated with objects and the relations between them.
  • QVHighlights

    QVHighlights is a dataset for video highlight detection, which consists of over 10,000 videos annotated with human-written text queries.
  • UCF101-24 dataset

    The UCF101-24 dataset is a subset of the UCF101 dataset, containing 3207 videos with spatio-temporal annotations on 24 action categories.
  • MovieChat

    MovieChat: From dense token to sparse memory for long video understanding.
  • Video-Chat2

    Video-Chat2: A chat-centric video understanding model introduced together with the MVBench benchmark.
  • Video-LLaVA

    Video-LLaVA: Learning united visual representation by alignment before projection.
  • Video-Chat

    Video-Chat: Chat-centric video understanding.
  • Video-LLaMA

    Video-LLaMA: An instruction-tuned audio-visual language model for video understanding.
  • Video-ChatGPT

    Video-ChatGPT: Towards detailed video understanding via large vision and language models.
  • High-Quality Fall Simulation Dataset (HQFSD)

    The High-Quality Fall Simulation Dataset (HQFSD) is a challenging dataset for human fall detection, including multi-person scenarios, changing lighting, occlusion, and...
  • CC3M-595K

    A 595K-sample subset of the CC3M image-text dataset, used for training the Chat-UniVi model.
  • MSRVTT

    MSRVTT is a large-scale dataset for video captioning. It contains 10K video clips, each accompanied by 20 human-edited English sentence descriptions,...
  • TGIF-QA

    The TGIF-QA dataset consists of 165,165 QA pairs drawn from 71,741 animated GIFs. To evaluate spatio-temporal reasoning ability at the video level, the TGIF-QA dataset designs...
You can also access this registry using the API (see API Docs).
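A typical workflow against such an API is to fetch the dataset list as JSON and filter it by tag client-side. The endpoint and response schema below are assumptions for illustration only (the real field names and URL are in the API Docs); the sketch uses a canned payload so the filtering logic is self-contained.

```python
import json

# Hypothetical response payload; the real registry's schema may differ --
# consult the API Docs for the actual endpoint and field names.
SAMPLE_RESPONSE = json.dumps({
    "count": 3,
    "results": [
        {"name": "MVBench", "tags": ["video understanding"]},
        {"name": "TVQA", "tags": ["video understanding", "question answering"]},
        {"name": "CC3M-595K", "tags": ["instruction tuning"]},
    ],
})

def filter_by_tag(payload: str, tag: str) -> list:
    """Return names of datasets whose tag list contains `tag`."""
    data = json.loads(payload)
    return [d["name"] for d in data["results"] if tag in d["tags"]]

print(filter_by_tag(SAMPLE_RESPONSE, "video understanding"))
# ['MVBench', 'TVQA']
```

In a live setting the canned payload would be replaced by the body of an HTTP GET against the registry's datasets endpoint.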