43 datasets found

  • Charades-STA dataset

    Charades-STA is a benchmark for temporal grounding of activities: identifying the specific time intervals in which actions occur within a longer video.
  • TVQA

    TVQA is a video question answering dataset collected from 6 long-running TV shows spanning 3 genres. It contains 21,793 video clips in total, each accompanied by multiple-choice question-answer pairs grounded in both the video and its subtitles.
  • Ask-Anything

    A video-centric multimodal instruction dataset, composed of thousands of videos associated with detailed descriptions and conversations.
  • VidOR

    VidOR (Video Object Relation) is a dataset of natural, user-generated videos of daily life, annotated with objects and the relations between them.
  • VQ2D

    The VQ2D dataset is a subset of Ego4D containing ground-truth tracking annotations for each query object's last appearance.
  • EgoLoc

    EgoLoc reformulates the Ego4D VQ3D task and provides a modular pipeline that yields significant improvements on the Ego4D VQ3D benchmark.
  • PLOT-TAL - Prompt Learning with Optimal Transport for Few-Shot Temporal Action Localization

    A benchmark setting for few-shot Temporal Action Localization (TAL), addressing the inherent limitations of conventional single-prompt learning methods.
  • QVHighlights

    QVHighlights is a dataset for moment retrieval and video highlight detection, consisting of over 10,000 videos annotated with human-written free-form text queries.
  • Long Video Understanding Benchmark

    A benchmark aimed at long-form video understanding, introduced alongside a two-stream spatio-temporal attention network for long video classification.
  • MMX-Trailer-20 Dataset

    A dataset for long-form video understanding (LVU), a sub-domain of video recognition concerned with contextual information that spans contiguous shots.
  • Open Vocabulary Multi-Label Video Classification

    A dataset for multi-label video classification in which labels are drawn from an open, unbounded vocabulary rather than a fixed category set.
  • MovieChat

    MovieChat: From dense token to sparse memory for long video understanding.
  • Video-Chat2

    Video-Chat2: A chat-centric video understanding model, introduced together with the MVBench multi-modal video understanding benchmark.
  • Video-LLaVA

    Video-LLaVA: Learning united visual representation by alignment before projection.
  • Video-Chat

    Video-Chat: Chat-centric video understanding.
  • Video-LLaMA

    Video-LLaMA: An instruction-tuned audio-visual language model for video understanding.
  • Video-ChatGPT

    Video-ChatGPT: Towards detailed video understanding via large vision and language models.
  • ActivityNet, MSR-VTT, and MSVD

    A combined evaluation suite of the ActivityNet, MSR-VTT, and MSVD datasets, used for text-to-video retrieval tasks.
  • High-Quality Fall Simulation Dataset (HQFSD)

    The High-Quality Fall Simulation Dataset (HQFSD) is a challenging dataset for human fall detection, covering multi-person scenarios, changing lighting, occlusion, and other difficult conditions.
  • MSRVTT

    MSR-VTT is a large-scale dataset for video captioning. It contains 10k video clips, each accompanied by 20 human-edited English sentence descriptions.