50 datasets found

  • YouCook2

    YouCook2 consists of recipes containing labels that separate the long-horizon trajectories of demonstrations into events, with explicit time stamps for the beginning and end of...
  • TimeIT: A Video-Centric Instruction-Tuning Dataset

    TimeIT is a video-centric dataset designed for instruction tuning. It is composed of 6 diverse tasks, 12 widely-used academic benchmarks, and a total of 125K...
  • TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Und...

    TimeChat is a time-sensitive multimodal large language model specifically designed for long video understanding. It incorporates two key architectural contributions: a...
  • WildQA

    A video understanding dataset of videos recorded in outdoor settings, covering video question answering and video evidence selection.
  • MVBench

    A comprehensive multi-modal video understanding benchmark.
  • InterVid-14M-aesthetics

    InterVid-14M-aesthetics is a subset of InterVid-14M, used in the paper to remove watermarks from generated videos.
  • VideoVista

    VideoVista is a comprehensive video evaluation benchmark for Video-LLMs that covers both video understanding and reasoning across 27 tasks.
  • Charades-STA dataset

    Temporal grounding of activities, the identification of specific time intervals of actions within a larger event context, is a critical task in video understanding.
  • TVQA

    TVQA is a video question answering dataset collected from 6 long-running TV shows from 3 genres. There are 21,793 video clips in total for QA collection, accompanied by...
  • Ask-Anything

    A video-centric multimodal instruction dataset, composed of thousands of videos associated with detailed descriptions and conversations.
  • VidOR

    The VidOR dataset is a rich video dataset containing natural videos of daily life.
  • VQ2D

    The VQ2D dataset is a subset of the Ego4D dataset, containing ground truth tracking annotations for the query object's last appearance.
  • EgoLoc

    The EgoLoc dataset is a reformulation of the VQ3D task and a modular pipeline that leads to significant improvements on the Ego4D VQ3D benchmark.
  • PLOT-TAL - Prompt Learning with Optimal Transport for Few-Shot Temporal Actio...

    Temporal Action Localization (TAL) in few-shot learning. Our work addresses the inherent limitations of conventional single-prompt learning methods that often lead to...
  • QVHighlights

    QVHighlights is a dataset for video highlight detection, which consists of over 10,000 videos annotated with human-written text queries.
  • Long Video Understanding Benchmark

    Towards long-form video understanding. We propose a two-stream spatio-temporal attention network for long video classification which combines the advantages of convolutional...
  • MMX-Trailer-20 Dataset

    Long-form video understanding (LVU) is a sub-domain of video recognition concerned with understanding contextual information across contiguous shots, which can contain multiple...
  • Open Vocabulary Multi-Label Video Classification

    An open-vocabulary multi-label video classification dataset.
  • MovieChat

    MovieChat: From dense token to sparse memory for long video understanding.
  • Video-Chat2

    Video-Chat2: From dense token to sparse memory for long video understanding.