- Ask-Anything
  A video-centric multimodal instruction dataset, composed of thousands of videos associated with detailed descriptions and conversations.
- QVHighlights
  QVHighlights is a dataset for video highlight detection, consisting of over 10,000 videos annotated with human-written text queries.
- UCF101-24 dataset
  The UCF101-24 dataset is a subset of the UCF101 dataset, containing 3,207 videos with spatio-temporal annotations for 24 action categories.
- Video-Chat2
  Video-Chat2: From dense token to sparse memory for long video understanding.
- Video-LLaVA
  Video-LLaVA: Learning united visual representation by alignment before projection.
- Video-Chat
  Video-Chat: Chat-centric video understanding.
- Video-LLaMA
  Video-LLaMA: An instruction-tuned audio-visual language model for video understanding.
- Video-ChatGPT
  Video-ChatGPT: Towards detailed video understanding via large vision and language models.
- High-Quality Fall Simulation Dataset (HQFSD)
  The High-Quality Fall Simulation Dataset (HQFSD) is a challenging dataset for human fall detection, including multi-person scenarios, changing lighting, occlusion, and...
- Video-LLaMA: An instruction-tuned audio-visual language model for video understanding
  A video instruction dataset associated with Video-LLaMA, comprising 100k videos with detailed captions.
- VideoChat: Chat-centric video understanding
  A video-based instruction dataset for video understanding, comprising 100k videos with detailed captions.
- Valley: A Video Assistant with Large Language Model Enhanced Ability
  A large multi-modal instruction-following dataset for video understanding, comprising 37k conversation pairs, 26k complex reasoning QA pairs, and 10k detail description...
- Epic-Kitchens-100
  The Epic-Kitchens-100 dataset contains 97 verb classes and 300 noun classes, with actions defined by combining a verb and a noun; see the sketch after this list.
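
To make the verb-noun composition in Epic-Kitchens-100 concrete, here is a minimal Python sketch. The class names below are a small illustrative subset (the full taxonomy has 97 verb classes and 300 noun classes), and the function name is hypothetical rather than part of any official Epic-Kitchens toolkit.

```python
from itertools import product

# Illustrative subsets only; the full Epic-Kitchens-100 taxonomy
# has 97 verb classes and 300 noun classes.
VERB_CLASSES = ["take", "put", "wash", "cut"]
NOUN_CLASSES = ["knife", "onion", "plate"]


def compose_action(verb: str, noun: str) -> str:
    """Build an action label from a verb class and a noun class, e.g. 'cut onion'."""
    return f"{verb} {noun}"


# The verb x noun product spans the space of composable actions; only a subset
# of these combinations actually occurs in the annotations.
candidate_actions = [compose_action(v, n) for v, n in product(VERB_CLASSES, NOUN_CLASSES)]
print(len(candidate_actions))  # 4 verbs x 3 nouns -> 12 composed labels in this toy example
```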