-
TimeIT: A Video-Centric Instruction-Tuning Dataset
TimeIT is a video-centric dataset designed for instruction tuning. It comprises 6 diverse tasks, 12 widely used academic benchmarks, and a total of 125K... -
InterVid-14M-aesthetics
InterVid-14M-aesthetics is a subset of InterVid-14M, used in the paper to remove watermarks from generated videos. -
VideoVista
VideoVista is a comprehensive video evaluation benchmark for Video-LLMs that covers both video understanding and reasoning across 27 tasks. -
Ask-Anything
Ask-Anything is a video-centric multimodal instruction dataset composed of thousands of videos paired with detailed descriptions and conversations. -
QVHighlights
QVHighlights is a dataset for video highlight detection, which consists of over 10,000 videos annotated with human-written text queries. -
UCF101-24 dataset
UCF101-24 is a subset of the UCF101 dataset, containing 3,207 videos with spatio-temporal annotations for 24 action categories. -
Video-Chat2
Video-Chat2: From dense token to sparse memory for long video understanding. -
Video-LLaVA
Video-LLaVA: Learning united visual representation by alignment before projection. -
Video-Chat
Video-Chat: Chat-centric video understanding. -
Video-LLaMA
Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. -
Video-ChatGPT
Video-ChatGPT: Towards detailed video understanding via large vision and language models. -
High-Quality Fall Simulation Dataset (HQFSD)
The High-Quality Fall Simulation Dataset (HQFSD) is a challenging dataset for human fall detection, including multi-person scenarios, changing lighting, occlusion, and...