- TimeIT: A Video-Centric Instruction-Tuning Dataset
TimeIT is a video-centric dataset designed for instruction tuning. It is composed of 6 diverse tasks, 12 widely used academic benchmarks, and a total of 125K...
- TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Und...
TimeChat is a time-sensitive multimodal large language model specifically designed for long video understanding. It incorporates two key architectural contributions: a...
- InterVid-14M-aesthetics
InterVid-14M-aesthetics is a subset of InterVid-14M used to remove watermarks from generated videos.
- VideoVista
VideoVista is a comprehensive video evaluation benchmark for Video-LLMs that covers both video understanding and reasoning across 27 tasks.
- Charades-STA dataset
Temporal grounding of activities, the identification of specific time intervals of actions within a larger event context, is a critical task in video understanding.
- Ask-Anything
A video-centric multimodal instruction dataset composed of thousands of videos associated with detailed descriptions and conversations.
- PLOT-TAL - Prompt Learning with Optimal Transport for Few-Shot Temporal Actio...
Temporal Action Localization (TAL) in few-shot learning. Our work addresses the inherent limitations of conventional single-prompt learning methods that often lead to...
- QVHighlights
QVHighlights is a dataset for video highlight detection, which consists of over 10,000 videos annotated with human-written text queries.
- Long Video Understanding Benchmark
Towards long-form video understanding. We propose a two-stream spatio-temporal attention network for long video classification which combines the advantages of convolutional...
- MMX-Trailer-20 Dataset
Long form video understanding (LVU) is a sub-domain of video recognition concerned with understanding contextual information across contiguous shots which can contain multiple...
- Open Vocabulary Multi-Label Video Classification
An open-vocabulary multi-label video classification dataset.
- Video-Chat2
Video-Chat2: From dense token to sparse memory for long video understanding.