-
Charades-STA dataset
Temporal grounding of activities, the identification of specific time intervals of actions within a larger event context, is a critical task in video understanding. -
Ask-Anything
A video-centric multimodal instruction dataset, composed of thousands of videos associated with detailed descriptions and conversations. -
PLOT-TAL - Prompt Learning with Optimal Transport for Few-Shot Temporal Actio...
Temporal Action Localization (TAL) in few-shot learning. Our work addresses the inherent limitations of conventional single-prompt learning methods that often lead to... -
QVHighlights
QVHighlights is a dataset for video highlight detection, which consists of over 10,000 videos annotated with human-written text queries. -
Long Video Understanding Benchmark
Towards long-form video understanding. We propose a two-stream spatio-temporal attention network for long video classification which combines the advantages of convolutional... -
MMX-Trailer-20 Dataset
Long form video understanding (LVU) is a sub-domain of video recognition concerned with understanding contextual information across contiguous shots which can contain multiple... -
Open Vocabulary Multi-Label Video Classification
Open vocabulary multi-label video classification dataset -
Video-Chat2
Video-Chat2: From dense token to sparse memory for long video understanding. -
Video-LLaVA
Video-LLaVA: Learning united visual representation by alignment before projection. -
Video-Chat
Video-Chat: Chat-centric video understanding. -
Video-LLaMA
Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. -
Video-ChatGPT
Video-ChatGPT: Towards detailed video understanding via large vision and language models. -
ActivityNet, MSR-VTT, and MSVD
The paper uses the ActivityNet, MSR-VTT, and MSVD datasets for text-to-video retrieval. -
High-Quality Fall Simulation Dataset (HQFSD)
The High-Quality Fall Simulation Dataset (HQFSD) is a challenging dataset for human fall detection, including multi-person scenarios, changing lighting, occlusion, and...