- Ask-Anything
  A video-centric multimodal instruction dataset, composed of thousands of videos associated with detailed descriptions and conversations.
- QVHighlights
  QVHighlights is a dataset for video highlight detection, consisting of over 10,000 videos annotated with human-written text queries.
- UCF101-24 dataset
  The UCF101-24 dataset is a subset of the UCF101 dataset, containing 3,207 videos with spatio-temporal annotations for 24 action categories.
- Video-Chat2
  Video-Chat2: From dense token to sparse memory for long video understanding.
- Video-LLaVA
  Video-LLaVA: Learning united visual representation by alignment before projection.
- Video-Chat
  Video-Chat: Chat-centric video understanding.
- Video-LLaMA
  Video-LLaMA: An instruction-tuned audio-visual language model for video understanding.
- Video-ChatGPT
  Video-ChatGPT: Towards detailed video understanding via large vision and language models.
- High-Quality Fall Simulation Dataset (HQFSD)
  The High-Quality Fall Simulation Dataset (HQFSD) is a challenging dataset for human fall detection, including multi-person scenarios, changing lighting, occlusion, and...
- Video-LLaMA: An instruction-tuned audio-visual language model for video understanding
  A video instruction dataset associated with Video-LLaMA, comprising 100k videos with detailed captions.
- VideoChat: Chat-centric video understanding
  A video-based instruction dataset for video understanding, comprising 100k videos with detailed captions.
- Valley: A Video Assistant with Large Language Model Enhanced Ability
  A large multi-modal instruction-following dataset for video understanding, comprising 37k conversation pairs, 26k complex reasoning QA pairs, and 10k detail description...
- Epic-Kitchens-100
  The Epic-Kitchens-100 dataset contains 97 verb classes and 300 noun classes, with actions defined by combining a verb and a noun; see the sketch after this list.
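
To make the verb-noun composition in Epic-Kitchens-100 concrete, here is a minimal Python sketch. The class names below are a small illustrative subset (the full taxonomy has 97 verb classes and 300 noun classes), and the function name is hypothetical rather than part of any official Epic-Kitchens toolkit.

```python
from itertools import product

# Illustrative subsets only; the full Epic-Kitchens-100 taxonomy
# has 97 verb classes and 300 noun classes.
VERB_CLASSES = ["take", "put", "wash", "cut"]
NOUN_CLASSES = ["knife", "onion", "plate"]


def compose_action(verb: str, noun: str) -> str:
    """Build an action label from a verb class and a noun class, e.g. 'cut onion'."""
    return f"{verb} {noun}"


# The verb x noun product spans the space of composable actions; only a subset
# of these combinations actually occurs in the annotations.
candidate_actions = [compose_action(v, n) for v, n in product(VERB_CLASSES, NOUN_CLASSES)]
print(len(candidate_actions))  # 4 verbs x 3 nouns -> 12 composed labels in this toy example
```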