-
ATLAS Dione
ATLAS Dione is a dataset for RAS video understanding, providing video data of ten surgeons performing six different surgical tasks on the daVinci Surgical System (dVSS... -
Video-LLaMA: An instruction-tuned audio-visual language model for video under...
A Video-LLaMA model for video understanding, comprising 100k videos with detailed captions. -
VideoChat: Chat-centric video understanding
A video-based instruction dataset for video understanding, comprising 100k videos with detailed captions. -
Valley: A Video Assistant with Large Language Model Enhanced Ability
A large multi-modal instruction-following dataset for video understanding, comprising 37k conversation pairs, 26k complex reasoning QA pairs and 10k detail description... -
Temporally-Adaptive Convolutions for Video Understanding
Spatial convolutions are extensively used in numerous deep video models. They fundamentally assume spatio-temporal invariance, i.e., using shared weights for every location in... -
Visual Semantic Role Labeling for Video Understanding
Visual Semantic Role Labeling for Video Understanding. -
FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos
Fine-grained adaptation of the popular CLIP model across multiple datasets. -
Motion-Guided Masking for Spatiotemporal Representation Learning
The authors used several video benchmarks, including Kinetics-400 and Something-Something V2, to evaluate their proposed motion-guided masking algorithm. -
Kinetics-400 and Kinetics-600
The Kinetics-400 and Kinetics-600 datasets are video understanding datasets used for learning rich and multi-scale spatiotemporal semantics from high-dimensional videos. -
VideoStreaming
A novel approach to tackle the complexities of long video understanding with large language models (LLMs). Our proposed memory-propagated streaming encoding architecture... -
Kinetics-400, UCF101, HMDB51, Something-Something V1, and Something-Something V2
The Kinetics-400, UCF101, HMDB51, Something-Something V1, and Something-Something V2 datasets are used for evaluating the performance of the Bi-Calibration Networks. -
Kinetics-400
Motion has been shown to be useful for video understanding, where motion is typically represented by optical flow. However, computing flow from video frames is very time-consuming....