Something-Something-V1 and V2
The Something-Something-V1 and V2 datasets contain 174 human action categories, with 108K and 220K videos respectively.
Temporally-Adaptive Convolutions for Video Understanding
Spatial convolutions are extensively used in numerous deep video models. They fundamentally assume spatio-temporal invariance, i.e., using shared weights for every location in...
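As a minimal sketch of the idea behind temporal adaptivity (not the paper's implementation; all names here are illustrative), a shared base kernel can be modulated by a per-frame calibration factor, so the effective convolution weights differ across time steps while the plain shared-weight case is recovered when the calibration is constant:

```python
import numpy as np

def tada_conv1d(frames, base_kernel, calib):
    """Illustrative per-frame adaptive 1-D convolution.

    frames:      (T, L) array - T frames, each a 1-D signal of length L
    base_kernel: (K,) kernel shared across frames
    calib:       (T,) per-frame factors that modulate the kernel
    """
    out = []
    for t, frame in enumerate(frames):
        w = base_kernel * calib[t]                      # frame-specific effective kernel
        out.append(np.convolve(frame, w, mode="same"))  # same spatial size as input
    return np.stack(out)

frames = np.ones((3, 5))
base = np.array([1.0, 1.0, 1.0])

# Constant calibration reduces to an ordinary shared-weight convolution.
shared = tada_conv1d(frames, base, np.ones(3))

# Varying calibration gives each frame its own effective weights.
adaptive = tada_conv1d(frames, base, np.array([1.0, 2.0, 0.5]))
```

Here the calibration factors are fixed scalars for clarity; in an adaptive design they would be predicted from the input itself.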
Kinetics-400 and Kinetics-600
The Kinetics-400 and Kinetics-600 datasets are video understanding datasets used for learning rich and multi-scale spatiotemporal semantics from high-dimensional videos.
Kinetics-400, UCF101, HMDB51, Something-Something V1, and Something-Something V2
The Kinetics-400, UCF101, HMDB51, Something-Something V1, and Something-Something V2 datasets are used for evaluating the performance of the Bi-Calibration Networks.
Kinetics-400
Motion has been shown to be useful for video understanding, where motion is typically represented by optical flow. However, computing flow from video frames is very time-consuming....
TOPA: Extend Large Language Models for Video Understanding via Text-Only Pre-...
TOPA is a text-only pre-alignment framework for extending large language models for video understanding without the need for pre-training on real video data.