32 datasets found

Tags: video understanding

  • Something-Something-V1 and V2

    The Something-Something-V1 and V2 datasets contain 174 human action categories, with 108K and 220K videos respectively.
  • Temporally-Adaptive Convolutions for Video Understanding

    Spatial convolutions are extensively used in numerous deep video models. They fundamentally assume spatio-temporal invariance, i.e., shared weights for every location in...
  • UniFormer

    UniFormer is a unified transformer model for video understanding, designed to learn rich multi-scale spatiotemporal semantics from high-dimensional videos.
  • Kinetics-400 and Kinetics-600

    The Kinetics-400 and Kinetics-600 datasets are large-scale video understanding benchmarks covering 400 and 600 human action classes respectively, used for learning rich multi-scale spatiotemporal semantics from high-dimensional videos.
  • Kinetics-400, UCF101, HMDB51, Something-Something V1, and Something-Something V2

    The Kinetics-400, UCF101, HMDB51, Something-Something V1, and Something-Something V2 datasets are used for evaluating the performance of the Bi-Calibration Networks.
  • HMDB-51

    Motion has been shown to be useful for video understanding, where it is typically represented by optical flow. However, computing flow from video frames is very time-consuming....
  • Kinetics-400

    Motion has been shown to be useful for video understanding, where it is typically represented by optical flow. However, computing flow from video frames is very time-consuming....
  • MSVD

    Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries. To date, most state-of-the-art TVR methods leverage image-to-video transfer learning...
  • MSR-VTT

    MSR-VTT is a large video description dataset for bridging video and language. It contains 10k video clips with lengths varying from 10 to...
  • TextVid

    The TextVid dataset is a textual video dataset automatically generated by advanced LLMs.
  • TOPA: Extend Large Language Models for Video Understanding via Text-Only Pre-...

    TOPA is a text-only pre-alignment framework for extending large language models for video understanding without the need for pre-training on real video data.
  • UCF101

    The UCF101 dataset contains 13,320 videos distributed across 101 action categories. Unlike the datasets above, it consists mostly of coarse sports activities...
You can also access this registry using the API (see API Docs).
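A programmatic query might look like the following sketch. The endpoint URL, query parameters (`tag`, `page`), and JSON response shape here are assumptions for illustration only; consult the registry's API Docs for the actual interface.

```python
# Hypothetical sketch of querying a dataset registry API.
# BASE_URL, parameter names, and the response shape are placeholders,
# not the registry's real interface.
import json
from urllib.parse import urlencode

BASE_URL = "https://example.org/api/datasets"  # placeholder endpoint

def build_query_url(tag: str, page: int = 1) -> str:
    """Construct a search URL that filters datasets by tag."""
    return f"{BASE_URL}?{urlencode({'tag': tag, 'page': page})}"

def parse_results(payload: str) -> list[str]:
    """Extract dataset names from a JSON response of the assumed shape."""
    data = json.loads(payload)
    return [item["name"] for item in data.get("results", [])]

# Example with a mocked response (no network call is made):
sample = json.dumps({
    "count": 32,
    "results": [{"name": "UCF101"}, {"name": "HMDB-51"}],
})
print(build_query_url("video understanding"))
print(parse_results(sample))
```

Fetching `build_query_url(...)` with any HTTP client and feeding the body to `parse_results` would then list the matching dataset names page by page.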