Video Understanding - Groups

ATLAS Dione

ATLAS Dione is a dataset for robot-assisted surgery (RAS) video understanding, providing video data of ten surgeons performing six different surgical tasks on the da Vinci Surgical System (dVSS...

TGIF-QA

The TGIF-QA dataset consists of 165,165 QA pairs collected from 71,741 animated GIFs. To evaluate spatiotemporal reasoning at the video level, the TGIF-QA dataset designs...

Video-LLaMA: An instruction-tuned audio-visual language model for video under...

The dataset associated with Video-LLaMA, an instruction-tuned audio-visual language model for video understanding, comprising 100k videos with detailed captions.

VideoChat: Chat-centric video understanding

A video-based instruction dataset for video understanding, comprising 100k videos with detailed captions.

Valley: A Video Assistant with Large Language Model Enhanced Ability

A large multi-modal instruction-following dataset for video understanding, comprising 37k conversation pairs, 26k complex reasoning QA pairs and 10k detail description...

CSM

A dataset for video understanding, containing images and videos.

MovieNet

A dataset for video understanding, containing images and videos.

Temporally-Adaptive Convolutions for Video Understanding

Spatial convolutions are extensively used in numerous deep video models. They fundamentally assume spatio-temporal invariance, i.e., using shared weights for every location in...

Visual Semantic Role Labeling for Video Understanding

Visual Semantic Role Labeling for Video Understanding.

FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos

Fine-grained adaptation of the popular CLIP model across multiple datasets.

Motion-Guided Masking for Spatiotemporal Representation Learning

The authors used several video benchmarks, including Kinetics-400 and Something-Something V2, to evaluate their proposed motion-guided masking algorithm.

UniFormer

The UniFormer dataset is a video understanding dataset used for learning rich and multi-scale spatiotemporal semantics from high-dimensional videos.

Kinetics-400 and Kinetics-600

The Kinetics-400 and Kinetics-600 datasets are video understanding datasets used for learning rich and multi-scale spatiotemporal semantics from high-dimensional videos.

VideoStreaming

VideoStreaming is a novel approach to tackling the complexities of long video understanding with large language models (LLMs). Our proposed memory-propagated streaming encoding architecture...

Kinetics-400, UCF101, HMDB51, Something-Something V1, and Something-Something V2

The Kinetics-400, UCF101, HMDB51, Something-Something V1, and Something-Something V2 datasets are used to evaluate the performance of Bi-Calibration Networks.

HMDB-51

Motion has been shown to be useful for video understanding, where motion is typically represented by optical flow. However, computing flow from video frames is very time-consuming....

Kinetics-400

Motion has been shown to be useful for video understanding, where motion is typically represented by optical flow. However, computing flow from video frames is very time-consuming....

MSVD

Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries. To date, most state-of-the-art TVR methods learn image-to-video transfer learning...

MSR-VTT

The dataset used in the paper is MSR-VTT, a large video description dataset for bridging video and language. The dataset contains 10k video clips with length varying from 10 to...

TextVid

The TextVid dataset is a textual video dataset automatically generated by advanced LLMs.

43 datasets found