Video Understanding - Groups

TVQA

TVQA is a video question answering dataset collected from 6 long-running TV shows spanning 3 genres. There are 21,793 video clips in total for QA collection, accompanied by...

Ask-Anything

A video-centric multimodal instruction dataset, composed of thousands of videos associated with detailed descriptions and conversations.

VidOR

The VidOR dataset is a rich video dataset containing natural videos of daily life.

QVHighlights

QVHighlights is a dataset for video highlight detection, which consists of over 10,000 videos annotated with human-written text queries.

MovieChat

MovieChat: From dense token to sparse memory for long video understanding.

Video-Chat2

Video-Chat2: A video chat model introduced with MVBench, a comprehensive multi-modal video understanding benchmark.

Video-LLaVA

Video-LLaVA: Learning united visual representation by alignment before projection.

Video-Chat

Video-Chat: Chat-centric video understanding.

Video-LLaMA

Video-LLaMA: An instruction-tuned audio-visual language model for video understanding.

Video-ChatGPT

Video-ChatGPT: Towards detailed video understanding via large vision and language models.

High-Quality Fall Simulation Dataset (HQFSD)

The High-Quality Fall Simulation Dataset (HQFSD) is a challenging dataset for human fall detection, including multi-person scenarios, changing lighting, occlusion, and...

MSRVTT

MSRVTT is a large-scale dataset for video captioning. It contains 10k video clips, each accompanied by 20 human-edited English sentence descriptions,...

TGIF-QA

The TGIF-QA dataset consists of 165,165 QA pairs drawn from 71,741 animated GIFs. To evaluate spatiotemporal reasoning ability at the video level, the TGIF-QA dataset designs...

Video-LLaMA: An instruction-tuned audio-visual language model for video under...

Data associated with the Video-LLaMA model for video understanding, comprising 100k videos with detailed captions.

VideoChat: Chat-centric video understanding

A video-based instruction dataset for video understanding, comprising 100k videos with detailed captions.

Valley: A Video Assistant with Large Language Model Enhanced Ability

A large multi-modal instruction-following dataset for video understanding, comprising 37k conversation pairs, 26k complex reasoning QA pairs and 10k detail description...

Temporally-Adaptive Convolutions for Video Understanding

Spatial convolutions are extensively used in numerous deep video models. They fundamentally assume spatio-temporal invariance, i.e., using shared weights for every location in...

UniFormer

The UniFormer dataset is a video understanding dataset used for learning rich and multi-scale spatiotemporal semantics from high-dimensional videos.

Kinetics-400 and Kinetics-600

The Kinetics-400 and Kinetics-600 datasets are video understanding datasets used for learning rich and multi-scale spatiotemporal semantics from high-dimensional videos.

Kinetics-400, UCF101, HMDB51, Something-Something V1, and Something-Something V2

The Kinetics-400, UCF101, HMDB51, Something-Something V1, and Something-Something V2 datasets are used to evaluate the performance of Bi-Calibration Networks.

27 datasets found