43 datasets found

Group: Video Understanding
  • Charades-STA dataset

    Charades-STA is a benchmark for temporal grounding of activities: identifying the specific time intervals in which actions occur within a larger event context, a critical task in video understanding.
  • TVQA

    TVQA is a video question answering dataset collected from 6 long-running TV shows spanning 3 genres. It contains 21,793 video clips in total for QA collection, accompanied by...
  • Ask-Anything

    A video-centric multimodal instruction dataset, composed of thousands of videos associated with detailed descriptions and conversations.
  • VidOR

    VidOR (Video Object Relation) is a dataset of natural videos of daily life, annotated with objects and the relations between them.
  • VQ2D

    The VQ2D dataset is a subset of the Ego4D dataset, providing ground-truth tracking annotations for a query object's last appearance.
  • EgoLoc

    EgoLoc reformulates the VQ3D task and introduces a modular pipeline that yields significant improvements on the Ego4D VQ3D benchmark.
  • PLOT-TAL - Prompt Learning with Optimal Transport for Few-Shot Temporal Action Localization

    Addresses Temporal Action Localization (TAL) in the few-shot setting. The work targets the inherent limitations of conventional single-prompt learning methods that often lead to...
  • QVHighlights

    QVHighlights is a dataset for video highlight detection, which consists of over 10,000 videos annotated with human-written text queries.
  • Long Video Understanding Benchmark

    Towards long-form video understanding. Proposes a two-stream spatio-temporal attention network for long video classification that combines the advantages of convolutional...
  • MMX-Trailer-20 Dataset

    Long-form video understanding (LVU) is a sub-domain of video recognition concerned with understanding contextual information across contiguous shots, which can contain multiple...
  • Open Vocabulary Multi-Label Video Classification

    A dataset for open-vocabulary multi-label video classification.
  • MovieChat

    MovieChat: From dense token to sparse memory for long video understanding.
  • Video-Chat2

    Video-Chat2: A chat-centric video understanding model that succeeds Video-Chat, introduced alongside the MVBench benchmark.
  • Video-LLaVA

    Video-LLaVA: Learning united visual representation by alignment before projection.
  • Video-Chat

    Video-Chat: Chat-centric video understanding.
  • Video-LLaMA

    Video-LLaMA: An instruction-tuned audio-visual language model for video understanding.
  • Video-ChatGPT

    Video-ChatGPT: Towards detailed video understanding via large vision and language models.
  • ActivityNet, MSR-VTT, and MSVD

    The datasets used in the paper are ActivityNet, MSR-VTT, and MSVD, employed for text-to-video retrieval tasks.
  • High-Quality Fall Simulation Dataset (HQFSD)

    The High-Quality Fall Simulation Dataset (HQFSD) is a challenging dataset for human fall detection, including multi-person scenarios, changing lighting, occlusion, and...
  • MSRVTT

    MSR-VTT is a large-scale dataset for video captioning. It contains 10k video clips, each accompanied by 20 human-edited English sentence descriptions,...