Multimodal Learning - Groups

YouCook2

YouCook2 consists of recipes containing labels that separate the long-horizon trajectories of demonstrations into events, with explicit timestamps for the beginning and end of...

Dataset
JSON

TimeIT: A Video-Centric Instruction-Tuning Dataset

TimeIT is a video-centric dataset designed for instruction tuning. It is composed of 6 diverse tasks, 12 widely used academic benchmarks, and a total of 125K...

Dataset
JSON

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Und...

TimeChat is a time-sensitive multimodal large language model specifically designed for long video understanding. It incorporates two key architectural contributions: a...

Dataset
JSON

QVHighlights

QVHighlights is a dataset for video highlight detection, which consists of over 10,000 videos annotated with human-written text queries.

Dataset
JSON

TGIF-QA

The TGIF-QA dataset consists of 165,165 QA pairs chosen from 71,741 animated GIFs. To evaluate spatiotemporal reasoning ability at the video level, the TGIF-QA dataset designs...

Dataset
JSON

Video-LLaMA: An instruction-tuned audio-visual language model for video under...

A Video-LLaMA model for video understanding, comprising 100k videos with detailed captions.

Dataset
JSON

VideoChat: Chat-centric video understanding

A video-based instruction dataset for video understanding, comprising 100k videos with detailed captions.

Dataset
JSON

Valley: A Video Assistant with Large Language Model Enhanced Ability

A large multi-modal instruction-following dataset for video understanding, comprising 37k conversation pairs, 26k complex reasoning QA pairs and 10k detail description...

Dataset
JSON

MSR-VTT

MSR-VTT is a large video description dataset for bridging video and language. The dataset contains 10k video clips with length varying from 10 to...

Dataset
JSON

9 datasets found