-
MAD: A Large-Scale Benchmark for Long-Form Video Temporal Grounding
MAD: A large-scale benchmark for long-form video temporal grounding, containing over 384K natural language queries derived from high-quality audio description of mainstream... -
Quda: Natural Language Queries for Visual Data Analytics
A dataset of natural language queries for visual data analytics. -
Audio retrieval with natural language queries
The AudioCaps and Clotho datasets were used to build baselines for text-based audio retrieval. -
QVHighlights: Detecting moments and highlights in videos via natural language...
QVHighlights: Detecting moments and highlights in videos via natural language queries