MAD: A Large-Scale Benchmark for Long-Form Video Temporal Grounding

MAD: a large-scale benchmark for long-form video temporal grounding, containing over 384K natural language queries derived from high-quality audio descriptions of mainstream movies and grounded in over 1.2K hours of video, where each query's grounded moment covers only a small fraction of its video.

Data and Resources

Cite this as

Yulin Pan, Xiangteng He, Biao Gong, Yiliang Lv, Yujun Shen, Yuxin Peng, Deli Zhao (2025). Dataset: MAD: A Large-Scale Benchmark for Long-Form Video Temporal Grounding. https://doi.org/10.57702/9jkxnjlo

DOI retrieved: January 3, 2025

Additional Info

Created: January 3, 2025
Last update: January 3, 2025
Defined in: https://doi.org/10.1109/iccv51070.2023.01266
Author: Yulin Pan
More authors: Xiangteng He, Biao Gong, Yiliang Lv, Yujun Shen, Yuxin Peng, Deli Zhao
Homepage: https://github.com/afcedf/SOONet