AVSD dataset

The AVSD dataset is a benchmark for audio-visual scene-aware dialog. It consists of 7,659 training, 734 prototype validation, and 733 prototype testing dialogs. In each dialog, the Questioner has access only to the first, middle, and last static frames of the video, while the Answerer has access to the entire video, including the audio stream and the original input descriptions.
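
The sketch below shows how one might load and inspect a split of such a dataset in Python. It is a minimal illustration only: the file name and the field names ("dialogs", "caption", "dialog", "question", "answer") are assumptions about a JSON layout in the style of the DSTC7 AVSD release and may not match the official files.

    import json

    # Minimal sketch: load an AVSD-style dialog file and inspect it.
    # The path and field names below are assumptions, not the official schema.
    def load_avsd_split(path):
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
        return data.get("dialogs", [])

    if __name__ == "__main__":
        train_dialogs = load_avsd_split("avsd_train.json")  # hypothetical filename
        print(f"training dialogs: {len(train_dialogs)}")    # expected around 7,659
        if train_dialogs:
            first = train_dialogs[0]
            # Each dialog pairs one video with a caption and a sequence of QA turns.
            print("caption:", first.get("caption", ""))
            for turn in first.get("dialog", [])[:3]:
                print("Q:", turn.get("question"))
                print("A:", turn.get("answer"))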

Cite this as

Ye Zhu, Yu Wu, Yi Yang, Yan Yan (2025). Dataset: AVSD dataset. https://doi.org/10.57702/izroz79p

DOI retrieved: January 3, 2025

Additional Info

Field         Value
Created       January 3, 2025
Last update   January 3, 2025
Defined In    https://doi.org/10.48550/arXiv.1905.02442
Citation      https://doi.org/10.48550/arXiv.2106.14069
Author        Ye Zhu
More Authors  Yu Wu, Yi Yang, Yan Yan
Homepage      https://arxiv.org/abs/1904.09635