ActivityNet-QA

Video question answering (VideoQA) is an essential task in vision-language understanding and has attracted considerable research attention in recent years. However, most existing methods achieve promising performance only on short videos of roughly 15 seconds or less. On minute-long videos, these methods tend to fail because they cannot handle the noise and redundancy introduced by scene changes and multiple actions within a single video.
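As a rough illustration of how the dataset's annotations might be consumed, the sketch below joins question and answer records on a shared identifier. The file names (train_q.json, train_a.json) and field names (question_id, video_name, question, answer) are assumptions about a typical ActivityNet-QA-style release, not details confirmed by this page.

```python
import json

# Assumed file names and fields; the actual release layout may differ.
QUESTION_FILE = "train_q.json"   # list of {"question_id", "video_name", "question"}
ANSWER_FILE = "train_a.json"     # list of {"question_id", "answer"}


def load_qa_pairs(question_path: str, answer_path: str):
    """Join question and answer records on question_id."""
    with open(question_path, "r", encoding="utf-8") as f:
        questions = json.load(f)
    with open(answer_path, "r", encoding="utf-8") as f:
        answers = json.load(f)

    # Index answers by question_id, then pair each question with its answer.
    answer_by_id = {a["question_id"]: a["answer"] for a in answers}
    pairs = []
    for q in questions:
        qid = q["question_id"]
        if qid in answer_by_id:
            pairs.append({
                "video_name": q["video_name"],
                "question": q["question"],
                "answer": answer_by_id[qid],
            })
    return pairs


if __name__ == "__main__":
    qa = load_qa_pairs(QUESTION_FILE, ANSWER_FILE)
    print(f"Loaded {len(qa)} QA pairs")
    if qa:
        print(qa[0])
```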

Data and Resources

Cite this as

Zhu Zhang, Chang Zhou, Jianxin Ma, Zhijie Lin, Jingren Zhou, Hongxia Yang, Zhou Zhao (2025). Dataset: ActivityNet-QA. https://doi.org/10.57702/ou8www8x

DOI retrieved: January 3, 2025

Additional Info

Field         Value
Created       January 3, 2025
Last update   January 3, 2025
Defined In    https://doi.org/10.48550/arXiv.2404.11865
Citation      • https://doi.org/10.5281/zenodo.8190086
              • https://doi.org/10.48550/arXiv.2210.02081
              • https://doi.org/10.48550/arXiv.2210.03941
              • https://doi.org/10.48550/arXiv.2106.01096
              • https://doi.org/10.48550/arXiv.2302.02136
Author        Zhu Zhang
More Authors  Chang Zhou, Jianxin Ma, Zhijie Lin, Jingren Zhou, Hongxia Yang, Zhou Zhao
Homepage      https://activitynet.org/