
ActivityNet-QA

Video question answering (VideoQA) is an essential task in vision-language understanding that has attracted considerable research attention in recent years. Nevertheless, most existing works achieve promising performance only on short videos, typically under 15 seconds in duration. On minute-long videos, these methods are likely to fail because they cannot cope with the noise and redundancy introduced by scene changes and multiple actions within a single video.
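As a rough illustration of the task format, the sketch below loads ActivityNet-QA-style annotations and scores a model's predictions with exact-match accuracy. The file layout (separate question and answer JSON files joined on a question ID) and the field names question_id, video_name, question, and answer mirror common public releases of the dataset; treat them as assumptions rather than as part of this record.

```python
import json

def load_qa_pairs(question_file, answer_file):
    # Assumed layout: each file holds a JSON list of dicts keyed by "question_id".
    with open(question_file) as f:
        questions = {q["question_id"]: q for q in json.load(f)}
    with open(answer_file) as f:
        answers = {a["question_id"]: a for a in json.load(f)}
    # Join questions and answers on question_id; each record ties a video
    # to one question-answer pair.
    return [
        {
            "question_id": qid,
            "video_name": questions[qid]["video_name"],
            "question": questions[qid]["question"],
            "answer": answers[qid]["answer"],
        }
        for qid in questions
        if qid in answers
    ]

def exact_match_accuracy(predictions, qa_pairs):
    # predictions: dict mapping question_id -> predicted answer string.
    correct = sum(
        predictions.get(qa["question_id"], "").strip().lower()
        == qa["answer"].strip().lower()
        for qa in qa_pairs
    )
    return correct / len(qa_pairs)
```

Exact-match accuracy is the usual metric for this dataset's open-ended answers; the normalization here (lowercasing and whitespace stripping) is a minimal choice, not necessarily the official evaluation protocol.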

Data and Resources

This dataset has no data

Cite this as

Zhu Zhang, Chang Zhou, Jianxin Ma, Zhijie Lin, Jingren Zhou, Hongxia Yang, Zhou Zhao (2025). Dataset: ActivityNet-QA. https://doi.org/10.57702/ou8www8x

Private DOI: This DOI is not yet resolvable. It is available for use in manuscripts and will be published when the dataset is made public.

Additional Info

Created: January 3, 2025
Last update: January 3, 2025
Defined in: https://doi.org/10.48550/arXiv.2404.11865
Citations:
  • https://doi.org/10.5281/zenodo.8190086
  • https://doi.org/10.48550/arXiv.2210.02081
  • https://doi.org/10.48550/arXiv.2210.03941
  • https://doi.org/10.48550/arXiv.2106.01096
  • https://doi.org/10.48550/arXiv.2302.02136
Authors: Zhu Zhang, Chang Zhou, Jianxin Ma, Zhijie Lin, Jingren Zhou, Hongxia Yang, Zhou Zhao
Homepage: https://activitynet.org/