ActivityNet-QA

Video question answering (VideoQA) is an essential task in vision-language understanding and has attracted considerable research attention in recent years. However, most existing methods achieve promising performance only on short videos of roughly 15 seconds or less. On minute-long videos, these methods tend to fail because they cannot handle the noise and redundancy introduced by scene changes and multiple actions within a single video.
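As a rough illustration of how the dataset's annotations might be consumed, the sketch below joins question and answer records on a shared identifier. The file names (train_q.json, train_a.json) and field names (question_id, video_name, question, answer) are assumptions about a typical ActivityNet-QA-style release, not details confirmed by this page.

```python
import json

# Assumed file names and fields; the actual release layout may differ.
QUESTION_FILE = "train_q.json"   # list of {"question_id", "video_name", "question"}
ANSWER_FILE = "train_a.json"     # list of {"question_id", "answer"}


def load_qa_pairs(question_path: str, answer_path: str):
    """Join question and answer records on question_id."""
    with open(question_path, "r", encoding="utf-8") as f:
        questions = json.load(f)
    with open(answer_path, "r", encoding="utf-8") as f:
        answers = json.load(f)

    # Index answers by question_id, then pair each question with its answer.
    answer_by_id = {a["question_id"]: a["answer"] for a in answers}
    pairs = []
    for q in questions:
        qid = q["question_id"]
        if qid in answer_by_id:
            pairs.append({
                "video_name": q["video_name"],
                "question": q["question"],
                "answer": answer_by_id[qid],
            })
    return pairs


if __name__ == "__main__":
    qa = load_qa_pairs(QUESTION_FILE, ANSWER_FILE)
    print(f"Loaded {len(qa)} QA pairs")
    if qa:
        print(qa[0])
```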

Data and Resources

Cite this as

Zhu Zhang, Chang Zhou, Jianxin Ma, Zhijie Lin, Jingren Zhou, Hongxia Yang, Zhou Zhao (2025). Dataset: ActivityNet-QA. https://doi.org/10.57702/ou8www8x

DOI retrieved: January 3, 2025

Additional Info

Field         Value
Created       January 3, 2025
Last update   January 3, 2025
Defined In    https://doi.org/10.48550/arXiv.2404.11865
Citation      • https://doi.org/10.5281/zenodo.8190086
              • https://doi.org/10.48550/arXiv.2210.02081
              • https://doi.org/10.48550/arXiv.2210.03941
              • https://doi.org/10.48550/arXiv.2106.01096
              • https://doi.org/10.48550/arXiv.2302.02136
Author        Zhu Zhang
More Authors  Chang Zhou, Jianxin Ma, Zhijie Lin, Jingren Zhou, Hongxia Yang, Zhou Zhao
Homepage      https://activitynet.org/