ActivityNet-QA

doi:10.57702/ou8www8x

Video question answering (VideoQA) is an essential task in vision-language understanding and has attracted considerable research attention recently. Nevertheless, existing methods mostly perform well only on short videos of 15 seconds or less. On minute-level long-term videos, those methods are likely to fail because they lack the ability to handle the noise and redundancy caused by scene changes and multiple actions in the video.

BibTeX:

@dataset{Zhu_Zhang_and_Chang_Zhou_and_Jianxin_Ma_and_Zhijie_Lin_and_Jingren_Zhou_and_Hongxia_Yang_and_Zhou_Zhao_2025,
    abstract = {Video question answering (VideoQA) is an essential task in vision-language understanding, which has attracted numerous research attention recently. Nevertheless, existing works mostly achieve promising performances on short videos of duration within 15 seconds. For VideoQA on minute-level long-term videos, those methods are likely to fail because of lacking the ability to deal with noise and redundancy caused by scene changes and multiple actions in the video.},
    author = {Zhu Zhang and Chang Zhou and Jianxin Ma and Zhijie Lin and Jingren Zhou and Hongxia Yang and Zhou Zhao},
    doi = {10.57702/ou8www8x},
    institution = {No Organization},
    keyword = {ActivityNet-QA, Multimodal Learning, Question Answering, Video Question Answering, VideoQA, activity recognition, activitynet-qa, long sequences, minute-level long-term videos, multiple actions, noise and redundancy, scene changes, temporal modeling, video question answering},
    month = {jan},
    publisher = {TIB},
    title = {ActivityNet-QA},
    url = {https://service.tib.eu/ldmservice/dataset/activitynet-qa},
    year = {2025}
}