-
ActivityNet-QA
Video question answering (VideoQA) is an essential task in vision-language understanding that has attracted considerable research attention recently. Nevertheless, existing works...
-
InternVid: A Large-Scale Video-Text Dataset for Multimodal Understanding and ...
InternVid: A large-scale video-text dataset for multimodal understanding and generation.
-
YouTube-8M
YouTube-8M is a large-scale video classification benchmark.