MSR-VTT and UCF-101
The paper uses two public video datasets: MSR-VTT and UCF-101. MSR-VTT is a video-text dataset containing 4,900 videos, each paired with 20 manually annotated captions; UCF-101 is an action recognition dataset containing 13,320 videos spanning 101 action categories.
BibTeX: