MSR-VTT and UCF-101

The paper uses two public datasets, MSR-VTT and UCF-101. MSR-VTT is a video captioning benchmark containing 10,000 video clips, each paired with 20 manually annotated captions, while UCF-101 is an action recognition dataset containing 13,320 videos spanning 101 action categories.
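Since MSR-VTT ships multiple captions per clip, downstream code typically groups the caption annotations by video id. Below is a minimal sketch of that step, assuming the common MSR-VTT annotation JSON layout (a `sentences` list whose entries carry `video_id` and `caption` fields); the sample records here are illustrative, not taken from the dataset.

```python
import json
from collections import defaultdict

# Illustrative excerpt mimicking the MSR-VTT annotation layout:
# each video id appears in several caption records.
annotations = json.loads("""
{
  "sentences": [
    {"video_id": "video0", "caption": "a man is singing"},
    {"video_id": "video0", "caption": "a person performs a song"},
    {"video_id": "video1", "caption": "a dog runs in a field"}
  ]
}
""")

# Group captions by video id, yielding one caption list per clip.
captions_by_video = defaultdict(list)
for sent in annotations["sentences"]:
    captions_by_video[sent["video_id"]].append(sent["caption"])

print(len(captions_by_video["video0"]))  # → 2
```

In the full dataset every clip would map to 20 captions rather than the two shown here.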

BibTeX: