MSR-VTT
The dataset used in the paper is MSR-VTT, a large-scale video description dataset for bridging video and language. It contains 10k video clips, each 10 to 32 seconds long, and every clip is paired with 20 captions for training.
BibTeX: