MSR-VTT is a large-scale dataset for video captioning. It contains 10k video clips, each accompanied by 20 human-edited English sentence descriptions,...
Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries. To date, most state-of-the-art TVR methods learn image-to-video transfer learning...
The dataset used in the paper is MSR-VTT, a large video description dataset for bridging video and language. The dataset contains 10k video clips with length varying from 10 to...
The UCF101 dataset contains 13,320 videos across 101 action categories. It differs from the datasets above in that it contains mostly coarse sports activities...