MSRVTT

MSRVTT is a large-scale dataset for video captioning. It contains 10K video clips, each accompanied by 20 human-edited English sentence descriptions, for a total of 200K video-caption pairs.

Cite this as

Yitian Yuan, Lin Ma, Wenwu Zhu (2024). Dataset: MSRVTT. https://doi.org/10.57702/2sfaor1e

DOI retrieved: December 3, 2024

Additional Info

Created: December 3, 2024
Last update: December 3, 2024
Defined In: https://doi.org/10.48550/arXiv.2105.08276
Citation: https://doi.org/10.48550/arXiv.2112.01062
Author: Yitian Yuan
More Authors: Lin Ma, Wenwu Zhu
Homepage: https://github.com/yytzsy/Syntax-Customized-Video-Captioning