MSRVTT

doi:10.57702/2sfaor1e

MSRVTT is a large-scale dataset for video captioning. It contains 10K video clips, each accompanied by 20 human-edited English sentence descriptions, for a total of 200K video-caption pairs.

Data and Resources

Original Metadata (JSON)
The JSON representation of the dataset with its distributions based on DCAT.

Cite this as

Yitian Yuan, Lin Ma, Wenwu Zhu (2024). Dataset: MSRVTT. https://doi.org/10.57702/2sfaor1e

DOI retrieved: December 3, 2024

Additional Info

Field	Value
Created	December 3, 2024
Last update	December 3, 2024
Defined In	https://doi.org/10.48550/arXiv.2105.08276
Citation	https://doi.org/10.48550/arXiv.2112.01062
Author	Yitian Yuan
More Authors	Lin Ma, Wenwu Zhu
Homepage	https://github.com/yytzsy/Syntax-Customized-Video-Captioning