MSR-VTT

doi:10.57702/hi8ky096

The dataset used in the paper is MSR-VTT, a large-scale video description dataset for bridging video and language. It contains 10k video clips, each 10 to 32 seconds long, and every clip comes with 20 related captions for training.

BibTeX:

@dataset{Jun_Xu_and_Tao_Mei_and_Ting_Yao_and_Yong_Rui_2024,
    abstract = {The dataset used in the paper is MSR-VTT, a large video description dataset for bridging video and language. The dataset contains 10k video clips with length varying from 10 to 32 seconds, and each video is provided with 20 related captions for training.},
    author = {Jun Xu and Tao Mei and Ting Yao and Yong Rui},
    doi = {10.57702/hi8ky096},
    institution = {No Organization},
    keyword = {Captioning, Description, Language, Large-Scale Dataset, MSR-VTT, Natural Language, Text, Text Retrieval, Text-to-Video Generation, Video, Video Description, Video retrieval, Video-Text Retrieval, attention mechanisms, bridging video and language, captioning, captions, cross-modal learning, language, large-scale video dataset, multimodal learning, natural language, retrieval, speech recognition, text, text-based video retrieval, text-to-video retrieval, video, video analysis, video caption, video captioning, video classification, video description, video generation, video question answering, video understanding, video-language, video-text retrieval},
    month = {dec},
    publisher = {TIB},
    title = {MSR-VTT},
    url = {https://service.tib.eu/ldmservice/dataset/msr-vtt},
    year = {2024}
}