
MSVD

Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries. To date, most state-of-the-art TVR methods rely on image-to-video transfer learning built on large-scale pre-trained vision-language models (e.g., CLIP). However, fully fine-tuning these pre-trained models for TVR incurs prohibitive computational costs.

Data and Resources

This dataset has no data

Cite this as

Meng Cao, Long Chen, Mike Zheng Shou, Can Zhang, Yuexian Zou (2024). Dataset: MSVD. https://doi.org/10.57702/x9xwguf4

Private DOI: This DOI is not yet resolvable.
It is available for use in manuscripts, and will be published when the Dataset is made public.

Additional Info

Field	Value
Created	December 2, 2024
Last update	December 2, 2024
Defined In	https://doi.org/10.1109/TMM.2018.2832602
Citation	https://doi.org/10.48550/arXiv.2307.09972 https://doi.org/10.48550/arXiv.2111.12476 https://doi.org/10.48550/arXiv.2209.13853 https://doi.org/10.48550/arXiv.2306.11341 https://doi.org/10.48550/arXiv.1501.02530 https://doi.org/10.1609/aaai.v37i3.25483 https://doi.org/10.48550/arXiv.2105.08276
Author	Meng Cao
More Authors	Long Chen, Mike Zheng Shou, Can Zhang, Yuexian Zou
Homepage	https://www.msvd.org/