MSVD

Text-Video Retrieval (TVR) aims to match relevant video content to natural language queries. To date, most state-of-the-art TVR methods rely on image-to-video transfer learning from large-scale pre-trained vision-language models (e.g., CLIP). However, fully fine-tuning these pre-trained models for TVR is prohibitively expensive computationally.

Data and Resources

This dataset has no data

Cite this as

Meng Cao, Long Chen, Mike Zheng Shou, Can Zhang, Yuexian Zou (2024). Dataset: MSVD. https://doi.org/10.57702/x9xwguf4

Private DOI: This DOI is not yet resolvable. It is available for use in manuscripts and will be published when the dataset is made public.

Additional Info

Field Value
Created December 2, 2024
Last update December 2, 2024
Defined In https://doi.org/10.1109/TMM.2018.2832602
Citation
  • https://doi.org/10.48550/arXiv.2307.09972
  • https://doi.org/10.48550/arXiv.2111.12476
  • https://doi.org/10.48550/arXiv.2209.13853
  • https://doi.org/10.48550/arXiv.2306.11341
  • https://doi.org/10.48550/arXiv.1501.02530
  • https://doi.org/10.1609/aaai.v37i3.25483
  • https://doi.org/10.48550/arXiv.2105.08276
Author Meng Cao
More Authors Long Chen, Mike Zheng Shou, Can Zhang, Yuexian Zou
Homepage https://www.msvd.org/