MSVD

doi:10.57702/x9xwguf4

Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries. To date, most state-of-the-art TVR methods rely on image-to-video transfer learning built on large-scale pre-trained vision-language models (e.g., CLIP). However, fully fine-tuning these pre-trained models for TVR incurs prohibitively expensive computational costs.

BibTeX:

@dataset{Meng_Cao_and_Long_Chen_and_Mike_Zheng_Shou_and_Can_Zhang_and_Yuexian_Zou_2024,
    abstract = {Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries. To date, most state-of-the-art TVR methods rely on image-to-video transfer learning built on large-scale pre-trained vision-language models (e.g., CLIP). However, fully fine-tuning these pre-trained models for TVR incurs prohibitively expensive computational costs.},
    author = {Meng Cao and Long Chen and Mike Zheng Shou and Can Zhang and Yuexian Zou},
    doi = {10.57702/x9xwguf4},
    institution = {No Organization},
    keyword = {'Captioning', 'Description', 'English', 'Evaluation', 'MSVD', 'Paraphrase', 'Paraphrase Evaluation', 'Text Retrieval', 'Text Similarity', 'Video', 'Video Description', 'Video-Text', 'activity recognition', 'fill-in-the-blank', 'multimodal learning', 'multiple sentence descriptions', 'question answering', 'retrieval', 'text', 'video', 'video caption retrieval', 'video captioning', 'video description', 'video question answering', 'video understanding'},
    month = {dec},
    publisher = {TIB},
    title = {MSVD},
    url = {https://service.tib.eu/ldmservice/dataset/msvd},
    year = {2024}
}