CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

Video-text retrieval plays an essential role in multi-modal research and has been widely used in many real-world web applications. CLIP (Contrastive Language-Image Pre-training), an image-language pre-training model, has demonstrated the power of learning visual concepts from web-collected image-text datasets. The paper proposes the CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner.
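As a rough illustration of the retrieval setup the paper studies, the sketch below shows the simplest ("parameter-free") similarity variant: each sampled frame is encoded with CLIP's image encoder, the frame embeddings are mean-pooled into a single video embedding, and videos are ranked by cosine similarity against the CLIP text embedding of the query. This is a minimal sketch, assuming the openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers; the random "frames" and all variable names are illustrative stand-ins, not the repository's actual code.

```python
# Minimal sketch of mean-pooled CLIP video-text similarity (CLIP4Clip's
# "parameter-free" variant). Frame extraction is stubbed with random
# images purely for illustration.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Stand-in for frames sampled uniformly from a video clip (hypothetical data).
frames = [Image.fromarray(np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8))
          for _ in range(12)]
query = "a man is playing guitar on stage"

with torch.no_grad():
    # Encode every sampled frame independently with CLIP's image encoder.
    pixel_values = processor(images=frames, return_tensors="pt").pixel_values
    frame_emb = model.get_image_features(pixel_values=pixel_values)  # (num_frames, dim)
    frame_emb = frame_emb / frame_emb.norm(dim=-1, keepdim=True)

    # Parameter-free aggregation: mean-pool frame embeddings into one video embedding.
    video_emb = frame_emb.mean(dim=0)
    video_emb = video_emb / video_emb.norm()

    # Encode the text query and score the pair by cosine similarity.
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    similarity = (text_emb @ video_emb).item()

print(f"text-video cosine similarity: {similarity:.4f}")
```

In the paper this mean-pooling aggregation is compared against sequential (LSTM/Transformer over frame features) and tight cross-modal similarity heads; the mean-pooled variant adds no parameters beyond CLIP itself.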


Cite this as

Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, Tianrui Li (2024). Dataset: CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval. https://doi.org/10.57702/ebb207d9

DOI retrieved: December 16, 2024

Additional Info

Created: December 16, 2024
Last update: December 16, 2024
Defined in: https://doi.org/10.48550/arXiv.2104.08860
Author: Huaishao Luo
More authors: Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, Tianrui Li
Homepage: https://github.com/ArrowLuo/CLIP4Clip