Fine-tuned CLIP Models are Efficient Video Learners

This work explores the capability of a simple baseline, ViFi-CLIP (Video Fine-tuned CLIP), for adapting image-based CLIP to the video domain.

BibTeX: