- FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos
Fine-grained adaptation of the popular CLIP model across multiple datasets.
- Devil in the Number: Towards Robust Multi-modality Data Filter
The paper uses a web-scale dataset of text-image pairs for training a vision-language model, and the authors propose a novel filter to...