Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies
This paper investigates the performance of Contrastive Language-Image Pre-training (CLIP) when it is scaled down to limited computation budgets.
ClipCap: CLIP Prefix for Image Captioning
Image captioning is a fundamental task in vision-language understanding, where the model predicts an informative textual caption for a given input image.
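ClipCap's central idea is to map a CLIP image embedding into a short prefix of pseudo-token embeddings that conditions a language model to generate the caption. The sketch below illustrates only that mapping step; the MLP, dimensions, and prefix length are illustrative assumptions, not the paper's exact mapping network.

```python
import torch
import torch.nn as nn

class PrefixMapper(nn.Module):
    """Maps a CLIP image embedding to `prefix_len` pseudo-token
    embeddings for a language model (illustrative sketch)."""

    def __init__(self, clip_dim=512, lm_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim
        # A simple MLP stands in for the paper's mapping network.
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, lm_dim * prefix_len // 2),
            nn.Tanh(),
            nn.Linear(lm_dim * prefix_len // 2, lm_dim * prefix_len),
        )

    def forward(self, clip_embedding):           # (batch, clip_dim)
        prefix = self.mlp(clip_embedding)         # (batch, lm_dim * prefix_len)
        return prefix.view(-1, self.prefix_len, self.lm_dim)

# Usage: the resulting prefix would be concatenated with caption token
# embeddings and fed to a language model such as GPT-2.
mapper = PrefixMapper()
fake_clip_embedding = torch.randn(4, 512)   # stand-in for CLIP output
prefix_tokens = mapper(fake_clip_embedding)  # (4, 10, 768)
```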
Random Word Data Augmentation for Zero-Shot Anomaly Detection
This paper presents a novel method that leverages a vision-language model, CLIP, as a data source for zero-shot anomaly detection.
Learning Robust 3D Representation from CLIP via Dual Denoising
CLIP-Diffusion-LM: Apply Diffusion Model on Image Captioning
The image captioning task has been extensively researched in prior work; however, few experiments focus on generating captions with a non-autoregressive text decoder.
MOSAIC: Multi-Object Segmented Arbitrary Stylization Using CLIP
A dataset for multi-object segmented arbitrary stylization using CLIP.
GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
Video-text retrieval plays an essential role in multi-modal research and is widely used in real-world web applications. This work is an empirical study of how the CLIP (Contrastive Language-Image Pre-training) model transfers to end-to-end video clip retrieval.
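The simplest similarity calculator studied in CLIP4Clip aggregates per-frame CLIP features into a single video embedding by parameter-free mean pooling and compares it with the text embedding. A toy sketch of that computation, using random tensors in place of real CLIP features:

```python
import torch
import torch.nn.functional as F

def video_text_similarity(frame_features, text_feature):
    """Parameter-free similarity in the spirit of CLIP4Clip's
    mean-pooling variant: average per-frame CLIP features into one
    video embedding, then take its cosine similarity with the text."""
    video_feature = F.normalize(frame_features.mean(dim=0), dim=-1)  # (D,)
    text_feature = F.normalize(text_feature, dim=-1)                 # (D,)
    return video_feature @ text_feature                              # scalar

# Toy usage: 16 frames of 512-d features vs. one 512-d text embedding.
sim = video_text_similarity(torch.randn(16, 512), torch.randn(512))
```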
DreamStone: Image as a Stepping Stone for Text-Guided 3D Shape Generation
A text-guided 3D shape generation approach that uses CLIP together with a pre-trained single-view reconstruction model.
MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation
Semantic segmentation performs pixel-level classification to localize objects from different classes in the input image. Open-vocabulary semantic segmentation aims to extend this to arbitrary classes described in natural language.
Fine-tuned CLIP Models are Efficient Video Learners
This work explores the capability of a simple baseline called ViFi-CLIP (Video Fine-tuned CLIP) for adapting image-based CLIP to the video domain.
CAE v2: Context Autoencoder with CLIP Target
Masked image modeling (MIM) learns visual representations by masking and reconstructing image patches. Applying the reconstruction supervision on CLIP representations has proven effective for this task.
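A minimal sketch of the kind of objective this describes: regress the features of masked patches onto targets produced by a frozen CLIP vision encoder, computing the loss only at masked positions. The cosine-distance formulation here is an assumption for illustration, not CAE v2's exact loss.

```python
import torch
import torch.nn.functional as F

def mim_clip_target_loss(pred_features, clip_features, mask):
    """Masked image modeling loss with CLIP features as the target.

    pred_features: (B, N, D) student predictions for all N patches
    clip_features: (B, N, D) patch features from a frozen CLIP encoder
    mask:          (B, N) boolean, True where the patch was masked
    """
    # Normalize and penalize cosine distance only on masked patches.
    pred = F.normalize(pred_features, dim=-1)
    target = F.normalize(clip_features, dim=-1)
    per_patch = 1.0 - (pred * target).sum(dim=-1)   # (B, N)
    return per_patch[mask].mean()

# Toy usage with random tensors standing in for real features.
B, N, D = 2, 196, 768
loss = mim_clip_target_loss(
    torch.randn(B, N, D),
    torch.randn(B, N, D),
    torch.rand(B, N) < 0.4,   # ~40% of patches masked
)
```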
NeuroCLIP: Neuromorphic Data Understanding by CLIP and SNN
PartSeg: Few-shot Part Segmentation via Part-aware Prompt Learning
Few-shot part segmentation using a small set of support images and the pre-trained image-language model CLIP.
Training CLIP models on Data from Scientific Papers
Contrastive Language-Image Pretraining (CLIP) models are trained with datasets extracted from web crawls, which are of large quantity but limited quality. This paper explores training CLIP models on data extracted from scientific papers instead.
CLIP dataset
A dataset of image-text pairs used to train a contrastive language-image model.
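For reference, the contrastive objective CLIP is trained with on such image-text pairs is a symmetric cross-entropy over cosine similarities, as given in pseudocode in the original CLIP paper. A minimal PyTorch rendering (the fixed temperature here is an illustrative choice; CLIP actually learns it):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs,
    following the pseudocode in the CLIP paper."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = image_features @ text_features.t() / temperature  # (B, B)
    labels = torch.arange(logits.size(0))   # i-th image matches i-th text
    loss_i2t = F.cross_entropy(logits, labels)
    loss_t2i = F.cross_entropy(logits.t(), labels)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random 512-d embeddings for a batch of 8 pairs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```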
AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars