-
CAE v2: Context Autoencoder with CLIP Target
Masked image modeling (MIM) learns visual representation by masking and reconstructing image patches. Applying the reconstruction supervision on the CLIP representation has been...
-
MST: Masked Self-Supervised Transformer for Visual Representation
The proposed method is a self-supervised learning approach for visual representation learning, which can explicitly capture the local context of an image while preserving the...