CAE v2: Context Autoencoder with CLIP Target

Masked image modeling (MIM) learns visual representations by masking image patches and reconstructing them. Applying the reconstruction supervision to CLIP representations has proven effective for MIM. However, how CLIP supervision in MIM influences performance remains under-explored.
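The masking-and-reconstruction idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. Random vectors stand in for per-patch CLIP features, the cosine-distance loss is an assumed choice, and the names `sample_mask` and `masked_feature_loss` are hypothetical. Computing the loss only at masked positions is also an assumption, since where to apply the supervision is part of what the abstract calls under-explored.

```python
import math
import random


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + 1e-12)


def sample_mask(num_patches, mask_ratio, rng):
    """Pick a random subset of patch indices to mask (True = masked)."""
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_feature_loss(predicted, target, mask):
    """Average cosine distance over masked patches only (assumed loss)."""
    losses = [cosine_distance(p, t)
              for p, t, m in zip(predicted, target, mask) if m]
    return sum(losses) / len(losses) if losses else 0.0


rng = random.Random(0)
num_patches, dim = 16, 8

# Placeholder targets: in a real setup these would come from a frozen
# CLIP image encoder; random vectors are used here (assumption).
target = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
mask = sample_mask(num_patches, 0.5, rng)

# A perfect prediction gives (near) zero loss; a noisy one gives more.
perfect = masked_feature_loss(target, target, mask)
noisy_pred = [[x + rng.gauss(0, 1) for x in row] for row in target]
noisy = masked_feature_loss(noisy_pred, target, mask)
print(sum(mask), perfect < noisy)
```

In this sketch, changing predictions at unmasked positions leaves the loss unchanged; extending the supervision to visible patches would mean dropping the `if m` filter.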

Cite this as

Xinyu Zhang, Jiahui Chen, Junkun Yuan, Qiang Chen, Jian Wang, Xiaodi Wang, Shumin Han, Xiaokang Chen, Jimin Pi, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang (2024). Dataset: CAE v2: Context Autoencoder with CLIP Target. https://doi.org/10.57702/zytqwbtt

DOI retrieved: December 3, 2024

Additional Info

Field          Value
Created        December 3, 2024
Last update    December 3, 2024
Author         Xinyu Zhang
More Authors   Jiahui Chen, Junkun Yuan, Qiang Chen, Jian Wang, Xiaodi Wang, Shumin Han, Xiaokang Chen, Jimin Pi, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang