CAE v2: Context Autoencoder with CLIP Target

doi:10.57702/zytqwbtt

Masked image modeling (MIM) learns visual representations by masking image patches and reconstructing them. Applying the reconstruction supervision to CLIP representations has proven effective for MIM. However, how CLIP supervision in MIM influences performance remains under-explored.

BibTeX:

@dataset{Xinyu_Zhang_and_Jiahui_Chen_and_Junkun_Yuan_and_Qiang_Chen_and_Jian_Wang_and_Xiaodi_Wang_and_Shumin_Han_and_Xiaokang_Chen_and_Jimin_Pi_and_Kun_Yao_and_Junyu_Han_and_Errui_Ding_and_Jingdong_Wang_2024,
    abstract = {Masked image modeling (MIM) learns visual representations by masking image patches and reconstructing them. Applying the reconstruction supervision to CLIP representations has proven effective for MIM. However, how CLIP supervision in MIM influences performance remains under-explored.},
    author = {Xinyu Zhang and Jiahui Chen and Junkun Yuan and Qiang Chen and Jian Wang and Xiaodi Wang and Shumin Han and Xiaokang Chen and Jimin Pi and Kun Yao and Junyu Han and Errui Ding and Jingdong Wang},
    doi = {10.57702/zytqwbtt},
    institution = {No Organization},
    keyword = {CLIP, Masked Image Modeling, Visual Representation},
    month = {dec},
    publisher = {TIB},
    title = {CAE v2: Context Autoencoder with CLIP Target},
    url = {https://service.tib.eu/ldmservice/dataset/cae-v2--context-autoencoder-with-clip-target},
    year = {2024}
}