EPIC: Leveraging Per Image-Token Consistency for Vision-Language Pre-training

EPIC is a vision-language pre-training approach that uses per image-token consistency so that more text tokens contribute to learning vision-language associations.

BibTeX: