
Training CLIP models on Data from Scientific Papers

Contrastive Language-Image Pretraining (CLIP) models are trained on datasets extracted from web crawls, which are large in quantity but limited in quality. This paper explores whether a limited amount of higher-quality data from a specific domain improves the general performance of CLIP models.
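As a rough illustration of the kind of setup the paper investigates (not the released training code), the sketch below combines a large web-crawl dataset with a small, higher-quality dataset of pairs extracted from scientific papers for contrastive training. The tensors, dataset sizes, and oversampling weights are illustrative stand-ins, not values from the paper.

# Minimal sketch, assuming PyTorch; random tensors stand in for preprocessed
# (image, tokenized caption) pairs from each data source.
import torch
from torch.utils.data import (ConcatDataset, DataLoader, TensorDataset,
                              WeightedRandomSampler)

# Stand-ins: large noisy web-crawl data vs. a small high-quality domain subset.
web_crawl = TensorDataset(torch.randn(1000, 512), torch.randint(0, 49408, (1000, 77)))
papers    = TensorDataset(torch.randn(50, 512),   torch.randint(0, 49408, (50, 77)))

combined = ConcatDataset([web_crawl, papers])

# Oversample the small, high-quality subset so each batch mixes both sources
# (an illustrative choice, not necessarily the paper's mixing strategy).
weights = torch.cat([
    torch.full((len(web_crawl),), 1.0),
    torch.full((len(papers),), len(web_crawl) / len(papers)),
])
sampler = WeightedRandomSampler(weights, num_samples=len(combined), replacement=True)

loader = DataLoader(combined, batch_size=64, sampler=sampler)
for images, tokens in loader:
    pass  # feed each batch to the image/text encoders and the contrastive loss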

Data and Resources

This dataset has no data

Cite this as

Calvin Metzger (2024). Dataset: Training CLIP models on Data from Scientific Papers. https://doi.org/10.57702/vaozuc3r

Private DOI: this DOI is not yet resolvable. It is available for use in manuscripts and will be published when the dataset is made public.

Additional Info

Created: December 2, 2024
Last update: December 2, 2024
Defined in: https://doi.org/10.48550/arXiv.2311.04711
Author: Calvin Metzger
Homepage: https://github.com/nopperl/clip_arxiv_pmc