C4

C4 (Colossal Clean Crawled Corpus) is a large collection of web text documents used for pre-training language models.

Cite this as

Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, Matt Gardner (2024). Dataset: C4. https://doi.org/10.57702/0wpldwvq

DOI retrieved: December 3, 2024

Additional Info

Field Value
Created December 3, 2024
Last update December 3, 2024
Defined In https://doi.org/10.48550/arXiv.2403.13485
Citation
  • https://doi.org/10.1145/3539618.3592030
  • https://doi.org/10.48550/arXiv.2309.03004
Author Jesse Dodge
More Authors
Maarten Sap
Ana Marasović
William Agnew
Gabriel Ilharco
Dirk Groeneveld
Margaret Mitchell
Matt Gardner
Homepage https://huggingface.co/datasets/C4