C4

doi:10.57702/0wpldwvq

C4 (the Colossal Clean Crawled Corpus) is a dataset for pre-training language models, consisting of a large collection of text documents.

Data and Resources

Original Metadata (JSON)
The JSON representation of the dataset and its distributions, based on DCAT.

Cite this as

Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, Matt Gardner (2024). Dataset: C4. https://doi.org/10.57702/0wpldwvq

DOI retrieved: December 3, 2024

Additional Info

Field	Value
Created	December 3, 2024
Last update	December 3, 2024
Defined In	https://doi.org/10.48550/arXiv.2403.13485
Citation	https://doi.org/10.1145/3539618.3592030 https://doi.org/10.48550/arXiv.2309.03004
Author	Jesse Dodge
More Authors	Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, Matt Gardner
Homepage	https://huggingface.co/datasets/C4