The Pile: An 800GB dataset of diverse text for language modeling

The Pile is an 800GB dataset of diverse text for language modeling.

Cite this as

Gao et al. (2025). Dataset: The Pile: An 800GB dataset of diverse text for language modeling. https://doi.org/10.57702/uoguhdys

DOI retrieved: January 2, 2025

Additional Info

Field        Value
Created      January 2, 2025
Last update  January 2, 2025
Defined In   https://doi.org/10.48550/arXiv.2403.08763
Author       Gao et al.
Homepage     https://arxiv.org/abs/2101.00027