The Pile

The Pile dataset contains 3.5 million samples of diverse text for language modeling.

Cite this as

Jaap Jumelet, Lisa Bylinina, Willem Zuidema, Jakub Szymanik (2024). Dataset: The Pile. https://doi.org/10.57702/q45kb0rx

DOI retrieved: December 16, 2024

Additional Info

Field          Value
Created        December 16, 2024
Last update    December 16, 2024
Defined In     https://doi.org/10.48550/arXiv.2403.00952
Citation       https://doi.org/10.48550/arXiv.2407.02136
               https://doi.org/10.48550/arXiv.2402.19406
Author         Jaap Jumelet
More Authors   Lisa Bylinina, Willem Zuidema, Jakub Szymanik
Homepage       https://github.com/jumelet/lm-adjorder