The Pile

The Pile dataset contains 3.5 million samples of diverse text for language modeling.

Data and Resources

Original Metadata (JSON)
The JSON representation of the dataset and its distributions, based on DCAT.

Jaap Jumelet, Lisa Bylinina, Willem Zuidema, Jakub Szymanik (2024). Dataset: The Pile. https://doi.org/10.57702/q45kb0rx

DOI retrieved: December 16, 2024

Field	Value
Created	December 16, 2024
Last update	December 16, 2024
Defined In	https://doi.org/10.48550/arXiv.2403.00952
Citation	https://doi.org/10.48550/arXiv.2407.02136, https://doi.org/10.48550/arXiv.2402.19406
Author	Jaap Jumelet
More Authors	Lisa Bylinina, Willem Zuidema, Jakub Szymanik
Homepage	https://github.com/jumelet/lm-adjorder