The Pile: An 800GB dataset of diverse text for language modeling

No License Provided

The Pile is an 800GB dataset of diverse text for language modeling.

Gao et al. (2025). Dataset: The Pile: An 800GB dataset of diverse text for language modeling. https://doi.org/10.57702/uoguhdys

DOI retrieved: January 2, 2025
