The Pile

The Pile dataset contains 3.5 million samples of diverse text for language modeling.

BibTeX: