-
The Pile: An 800GB dataset of diverse text for language modeling
The Pile is an 800GB dataset of diverse text for language modeling.
-
SlimPajama: A 627B token cleaned and deduplicated version of RedPajama