Proof-Pile-2

Proof-Pile-2 is a dataset used for continual pre-training of large language models, with a focus on balancing the text distribution and mitigating overfitting.

BibTeX: