Dataset - LDM

Proof-Pile-2

A dataset for continual pre-training of large language models, with a focus on balancing the text distribution and mitigating overfitting.
- Dataset
- JSON

Open-Orca

A dataset for training large language models, with a focus on balancing the text distribution and mitigating overfitting.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).
