Language Modeling - Groups

The Pile

The Pile dataset contains 3.5 million samples of diverse text for language modeling.

Dataset
JSON

OSCAR

The OSCAR corpus is a multilingual web corpus that is used for pre-training large generative language models. It is a document-oriented corpus that is comparable in size and...

Dataset
JSON

Common Crawl

The Common Crawl (CC) project crawls and indexes publicly available web content. It generates 200-300 TiB of data per month (around 5% of which is in French), and constitutes the...

Dataset
JSON
