-
OSCAR 22.01
The OSCAR 22.01 corpus is a multilingual, document-oriented corpus used for pre-training large generative language models. It contains documents holding...
-
Common Crawl
The Common Crawl (CC) project crawls and indexes publicly available web content. It generates 200-300 TiB of data per month (around 5% of which is in French) and constitutes the...