OSCAR 22.01 is a multilingual, document-oriented corpus used for pre-training large generative language models. A single document in the corpus can contain lines in more than one language.
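To make the idea of a document holding lines in several languages concrete, here is a minimal Python sketch. The record layout, the field names `content` and `line_languages`, and the helper `language_counts` are assumptions made for illustration; they are not the corpus's official schema or API.

```python
from collections import Counter

# Hypothetical record: the field names below are illustrative assumptions,
# not the official OSCAR 22.01 schema.
document = {
    "content": "Hello world.\nBonjour le monde.\nThis is a second English line.",
    "line_languages": ["en", "fr", "en"],  # one assumed language label per line
}


def language_counts(doc):
    """Count how many lines of a document carry each language label."""
    lines = doc["content"].split("\n")
    labels = doc["line_languages"]
    if len(lines) != len(labels):
        raise ValueError("expected exactly one language label per line")
    return Counter(labels)


counts = language_counts(document)
print(counts.most_common())
```

A per-document count like this is one way to decide whether a document is effectively monolingual or should be treated as mixed-language before it is used for pre-training.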