The OSCAR corpus is a multilingual web corpus used for pre-training large generative language models. It is document-oriented and is comparable to OSCAR 21.09 both in overall size and in how that size is distributed across languages.
BibTeX: