OSCAR 22.01

OSCAR 22.01 is a document-oriented, multilingual corpus used for pre-training large generative language models. Individual documents in the corpus may contain lines in more than one language.

Cite this as

Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot (2024). Dataset: OSCAR 22.01. https://doi.org/10.57702/x46uqzai

DOI retrieved: December 2, 2024

Additional Info

Field         Value
Created       December 2, 2024
Last update   December 2, 2024
Defined In    https://doi.org/10.48550/arXiv.2201.06642
Author        Julien Abadji
More Authors  Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot
Homepage      https://oscar-corpus.com