OSCAR 22.01

doi:10.57702/x46uqzai

OSCAR 22.01 is a document-oriented, multilingual corpus used for pre-training large generative language models. Individual documents in the corpus can contain lines in more than one language.

Data and Resources

Original Metadata (JSON)
The JSON representation of the dataset and its distributions, based on DCAT.

Cite this as

Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot (2024). Dataset: OSCAR 22.01. https://doi.org/10.57702/x46uqzai

DOI retrieved: December 2, 2024

Additional Info

Field	Value
Created	December 2, 2024
Last update	December 2, 2024
Defined In	https://doi.org/10.48550/arXiv.2201.06642
Author	Julien Abadji
More Authors	Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot
Homepage	https://oscar-corpus.com