-
WikiText-103 and MusDB datasets
The paper does not explicitly name the dataset, but it notes that the authors trained a 16-layer Transformer (Vaswani et al., 2017) based language model on... -
LibriSpeech LM
The LibriSpeech LM corpus used for pre-training speech-text models. -
Morfessor 2.0 dataset
Morfessor 2.0 dataset for English, Finnish and Turkish language models -
Den samiske tekstbanken dataset
Den samiske tekstbanken dataset for North Sámi language model -
Morpho Challenge 2010 dataset
Morpho Challenge 2010 dataset for English, Finnish and Turkish language models -
OpenWebText Corpus
A dataset for language modeling, where the goal is to predict the next word in a sequence given the previous words. -
One Billion Words Dataset
A dataset for language modeling, where the goal is to predict the next word in a sequence given the previous words. -
OSCAR 22.01
The OSCAR 22.01 corpus is a document-oriented corpus that is used for pre-training large generative language models. It is a multilingual corpus that contains documents holding... -
WikiText-103 dataset
The dataset used in this paper is the WikiText-103 dataset, a corpus of over 100 million tokens drawn from Wikipedia articles.
Common Crawl
The Common Crawl (CC) project browses and indexes all content available online. It generates 200-300 TiB of data per month (around 5% of which is in French), and constitutes the...